Con2Mix: A semi-supervised method for imbalanced tabular security data

Abstract

Con2Mix (Contrastive Double Mixup) is a new semi-supervised learning methodology that introduces a triplet mixup data augmentation approach for finding code vulnerabilities in imbalanced, tabular security data sets. Tabular data sets in cybersecurity domains are widely known to pose challenges for machine learning because of their heavily imbalanced data (e.g., a small number of labeled attack samples buried in a sea of mostly benign, unlabeled data). Semi-supervised learning leverages a small subset of labeled data and a large subset of unlabeled data to train a learning model. While semi-supervised methods have been well studied in image and language domains, they remain underutilized in security domains, especially on tabular security data sets, which pose particularly difficult contextual information loss and class balance challenges for machine learning. Experiments applying Con2Mix to collected security data sets show promise for addressing these challenges, achieving state-of-the-art performance on two evaluated data sets compared with other methods.

Keywords

Semi-supervised learning, contrastive learning, tabular data sets, security data sets

1. Introduction

Supervised learning [31,36,37,43] has shown great success with large, labeled data sets collected and annotated by researchers in image, language, and many other domains. However, in real-world cybersecurity domains, the task of accurately labeling large data sets for supervised learning is often infeasible. For example, security-critical software products, such as operating systems, web servers, cloud computing architectures, and networking stacks, often consist of hundreds of millions of lines of code, and undergo hundreds or thousands of code changes per day (churn) as features are added, bugs are corrected, and hardware evolves. Accurately identifying and labeling even one security vulnerability in these large, mutating corpora often requires extremely high levels of expertise and many hundreds of person-hours of effort. As a result, previously undiscovered (zero-day) security vulnerabilities are so valuable and rare that some have sold for up to $2.5 million USD (see https://zerodium.com) to bug bounty programs and on the black market.

In order to aid defenders in analyzing such data sets before vulnerabilities become exploited, it is essential to efficiently leverage the small portion of the data set that has been labeled by experts, and train the model to handle the large amount of unlabeled data. Semi-supervised learning utilizes both labeled and unlabeled data to address this dilemma. In recent years, many semi-supervised methods [2,14,50,63] have emerged for different domains, bridging the gap between supervised learning and unsupervised learning. However, in security domains, semi-supervised methods on tabular security data sets remain relatively unexplored. This open problem has impeded cybersecurity research and practice, in that state-of-the-art software vulnerability detection approaches often suffer extremely high false positive rates (making them unusable), miss many vulnerabilities (making them unsafe), or cannot use machine learning approaches at all (posing scalability and automation problems).

Nevertheless, partially labeled data sets for this domain are widely available in the form of Common Vulnerabilities and Exposures (CVE; https://cve.mitre.org) and National Vulnerability Databases (NVDs; https://nvd.nist.gov) accumulated by the security community for decades for large, widely used software products. These databases are not exhaustive: they represent only the small subset of software security vulnerabilities that have been discovered and documented for a relatively small subset of all software products. However, the vulnerability instances collected in these data sets are widely considered to be representative of the vast, unknown expanse of yet-to-be-discovered vulnerabilities. For example, Common Weakness Enumerations (CWEs; https://cwe.mitre.org) represent informal efforts to document categories and patterns in these databases in order to document lessons learned and help programmers avoid mistakes conducive to software compromise. This data, therefore, constitutes a heavily imbalanced yet conceptually rich source of information about real-world software vulnerabilities.

There are many semi-supervised learning techniques in image and language domains due to the ease of data augmentation in these data sets. However, in security domains, data are usually in a tabular format. This impedes augmentation because the data lack context, causing significant information loss after augmentation. Tabular security data sets are also highly imbalanced. For example, the class ratio of the data set we use is 1:100 (positive:negative) for data set 1 and 1:50 (positive:negative) for data set 2, where positive means vulnerable cases and negative means non-vulnerable cases. This huge imbalance issue is difficult to solve with traditional down-sampling and up-sampling methods (e.g., [5]). To address this problem, researchers have proposed a variety of approaches. For example, set convolution (SetConv) [22] and episodic training have been proposed to extract a single representative for each class, so that classifiers can later be trained on a balanced class distribution. Focal Loss [35] addresses this class imbalance by reshaping the standard cross entropy loss such that it down-weights the loss assigned to well-classified examples, focusing the training on a sparse set of hard examples and preventing the vast number of easy negatives from overwhelming the detector during training. Class-balanced Loss [12] is a re-weighting method using the effective number of samples for each class to re-balance the loss.
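To make the loss re-weighting ideas above concrete, the following minimal Python sketch shows the focal loss and the class-balanced weight in their commonly published forms. The function names and the defaults `gamma=2.0` and `beta=0.999` are illustrative assumptions; they are not settings reported in this paper.

```python
import math


def focal_loss(p: float, y: int, gamma: float = 2.0) -> float:
    """Binary focal loss for one sample: -(1 - p_t)^gamma * log(p_t).

    p is the predicted probability of the positive class and y is 0 or 1.
    Setting gamma = 0 recovers the standard cross-entropy loss.
    """
    p_t = p if y == 1 else 1.0 - p
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def class_balanced_weight(n_samples: int, beta: float = 0.999) -> float:
    """Inverse of the effective number of samples: (1 - beta) / (1 - beta^n)."""
    return (1.0 - beta) / (1.0 - beta ** n_samples)


# Ratio of focal loss to cross entropy for an easy and a hard positive example.
easy_ratio = focal_loss(0.9, 1) / focal_loss(0.9, 1, gamma=0.0)
hard_ratio = focal_loss(0.1, 1) / focal_loss(0.1, 1, gamma=0.0)
print(easy_ratio, hard_ratio)

# Hypothetical class sizes with a 1:100 ratio; the minority class gets the larger weight.
print(class_balanced_weight(100), class_balanced_weight(10_000))
```

With `gamma=2.0`, the well-classified example keeps roughly 1% of its cross-entropy loss while the hard example keeps roughly 81%, which is the down-weighting effect described above.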

As an alternative to these methods, we propose a novel triplet mixup data augmentation method to address the class imbalance problem by only augmenting the minority data. Triplet mixup goes beyond the traditional pairwise mixup, mixing three different data points from the same class, thereby joining the information from the features of the triple data to a larger extent. Experimental results demonstrate the effectiveness of our proposed data augmentation method over the prior approaches. In addition to doing mixup on the input, we also do manifold mixup in the hidden space on each pair of embeddings to create multiple views for our contrastive loss term. We further leverage the unlabeled subset by pre-training the encoder and using label propagation [28] to generate pseudo-labels for the unlabeled samples. Subsequently, the trained encoder and samples, for which we have generated pseudo-labels, are transferred to a downstream task where a simple predictor with Mixup [66] augmentation is trained.

The contributions of our work are fourfold: (1) We propose a novel triplet mixup data augmentation method that can reduce the data imbalance problem in large, sparsely labeled data sets. (2) Our work is the first semi-supervised learning framework on tabular security data sets. (3) We develop a tool to automatically extract source-level function features from sources processed by Joern [59] and binary-level function features from binaries produced by C compilers. (4) We achieve state-of-the-art performance on all the experiments, which demonstrate the superiority of our proposed method for cybersecurity data domains.

2. Background

2.1. Vulnerability complexity

Listing 1. A buffer overflow vulnerability in a PHP interpreter (CVE-2015-3329)

Software vulnerability detection is widely considered to be a difficult problem requiring high levels of human expertise and training. Therefore, providing defenders with more powerful tools to help them find these subtle but dangerous vulnerabilities in large bodies of complex code is an increasingly acute need in the software industry.

Listing 1 shows an example of a vulnerable function (CVE-2015-3329) in the Linux PHP interpreter. Despite having a small number of lines without loops or conditionals, this bug evaded auditors for over two years, leaving affected Linux machines susceptible to remote compromise until it was detected and patched in April 2015. The flaw involves unsafely copying a file name with attacker-controllable length into a buffer, potentially overflowing it and corrupting adjacent memory. When adjacent memory includes code pointers, the attacker can take remote control of the program. Although many standard code complexity metrics (e.g., number of lines of code, looping structures, and conditions) misidentify this code site as unlikely to contain a vulnerability, additional features such as pointer arguments, distance to insecure libc functions, and pointer assignments might focus more auditor attention on this function.

Listing 2. An integer overflow vulnerability in OpenSSH v3.3 (CVE-2002-0639)

Listing 2 shows a classic example of integer overflow extracted from OpenSSH v3.3 (CVE-2002-0639). When nresp is 1073741824 and sizeof(char*) is 4, multiplying them overflows a 32-bit integer and results in 0, causing the xmalloc call to allocate no memory, and allowing the subsequent loop iterations to overflow the heap buffer. This illustrates how looping structures and if-else conditions are susceptible to human errors, such as missing conditions and wrong bound checks. In addition, this snippet is located near the external input source packet_get_int. The distance to the external inputs is important because those inputs must be sanitized before flowing deeper into the program logic.
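The wraparound arithmetic behind this overflow can be checked with a short Python sketch. Python integers do not overflow, so the 32-bit unsigned behavior is modeled explicitly with a mask; the helper name `alloc_size_32bit` is hypothetical and is not taken from the OpenSSH source.

```python
UINT32_MASK = 0xFFFFFFFF  # keeps only the low 32 bits, like a 32-bit unsigned integer


def alloc_size_32bit(nresp: int, elem_size: int = 4) -> int:
    """Return nresp * elem_size as a 32-bit unsigned integer would store it."""
    return (nresp * elem_size) & UINT32_MASK


# 1073741824 is 2**30, so 2**30 * 4 == 2**32, which wraps around to 0.
print(alloc_size_32bit(1073741824))

# One element fewer does not wrap: the requested size is 2**32 - 4 bytes.
print(alloc_size_32bit(1073741823))
```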

All these metrics intensify the vulnerability complexity and therefore can be used as alarming indicators (features). Such complexity metrics are used to approximate the vulnerability proneness by measuring how coupled the candidate function is both internally (e.g. variables, pointers, etc.) and externally (e.g. fan-in, fan-out, height, etc.). The features used in this research are further discussed in §2.2.

2.2. Data set and challenges

Previous work has considered the software vulnerability detection problem at various levels of granularity, including component-level granularity [23,65,74], source file-level granularity [17,39,48], and function-level granularity [64]. Finer granularities are more useful because they provide programmers and auditors with more precise information on where vulnerabilities might be located and how to patch them against exploitation; but they are often more difficult to achieve in practice because of the much greater level of human skill and time typically required to localize precise bug positions, especially for bugs that involve complex interactions between multiple components scattered throughout the code. This leads to greater data imbalance challenges, since finer-granularity labels are more difficult to produce. Accurately modeling software vulnerabilities therefore requires a suitably large data set. Moreover, the data set must usually consist of real-world applications and real-world bugs rather than synthetic data in order to yield a model that is effective for realistic vulnerability-finding tasks. Although synthetic data sets such as Juliet [3] include the simple, compact, minimal code required to compile a vulnerability, they are far from replicating complicated and diverse real-world vulnerabilities [9,70].

Building a large corpus of data using real-world applications is challenging. We used NVD archives to compile a labeled data set with a reasonable vulnerable-to-non-vulnerable ratio in records. The NVD database includes URLs with patches or exploit tags, but does not reveal which portions of each patch are relevant to the exploit, or which original code lines were exploited. Inferring the respective files, functions, and locations of the defective code is a manual effort requiring many person-hours. Automating this step by filtering out irrelevant code without harming the actual imbalance ratio is infeasible. We collected a suitably large corpus by manually inspecting 8 real-world applications [57] that are designed for a variety of different purposes and whose known vulnerabilities are recorded within the NVD.

Table 1 shows the descriptions of the applications in our data set. Each is a large-scale open-source project written in C/C++ containing hundreds or thousands of functions, and tens or hundreds of thousands of lines of code. By manually studying all NVD records for these applications over the past decade, we assembled a data set of over 400 vulnerabilities, the exact bug locations that gave rise to each vulnerability, and the type of attack that exploited each vulnerability and that motivated the patch.

Previous work shows that the vulnerability likelihood is proportional to the software complexity [47]. Complexity is reflected in metrics such as the number of lines of code, the number of variables, and nested structures. Different code features may be associated with different vulnerabilities; hence, we categorize them into four different dimensions. Table 2 summarizes the four dimensions of features: structure-based, flow-based, pointer-based, and binary-based. These dimensions are inspired by prior works that have shown the effectiveness of using different aspects of software structures for modeling and discovering vulnerabilities [18,59]. In particular, the different dimensions are useful for detecting different types of vulnerabilities. For example, flow-based features tend to be associated with component interaction errors, wherein software components make inconsistent, conflicting assumptions about prerequisites or interfaces; whereas pointer-based features tend to be associated with incorrectly computed references to data or memory.

Table 1
Applications

Application	Description
Sudo	Enables users to run programs with the security privileges of another user
Proftpd	A simple, highly configurable FTP server
Libtiff	Library for reading and writing Tagged Image File Format (TIFF) files
Libpng	Library for reading, creating, and manipulating Portable Network Graphics (PNG) files
Freetype	Library used to render text into bitmaps
TinTin	A console telnet client for playing MUDs (Multi-User Dimensions)
Tcpdump	Data-network packet analyzer
Openssh	A secure networking utility based on the Secure Shell protocol

Table 2

Features

Dimension	Feature	Description
Structure-based	Parameters	number of parameters [74]
	Cyclomatic Complexity	number of linearly independent paths [38]
	Loop Number	number of loops [48]
	Nesting Degree	maximum nesting level of control structures in a function [44]
	SLOC	number of source lines of code [21]
	Variables	number of local variables [48]
Flow-based	In-Degree	number of functions that call the function [41]
	Out-Degree	number of functions called by the function [41]
	Height	distance to the closest external data input
Pointer-based	Pointers	number of pointer variables
	Pointer Arguments	number of pointer arguments
	Pointer Assignments	number of pointer assignments
Binary-based	ALOC	number of lines of assembly code
	Conditions	number of binary conditions
	Cmps	number of cmp instructions
	Jmps	number of jmp instructions

The structure-based dimension reflects the complexity resulting from the number of parameters, variables, and lines of code. In addition, the cyclomatic complexity, loop number, and nesting degree can result in highly coupled and dependent code structures and can reflect certain types of vulnerabilities.

The flow-based dimension captures the complexity due to data flow. In-degree and out-degree measure the number of possible ways that data can pass through a function. The data from external inputs must be sanitized before use; otherwise, a vulnerability can exist in the vicinity of such entries. The distance to the nearest external data input (height) reflects this feature.

The pointer-based dimension represents the complexity arising from pointer uses. The number of pointers passed into a function and the number of pointer variables introduced inside the function capture the total pointer interactions. Pointer assignments and pointer arithmetic are helpful in identifying buffer overflow-inducing flows, as they are closely related to memory manipulation.

The binary-based dimension captures the complexities that might not be visible in the source code. Compilers can modify the source representation through optimization [1]. In addition, inlined assembly code might not be correctly parsed by source parsers [59], and source-based features cannot capture these effects. Therefore, our research introduces binary-based metrics to represent the executable properties.

We collect source-level function features using Joern [59], which processes source files into code property graphs (CPGs) reflecting their syntax, control-flows, and dataflows. In addition to source level feature collection, we collect binary-based data in order to detect compiler-generated vulnerabilities that cannot be observed at the source code level. To our knowledge, ours is the first data set that combines detailed, fine-grained CVE data for both source and binary features. We developed a tool to automatically extract these features from sources processed by Joern and binaries produced by C compilers.

2.3. Data augmentation

Data augmentation is a method to increase training data diversity without directly collecting more data [20]. It is often employed to address data insufficiency problems. In image and language domains, data augmentation has been well studied [13,24,27,29,30,34,46,49,51,68], and a variety of data augmentation methods are available. For example, in image domains researchers use zoom, flip, contrast adjustment, etc., and in language domains researchers use language translation, next-word and previous-word prediction, interpolation, etc. Dating back to 2002, SMOTE (Synthetic Minority Over-sampling Technique) [5] is an efficient up-sampling method for the minority class. In recent years, Mixup [66] has emerged as a popular data augmentation method for alleviating memorization in large deep neural networks and their sensitivity to adversarial examples. It regularizes the neural network to favor simple linear behavior between training examples by training the network on convex combinations of pairs of examples and their labels. RLCN [56] utilizes limited samples for training a model and then applies it to classify normal instances and detect the emergence of novel classes over time. However, tabular data lack context information, making it difficult to apply traditional data augmentation methods to tabular domains because they generate too much noise. To address this issue, MCoM [33] extends the idea of Mixup to Triplet Mixup, a data augmentation method. In this paper, we adopt the triplet mixup idea of MCoM and demonstrate its effectiveness with different experimental settings and different data sets.

2.4. Semi-supervised learning

Semi-supervised learning [14,26,32,33,58,60,63,67,69,71] is a branch of machine learning that uses both labeled and unlabeled data to perform certain learning tasks. Generally, semi-supervised methods can be partitioned into two categories: inductive methods and transductive methods. Inductive methods aim to construct a classifier that can generate predictions for any object in the input space. Wrapper methods, a simple inductive approach, first train classifiers on labeled data and then use the predictions of the resulting classifiers to generate additional labeled data [73]. The classifiers can then be re-trained on this pseudo-labeled data in addition to the existing labeled data. VIME [63] is a self- and semi-supervised learning framework that creates a novel pretext task of estimating mask vectors from corrupted tabular data in addition to the reconstruction pretext task for self-supervised learning. Contrastive Mixup [14] leverages mixup-based augmentation under the manifold assumption by mapping samples to a low-dimensional latent space and encouraging interpolated samples to have high similarity within the same-labeled class. These methods cannot be directly used in the security domain because they cannot handle the imbalanced data problem. Our method, Con2Mix, can be regarded as a wrapper method derived from Contrastive Mixup with an optimized Triplet Mixup data augmentation method, which is the main innovation compared to Contrastive Mixup. Our method works better for minority classes and performs better on security data sets.
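The wrapper idea described above (train on labeled data, pseudo-label the unlabeled data, then re-train) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a nearest-centroid classifier stands in for the real classifier, and all function names and toy data are hypothetical. It is not the pseudo-labeling procedure used by Con2Mix.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def sq_dist(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def fit_centroids(samples):
    """samples: list of (x, y) with y in {0, 1}; returns {label: centroid}."""
    return {c: centroid([x for x, y in samples if y == c]) for c in (0, 1)}


def predict(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda c: sq_dist(x, centroids[c]))


def wrapper_self_training(labeled, unlabeled, rounds=2):
    """Train on labeled data, pseudo-label the unlabeled data, then re-train."""
    centroids = fit_centroids(labeled)
    pseudo = []
    for _ in range(rounds):
        pseudo = [(x, predict(centroids, x)) for x in unlabeled]
        centroids = fit_centroids(labeled + pseudo)
    return centroids, pseudo


# Toy data: one labeled sample per class and four unlabeled samples.
labeled = [([0.0, 0.0], 0), ([4.0, 4.0], 1)]
unlabeled = [[0.5, 0.2], [3.5, 4.2], [0.1, 0.9], [4.4, 3.8]]
centroids, pseudo = wrapper_self_training(labeled, unlabeled)
print([y for _, y in pseudo])
```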

2.5. Contrastive learning

Contrastive learning [10,11,42,54,55,61,62] aims to embed augmented versions of the same sample close to each other while separating embeddings from different samples. Recent methods such as SwAV [4], MoCo [25], and SimCLR [7] with modified approaches have produced results comparable to the state-of-the-art supervised method on the ImageNet [15] data set. Similarly, PIRL [40], Selfie [53], and the InfoMin principle [52] reflect the effectiveness of the pretext tasks being used and how they boost the performance of their models. Contrastive learning has also started gaining popularity on several NLP tasks in recent years. INFOXLM [8], a cross-lingual pretraining model, proposes a cross-lingual pretraining task based on maximizing the mutual information between two input sequences and learning to differentiate machine translations of input sequences using contrastive learning. Most of the popular language models, such as BERT [16] and GPT [45], approach pretraining on tokens and therefore may not capture sentence-level semantics. To address this issue, CERT [19] was proposed, which pretrains models at the sentence level using contrastive learning. In this work, we extend contrastive learning to tabular domains by computing a contrastive loss in the embedding space.

2.6. Pros and cons of different baselines

Our method, Con2Mix, is a semi-supervised method. Thus, we compare our method with six supervised methods and two state-of-the-art semi-supervised methods. The supervised methods we choose are XGBoost, MLP, Logit Regression, SVM, Decision Tree, and KNN. The semi-supervised methods are VIME [63] and Contrastive Mixup [14]. Table 3 summarizes the pros and cons of these different baselines. Compared to supervised methods, our method utilizes only a small portion of labeled data and achieves better results. Compared to VIME, our method works for both continuous and discrete data. Compared to Contrastive Mixup, we add one more data augmentation in the input layer, which makes our model able to handle imbalanced data.

Table 3
Pros and cons of different baselines

Methods	Pros	Cons
Supervised Methods
XGBoost	Effective with large data sets.	Can overfit the data, especially if the trees are too deep with noisy data.
MLP	Can overfit the data, especially if the trees are too deep with noisy data.	Computations are difficult and time-consuming.
Logit Regression	Easy to implement and interpret, and very efficient to train.	Assumes linearity between the dependent and independent variables.
SVM	Effective in high-dimensional spaces.	Does not perform well when the data set is noisy.
Decision Tree	Can work with numerical and categorical features.	Tends to overfit. Cannot be used with big data.
KNN	No training step.	Sensitive to outliers.
Semi-supervised Methods
VIME	Proposes applicable and proper pretext tasks and augmentations for tabular data.	Limited to continuous data.
Contrastive Mixup	Improves on pretext tasks and augmentation methods for tabular data sets.	Limits the set of labeled samples that are initially used in the contrastive component.

3. Preliminaries

In order to introduce our proposed method, we first provide some formulas for the contrastive loss and the semi-supervised loss (supervised and unsupervised loss). These two kinds of losses are the foundation of our proposed method. Given a data set with $N$ examples, we define $D_{L} = \{(x_{i}, y_{i})\}_{i = 1}^{N_{L}}$ as the labeled data set and $D_{U} = \{x_{i}\}_{i = 1}^{N_{U}}$ as the unlabeled data set, where $x_{i}$ are the input features and $y_{i}$ are the discrete labels. In our case, the target task is binary classification, so $y_{i} \in \{0, 1\}$, where 1 indicates the sample is positive (vulnerable) and 0 indicates it is negative (non-vulnerable). In supervised learning, our goal is to learn a classifier $f$ that minimizes the corresponding loss function $l$. For the unlabeled data, we first assign pseudo-labels in a process called pseudo-labeling; our goal is then to learn a classifier $f_{u}$ that minimizes the corresponding loss function $l_{u}$ on the pseudo-labels $y_{ps}$.

3.1. Contrastive loss

Contrastive learning is a popular method in unsupervised learning. It improves task performance by contrasting samples against each other to learn the attributes that are common within a data class and the attributes that set one data class apart from another. Instead of directly using the labels of the data set, contrastive learning compares augmented data with the original data. Thus, data augmentation on the original data is usually the first step. The resulting model measures the similarity of data pairs so that views generated from the same sample are pulled together while views generated from different samples are pushed apart. In contrastive representation learning, a batch of $N$ samples is generally augmented through an augmentation function $Aug(\cdot)$ to create a multi-viewed batch with $2N$ pairs, $\{\tilde{x}_{i}, \tilde{y}_{i}\}_{i = 1, \dots, 2N}$, where $\tilde{x}_{2k}$ and $\tilde{x}_{2k - 1}$ are two random augmentations of the same sample $x_{k}$ for $k = 1, \dots, N$. The samples are fed to an encoder $e : x \to z$, which takes a sample $x \in X$ to obtain a latent representation $z = e(x)$. Typically, when defining a pretext task, a predictive model is trained jointly to minimize a contrastive loss function $l$:

$$\min_{e, h} \mathbb{E}_{(x, \tilde{y}) \sim P(X, \tilde{Y})} \left[ l(\tilde{y}, h(e(x))) \right] \tag{1}$$

where $h$ maps $z$ to an embedding space, $h : z \to v$. Within a multi-viewed batch, $i \in I = \{1, \dots, 2N\}$, the contrastive loss is defined as shown in Equation (2):

$$l = \sum_{i \in I} - \log \left( \frac{\exp(sim(v_{i}, v_{j(i)}) / \tau)}{\sum_{n \in I \setminus \{i\}} \exp(sim(v_{i}, v_{n}) / \tau)} \right) \tag{2}$$

where $sim(\cdot, \cdot) \in \mathbb{R}^{+}$ is a similarity function (e.g., dot product or cosine similarity), $\tau \in \mathbb{R}^{+}$ is a scalar temperature parameter, $i$ is the anchor, $j(i) \in A(i)$ is the index of its positive, with $A(i)$ denoting the positive(s) of anchor $i$, and $I \setminus \{i\}$ are the negatives. The positive and negative samples refer to samples that are semantically similar and dissimilar, respectively. Intuitively, the objective of this function is to bring the positives and the anchor closer in the embedding space $v$ than the anchor and the negatives (i.e., $sim(v^{a}, v^{+}) > sim(v^{a}, v^{-})$, where $v^{a}$ is the anchor and $v^{+}$, $v^{-}$ are the positive and negative, respectively).
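As an illustration of Equation (2), the following Python sketch computes the loss for a small multi-viewed batch. It assumes cosine similarity as $sim$ and an arbitrary temperature of 0.5, and it stores the two views of each sample at adjacent positions (0-based indices 2k and 2k + 1); the function names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(views, tau=0.5):
    """Equation (2): views[2k] and views[2k + 1] are two views of sample k."""
    total = 0.0
    n = len(views)
    for i in range(n):
        j = i ^ 1  # index of the positive: the other view of the same sample
        numerator = math.exp(cosine(views[i], views[j]) / tau)
        denominator = sum(
            math.exp(cosine(views[i], views[m]) / tau) for m in range(n) if m != i
        )
        total += -math.log(numerator / denominator)
    return total


# Views of the same sample are close together: low loss.
aligned = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
# Same vectors, but paired so that the two views of a sample are far apart: high loss.
misaligned = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.1, 1.0]]
print(contrastive_loss(aligned), contrastive_loss(misaligned))
```

Because the positive term also appears in the denominator, every summand is non-negative, and the loss shrinks as the paired views move closer together relative to the other samples in the batch.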

3.2. Semi-supervised loss

Semi-supervised learning (SSL) is a machine learning technique that uses a small portion of labeled data and a large amount of unlabeled data to train a predictive model. Supervised learning trains a machine learning model using a labeled data set. Unsupervised learning, on the other hand, is when a model tries to mine hidden patterns, differences, and similarities in unlabeled data by itself, without human supervision. Semi-supervised learning is halfway between supervised and unsupervised learning. Unlike unsupervised learning, SSL works for a variety of problems from classification and regression to clustering and association. Unlike supervised learning, the method uses small amounts of labeled data together with large amounts of unlabeled data, which reduces the expense of manual annotation and cuts data preparation time. In semi-supervised learning, the whole data set $D$ consists of two disjoint data sets $D_{L}$ and $D_{U}$, and the predictive model $f$ is optimized to minimize the supervised loss jointly with an unsupervised loss, as shown in Equation (3):

$$\min_{f} \mathbb{E}_{(x, y) \sim P(X, Y)} \left[ l(y, f(x)) \right] + \beta \, \mathbb{E}_{(x, y_{ps}) \sim P(X, Y_{ps})} \left[ l_{u}(y_{ps}, f(x)) \right] \tag{3}$$

where $\beta$ weights the unsupervised term.

The first term is estimated over the small labeled subset $D_{L}$, and the second, unsupervised term is estimated over the larger unlabeled subset $D_{U}$. The unsupervised loss function $l_{u}$ is defined to help the downstream prediction task; in our case, it acts like a supervised objective on pseudo-labeled samples.
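A minimal Python sketch of the objective in Equation (3) follows. It assumes that both $l$ and $l_{u}$ are binary cross-entropy losses and that expectations are estimated by batch averages; the source does not fix these choices, and the names `bce`, `semi_supervised_loss`, and `sigmoid_model` are hypothetical.

```python
import math


def bce(y, p, eps=1e-12):
    """Binary cross entropy for one sample with label y and predicted probability p."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def semi_supervised_loss(f, labeled, pseudo_labeled, beta=1.0):
    """Supervised loss on labeled pairs plus beta times the loss on pseudo-labeled pairs."""
    supervised = sum(bce(y, f(x)) for x, y in labeled) / len(labeled)
    unsupervised = sum(bce(y_ps, f(x)) for x, y_ps in pseudo_labeled) / len(pseudo_labeled)
    return supervised + beta * unsupervised


def sigmoid_model(x):
    """Toy predictor: a logistic function of the first feature."""
    return 1.0 / (1.0 + math.exp(-x[0]))


labeled_batch = [([2.0], 1), ([-2.0], 0)]
pseudo_batch = [([1.0], 1), ([-0.5], 0)]
print(semi_supervised_loss(sigmoid_model, labeled_batch, pseudo_batch, beta=0.5))
```

Setting `beta=0.0` reduces the objective to purely supervised training, while larger values give the pseudo-labeled samples more influence.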

4. Methodology

Figure 1 illustrates our proposed Con2Mix framework, containing 4 parts: (1) triplet mixup data augmentation on the minority (vulnerable) class to address the imbalance in the tabular security data set; (2) contrastive and feature reconstruction loss to train the encoder and the decoder; (3) pseudo-labeling of the subset of the unlabeled data using a label propagation technique; and (4) downstream tasks that train the predictor (e.g., MLP) with the fixed trained encoder.

Fig. 1. Overview of Con2Mix. (1) For the imbalanced data set, we first apply a triplet mixup data augmentation on the positive class (vulnerable cases) to get a more balanced data set. (2) We split the data set into a labeled data set and an unlabeled data set with a labeled ratio (e.g., 0.1). (3) Through an encoder and decoder, we get the feature reconstruction loss $l^{r}$, and through an encoder, hidden mixup, and projection network, we get the contrastive loss $l^{\sup}$. (4) We get pseudo-labels for the unlabeled data using label propagation. (5) After pre-training, the encoder is fixed and is used for downstream tasks with the generated pseudo-labels to train the predictor (e.g., MLP).

4.1. Triplet mixup data augmentation

This section proposes our novel data augmentation technique to generate more minority examples. By generating more minority examples, this method makes the data set more balanced, which alleviates the imbalance problem in tabular data.

Traditional Mixup [66] augments data by interpolating between a pair of data points:

$$\hat{x} = \lambda x_{i} + (1 - \lambda) x_{j}, \qquad \hat{y} = \lambda y_{i} + (1 - \lambda) y_{j} \tag{4}$$

where $\lambda \sim Beta(\alpha, \alpha)$ for $\alpha \in (0, \infty)$. In contrast, we propose a triplet mixup data augmentation to fit tabular domains:

$$\hat{x} = \lambda_{i} x_{i} + \lambda_{j} x_{j} + (1 - \lambda_{i} - \lambda_{j}) x_{k} \tag{5}$$

where $\lambda_{i}, \lambda_{j} \sim Uniform(0, \alpha)$ with $\alpha \in (0, 0.5]$. Data $x_{i}$, $x_{j}$, and $x_{k}$ are from the same class. In our case, they are all from the minority (vulnerable, positive) class because our goal is to alleviate the data imbalance problem.

For example, given three minority-class data points $(1, 2, 1)$, $(2, 3, 4)$, and $(5, 2, 1)$ with $\lambda_{1} = 0.2$, $\lambda_{2} = 0.3$, and $1 - \lambda_{1} - \lambda_{2} = 0.5$, the newly generated data point is $(3.3, 2.3, 1.9)$. If there are more than two class labels, we augment every class that has fewer than $10\%$ of the samples of the other classes.
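The worked example can be reproduced with a short NumPy sketch of Eq. (5). This is an illustration rather than a reference implementation: the function names `triplet_mixup` and `augment_minority`, the choice to sample the three rows without replacement, and the random seed are assumptions not stated in the text.

```python
import numpy as np

def triplet_mixup(x_i, x_j, x_k, lam_i, lam_j):
    """Eq. (5): weighted combination of three rows from the same class."""
    lam_k = 1.0 - lam_i - lam_j
    return lam_i * x_i + lam_j * x_j + lam_k * x_k

def augment_minority(X_min, n_new, alpha=0.5, seed=0):
    """Generate n_new synthetic minority rows from X_min (shape (n, d), n >= 3).

    alpha is the upper bound of the Uniform(0, alpha) weights, alpha in (0, 0.5].
    """
    rng = np.random.default_rng(seed)
    out = np.empty((n_new, X_min.shape[1]))
    for r in range(n_new):
        # Assumption: the three rows are distinct (sampled without replacement).
        i, j, k = rng.choice(len(X_min), size=3, replace=False)
        lam_i, lam_j = rng.uniform(0.0, alpha, size=2)
        out[r] = triplet_mixup(X_min[i], X_min[j], X_min[k], lam_i, lam_j)
    return out

# The three minority-class points from the worked example.
X_min = np.array([[1.0, 2.0, 1.0], [2.0, 3.0, 4.0], [5.0, 2.0, 1.0]])
example = triplet_mixup(X_min[0], X_min[1], X_min[2], 0.2, 0.3)
print(example)  # approximately [3.3 2.3 1.9]

synthetic = augment_minority(X_min, n_new=5)
```

Because $\alpha \le 0.5$ keeps the third weight non-negative, every synthetic row is a convex combination of its three parents and stays inside their per-feature range.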

Triplet Mixup is not limited to tabular domains; with suitable changes to the data representation, it can also be applied to computer vision and natural language processing domains.

4.2. Contrastive and feature reconstruction loss

In addition to augmenting the input data, we also apply mixup in the hidden space. Given an encoder $e$ comprising $T$ layers $f_{t}$ ($t \in \{1, \dots, T\}$), the samples are fed through the encoder to obtain an intermediate representation $h^{t}$ at layer $t$. We interpolate in this intermediate layer, which gives the hidden mixup:

$$\tilde{h}_{ij}^{t} = \lambda h_{i}^{t} + (1 - \lambda) h_{j}^{t} \tag{6}$$

where $\lambda \sim \mathrm{Uniform}(0, \alpha)$ with $\alpha \in (0, 0.5]$, and $h_{i}^{t}$ is the hidden representation obtained by feeding input $x_{i}$ to layer $t$ of encoder $e$. We then feed the augmented samples $\tilde{h}_{ij}^{t}$ and the original samples $h_{i}^{t}$, $h_{j}^{t}$ through the remaining encoder layers (after layer $t$, up to layer $T$) to obtain the latent representation $z$.
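The interpolation in Eq. (6) can be sketched as follows. This is a minimal illustration under stated assumptions, not the reference implementation: each sample is paired with a random partner of the same class, a single $\lambda$ is drawn per batch, and the two-layer toy encoder (weights `W1`, `W2`) is purely hypothetical.

```python
import numpy as np

def hidden_mixup(h, y, alpha=0.5, seed=0):
    """Eq. (6): interpolate hidden activations h (batch, width) between pairs.

    Assumptions: partners share the same label, and one lambda is
    drawn from Uniform(0, alpha) for the whole batch.
    """
    rng = np.random.default_rng(seed)
    lam = rng.uniform(0.0, alpha)
    partner = np.arange(len(h))
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        partner[idx] = rng.permutation(idx)  # shuffle within the class
    h_mix = lam * h + (1.0 - lam) * h[partner]
    return h_mix, partner, lam

# Hypothetical toy encoder: layer 1 (ReLU) followed by a linear layer 2.
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))
y = np.array([0, 0, 0, 1, 1, 1])
W1, W2 = rng.normal(size=(4, 5)), rng.normal(size=(5, 3))

h1 = np.maximum(X @ W1, 0.0)            # intermediate representation at t = 1
h_mix, partner, lam = hidden_mixup(h1, y)
z_mix = h_mix @ W2                      # mixed samples through the remaining layer
z_orig = h1 @ W2                        # original samples through the remaining layer
```

In this toy setting the remaining layer is linear, so the latent vectors of the mixed samples are the same interpolation of the latent vectors of their parents; with nonlinear layers this no longer holds exactly.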

As shown in Fig. 1, we denote the latent representation of labeled samples as $z_{l}$ and of unlabeled samples as $z_{u}$. We define the contrastive loss term to encourage samples created from pairs of the same class to have high similarity:

$$l_{\tau}^{\sup} = \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(\mathrm{sim}(h_{i}^{proj}, h_{p}^{proj}) / \tau)}{\sum_{n \in Ne(i)} \exp(\mathrm{sim}(h_{i}^{proj}, h_{n}^{proj}) / \tau)} \tag{7}$$

where $P(i) = \{p \mid p \in A(i), y_{i} = \tilde{y}_{p}\}$ is the set of indices of positives with the same label as example $i$, $|P(i)|$ is its cardinality, $Ne(i) = \{n \mid n \in I, y_{i} \neq y_{n}\}$ is the set of indices of examples with a different label, and $\tau$ is the temperature. Function $h^{proj}$ maps the latent representation to an embedding space via a projection network, where the contrastive loss term is defined. This objective encourages mixed-up labeled samples and anchors of the same sample to be close, leading to a more clusterable representation.
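A simplified NumPy sketch of Eq. (7) is given below for illustration. It makes several assumptions that the text does not fix: cosine similarity is used for sim, all projected rows (original or mixed) are treated as one labeled batch so that positives are simply the other rows with the same label, anchors with no positive or no negative are skipped, and the temperature value is arbitrary.

```python
import numpy as np

def sup_contrastive_loss(h_proj, y, tau=0.5):
    """Eq. (7) on a batch of projected embeddings h_proj (n, m) with labels y."""
    y = np.asarray(y)
    h = h_proj / np.linalg.norm(h_proj, axis=1, keepdims=True)
    sim = (h @ h.T) / tau                      # cosine similarity / temperature
    n = len(y)
    same = (y[:, None] == y[None, :]) & ~np.eye(n, dtype=bool)  # P(i)
    diff = y[:, None] != y[None, :]                              # Ne(i)
    total = 0.0
    for i in range(n):
        if not same[i].any() or not diff[i].any():
            continue  # anchor without positives or negatives contributes nothing
        log_den = np.log(np.exp(sim[i, diff[i]]).sum())
        total += -np.mean(sim[i, same[i]] - log_den)
    return float(total)

# Two tight groups of embeddings; labels either follow or cut across the groups.
h_demo = np.array([[1.0, 0.0], [1.0, 0.1], [-1.0, 0.0], [-1.0, 0.1]])
loss_clustered = sup_contrastive_loss(h_demo, [0, 0, 1, 1])
loss_scrambled = sup_contrastive_loss(h_demo, [0, 1, 0, 1])
```

The loss is lower when labels agree with the geometric grouping, which is the behavior the objective is meant to reward.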

In addition to the contrastive loss term, the encoder is trained to minimize the feature reconstruction loss via a decoder $f_{\theta}(e_{\phi}(x))$:

$$l_{r}(x_{i}) = \frac{|C|}{d} \sum_{c}^{|C|} \left\| f_{\theta}(e_{\phi}(x_{i}))^{c} - x_{i}^{c} \right\|_{2}^{2} + \frac{|D|}{d} \sum_{j}^{|D|} \sum_{o}^{d_{D_{j}}} \mathbb{1}[x_{i}^{d} = o] \log\left(f_{\theta}(e_{\phi}(x_{i}))^{o}\right) \tag{8}$$

where $C$ denotes the continuous features and $D$ the discrete features. We then combine the contrastive loss and the feature reconstruction loss:

$$L = \mathbb{E}_{(x, y) \sim D_{L}} \left[l_{\tau}^{\sup}(y, f(x))\right] + \beta \, \mathbb{E}_{x \sim D_{U} \cup D_{L}} \left[l_{r}(x)\right] \tag{9}$$
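For a single record, the reconstruction loss of Eq. (8) can be sketched as below. The sketch rests on assumptions: $d$ is taken to be the total number of features ($|C| + |D|$), the decoder is assumed to output one probability vector per discrete feature, and the discrete term is written as a negative log-likelihood (the usual cross-entropy sign convention) so that a better reconstruction gives a smaller loss. The function name and argument layout are hypothetical.

```python
import numpy as np

def reconstruction_loss(x_cont, x_cont_hat, x_cat, cat_probs):
    """Per-record reconstruction loss in the spirit of Eq. (8).

    x_cont, x_cont_hat: true and reconstructed continuous features (length |C|).
    x_cat: true category index for each discrete feature (length |D|).
    cat_probs: one predicted probability vector per discrete feature.
    """
    n_c, n_d = len(x_cont), len(x_cat)
    d = n_c + n_d  # assumption: d is the total feature count
    cont_term = np.sum((np.asarray(x_cont_hat) - np.asarray(x_cont)) ** 2)
    # Assumption: negative log-likelihood of the true category.
    cat_term = -sum(np.log(p[o]) for p, o in zip(cat_probs, x_cat))
    return float((n_c / d) * cont_term + (n_d / d) * cat_term)

# Two continuous features (one off by 1.0) and one binary feature predicted at 0.5.
loss_demo = reconstruction_loss(
    x_cont=[1.0, 2.0], x_cont_hat=[1.0, 3.0],
    x_cat=[1], cat_probs=[np.array([0.5, 0.5])],
)
```

Here the continuous part contributes $(2/3) \cdot 1$ and the discrete part $(1/3) \cdot \ln 2$.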

We first train the encoder with this loss for $K$ epochs to warm-start the representations in the latent space before pseudo-labeling and leveraging the unlabeled samples.

4.3. Pseudo-labeling

Until now, the contrastive loss term $l_{\tau}^{\sup}$ has used only the labeled set $D_{L}$. After $K$ epochs of training with the supervised contrastive loss term $l_{\tau}^{\sup}$, we apply label propagation [28,72]. Given the encoder trained on $D_{L}$ for $K$ epochs, we map the small labeled set $D_{L}$ and a subset of the unlabeled set $S_{U} \subset D_{U}$ to the latent space $z$ and construct an affinity matrix $G$ with entries:

$$g_{ij} = \begin{cases} \mathrm{sim}(z_{i}, z_{j}) & \text{if } i \neq j \text{ and } z_{j} \in NN_{k}(i) \\ 0 & \text{otherwise} \end{cases} \tag{10}$$

where $NN_{k}(i)$ is the set of $k$ nearest neighbors of sample $z_{i}$, and $\mathrm{sim}(z_{i}, z_{j})$ is the similarity measure (e.g., $z_{i}^{T} z_{j}$). We then obtain pseudo-labels for the unlabeled samples by computing the diffusion matrix $C$ and setting $\tilde{y}_{i} := \arg\max_{j} c_{ij}$, where $(I - \alpha A) C = Y$. We use the conjugate gradient method [28,73] to solve this linear system for $C$, which makes computing the pseudo-labels efficient. Here $A = D^{-1/2} W D^{-1/2}$ is the normalized adjacency matrix, $W = G^{T} + G$, and $D = \mathrm{diag}(W 1_{n})$ is the degree matrix. Once we obtain the pseudo-labels for the unlabeled subset $S_{U}$, we train the encoder with the unlabeled samples, treating the generated labels as ground truth. Finally, the loss term combining the contrastive loss and the reconstruction loss with pseudo-labels is shown in Equation (11):

$$L = \mathbb{E}_{(x, y) \sim D_{L}} \left[l_{\tau}^{\sup}(y, f(x))\right] + \gamma \, \mathbb{E}_{(x, y_{ps}) \sim S_{U}} \left[l_{\tau}^{\sup}(y_{ps}, f(x))\right] + \beta \, \mathbb{E}_{x \sim D_{U}} \left[l_{r}(x)\right] \tag{11}$$

We update the pseudo-labels every $f$ epochs of training with the above loss term.
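The propagation step can be illustrated on a toy latent space. The sketch below is not the reference implementation and makes these assumptions: latent vectors are L2-normalized and negative similarities are clipped to zero so that node degrees stay positive, $Y$ is taken to be a one-hot matrix with non-zero rows only for labeled samples, the labeled samples occupy the first rows of `z`, and a direct dense solve (`np.linalg.solve`) is swapped in for the iterative conjugate solver mentioned above, which is only reasonable for small examples. The values of `k` and `alpha` are arbitrary.

```python
import numpy as np

def propagate_labels(z, y_labeled, n_classes, k=2, alpha=0.99):
    """Pseudo-label all rows of z; the first len(y_labeled) rows are labeled."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    n = len(z)
    sim = z @ z.T
    np.fill_diagonal(sim, -np.inf)              # exclude i == j from neighbors
    nn = np.argsort(-sim, axis=1)[:, :k]        # indices of the k nearest neighbors
    rows, cols = np.repeat(np.arange(n), k), nn.ravel()

    G = np.zeros((n, n))
    G[rows, cols] = np.maximum(sim[rows, cols], 0.0)   # Eq. (10), clipped at 0
    W = G + G.T
    deg = W.sum(axis=1)
    d_inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1e-12)), 0.0)
    A = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]  # D^{-1/2} W D^{-1/2}

    Y = np.zeros((n, n_classes))
    Y[np.arange(len(y_labeled)), y_labeled] = 1.0      # assumption: one-hot labels
    C = np.linalg.solve(np.eye(n) - alpha * A, Y)      # (I - alpha A) C = Y
    return C.argmax(axis=1)

# Rows 0 and 1 are labeled (classes 0 and 1); rows 2-5 are unlabeled.
z_demo = np.array([[1.0, 0.0], [0.0, 1.0],
                   [0.9, 0.1], [0.8, 0.2],
                   [0.1, 0.9], [0.2, 0.8]])
pseudo = propagate_labels(z_demo, [0, 1], n_classes=2)
print(pseudo)  # [0 1 0 0 1 1]
```

Each unlabeled point inherits the label of the labeled point in its neighborhood graph component, which is the intended effect of diffusing labels over the affinity graph.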

4.4. Downstream tasks

After pre-training, the encoder is fixed and used for downstream tasks (see Fig. 1), in which the predictor (e.g., an MLP) is trained with the generated pseudo-labels. We apply Mixup augmentation [66] in the latent space and feed the samples to a set of fully connected layers. The downstream training loss combines the cross-entropy loss on the labeled subset (supervised, $\sup$) with the cross-entropy loss on the pseudo-labeled unlabeled subset (unsupervised, $unsup$):

$$L = l_{ce}^{\sup} + \gamma \, l_{ce}^{unsup} \tag{12}$$

where $ce$ denotes cross-entropy.
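Eq. (12) amounts to a weighted sum of two cross-entropy terms, as the following sketch shows. The helper names, the use of raw logits as predictor outputs, and the value of `gamma` are assumptions made for illustration.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross-entropy of integer labels under softmax(logits)."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def downstream_loss(logits_l, y_l, logits_u, y_pseudo, gamma=1.0):
    """Eq. (12): supervised CE plus gamma times CE on pseudo-labeled samples."""
    return cross_entropy(logits_l, y_l) + gamma * cross_entropy(logits_u, y_pseudo)

# With all-zero logits over two classes, each CE term equals ln 2.
demo = downstream_loss(np.zeros((3, 2)), [0, 1, 0],
                       np.zeros((4, 2)), [1, 1, 0, 0], gamma=0.5)
```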

4.5. Algorithm

Algorithm 1 integrates all four parts of our proposed Con2Mix method into a single pseudo-code listing. Triplet mixup data augmentation comprises lines 2–10, manifold mixup comprises lines 15–21, contrastive and feature reconstruction loss comprises lines 22–25, and pseudo-labeling comprises lines 26–30. Line 32 represents the total loss including pseudo-labeling. As mentioned in line 25, we first optimize the contrastive and feature reconstruction loss over $K$ epochs, then add pseudo-labeling to the loss term. Lines 36–41 represent the downstream tasks: training the predictor with the cross-entropy loss on both the supervised and unsupervised parts.
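The epoch schedule described in the text (warm-up for $K$ epochs, then pseudo-labels refreshed every $f$ epochs) can be made concrete with a small helper. This is only an illustration of the schedule, not a transcription of Algorithm 1; the function name, the 0-based epoch counter, and the choice to compute the first pseudo-labels at epoch $K$ are assumptions.

```python
def training_phase(epoch, warmup_epochs, refresh_every):
    """Return (use_pseudo_labels, refresh_pseudo_labels) for a 0-based epoch.

    warmup_epochs plays the role of K and refresh_every the role of f.
    """
    if epoch < warmup_epochs:
        return False, False          # warm-up: Eq. (9) loss only
    since_warmup = epoch - warmup_epochs
    return True, since_warmup % refresh_every == 0   # Eq. (11) loss afterwards

# Example with K = 3 and f = 2.
schedule = [training_phase(e, warmup_epochs=3, refresh_every=2) for e in range(6)]
```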

Algorithm 1: Con2Mix

5. Experiments

This section demonstrates the superiority of our method on a large, imbalanced tabular security data set that we manually assembled from NVD and open source repositories (see Section 2.2). Data set 1 contains 5692 records, of which 53 are positive (vulnerable) and 5639 are negative (non-vulnerable); with fewer than 1% positive cases, data augmentation is essential. We also run experiments on a less imbalanced data set, data set 2 (4491 records with 165 positive and 4326 negative cases), and obtain similar results.

We compare our method with 6 supervised methods (XGBoost [6], MLP, Logit Regression, SVM, Decision Tree, and KNN) and two state-of-the-art semi-supervised methods for tabular domains (VIME [63] and Contrastive Mixup [14]), with and without our proposed Triplet Mixup data augmentation. We compare our data augmentation method with existing down-sampling and SMOTE [5] up-sampling methods. We also experiment with different mixup strategies to show why triplet mixup is preferable. In addition, we compare against pairwise mixup data augmentation combined with different loss functions, and show that our method with the standard cross-entropy loss performs best among them. Finally, we experiment with different labeled ratios and choose 0.1 as the labeled ratio for semi-supervised learning.

All the experimental results record accuracy, precision, recall, TPR (true positive rate), TNR (true negative rate), micro F1 score, macro F1 score, and weighted F1 score. Positive means vulnerable and negative means non-vulnerable. For the security data set, we assign more importance to the recall and TPR.
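To make the reported metrics concrete, the sketch below computes them from a binary confusion matrix. The counts TP = 6, FN = 3, FP = 146, TN = 983 are hypothetical: no confusion matrix is reported in the text, and these values were chosen only because they round to the same percentages as the Con2Mix row of Table 4. The function name is likewise an assumption.

```python
def binary_metrics(tp, fn, fp, tn):
    """Metrics (as fractions) for a binary task where positive = vulnerable."""
    total = tp + fn + fp + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    tpr = tp / (tp + fn) if tp + fn else 0.0          # recall
    tnr = tn / (tn + fp) if tn + fp else 0.0
    f1_pos = 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0
    f1_neg = 2 * tn / (2 * tn + fp + fn) if 2 * tn + fp + fn else 0.0
    accuracy = (tp + tn) / total
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": tpr,
        "tpr": tpr,
        "tnr": tnr,
        # For single-label classification, micro-averaged F1 equals accuracy.
        "micro_f1": accuracy,
        "macro_f1": (f1_pos + f1_neg) / 2,
        "weighted_f1": ((tp + fn) * f1_pos + (tn + fp) * f1_neg) / total,
    }

# Hypothetical counts consistent with the Con2Mix row of Table 4.
m = binary_metrics(tp=6, fn=3, fp=146, tn=983)
print({name: round(100 * value, 2) for name, value in m.items()})
```

With so few positive cases, the weighted F1 score is dominated by the negative class, while the macro F1 score gives both classes equal weight; this is why the two can differ by more than 40 points in the tables.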

5.1. Results

5.1.1. Main experimental results

Table 4 shows the main experimental results of our method compared with 6 supervised methods and 2 semi-supervised methods (labeled ratio 0.1). We separate the experiments into 2 parts: (1) without triplet mixup data augmentation (upper part) and (2) with triplet mixup data augmentation (lower part). The upper part of Table 4 shows that without triplet mixup data augmentation, although accuracy and TNR are near 100, the precision, recall, and TPR are all 0; that is, every method predicts only the negative class and misses the positive (vulnerable) cases sought by defenders. All the methods therefore perform poorly on imbalanced tabular security data without triplet data augmentation, regardless of how much labeled data the model uses.

Table 4
Main experimental results on Data Set 1. Top two are shaded and best is bold

Model	Accuracy	Precision	Recall	TPR	TNR	Micro F1	Macro F1	Weighted F1
Without Triplet Mixup Data Augmentation
Supervised (4554 labeled data)
XGBoost	99.21	0	0	0	100	99.21	49.80	98.82
MLP	99.21	0	0	0	100	99.21	49.80	98.82
Logit Regression	99.21	0	0	0	100	99.21	49.80	98.82
SVM	99.21	0	0	0	100	99.21	49.80	98.82
Decision Tree	98.51	0	0	0	99.29	98.51	49.62	98.46
KNN	99.12	0	0	0	99.91	99.12	49.78	98.77
Semi-supervised (455 labeled data, 0.1 labeled ratio)
VIME	99.21	0	0	0	100	99.21	49.80	98.82
Contrastive Mixup	99.21	0	0	0	100	99.21	49.80	98.82
With Triplet Mixup Data Augmentation
Supervised (17798 labeled data)
XGBoost	98.68	0	0	0	99.47	98.68	49.67	98.55
MLP	97.54	4.76	11.11	11.11	98.23	97.54	52.71	98.03
Logit Regression	81.02	2.74	66.67	66.67	81.13	81.02	47.36	88.79
SVM	89.37	4.10	55.56	55.56	89.64	89.37	51.00	93.67
Decision Tree	97.28	4.17	11.11	11.11	97.96	97.28	52.34	97.89
KNN	91.39	4.12	44.44	44.44	91.76	91.39	51.52	94.79
Semi-supervised (1779 labeled data and 0.1 labeled ratio)
VIME	78.30	2.40	66.67	66.67	78.39	78.30	46.19	87.10
Con2Mix	86.91	3.95	66.67	66.67	87.07	86.91	50.20	92.28

The lower part of Table 4 applies triplet mixup data augmentation to the original positive data, yielding 17798 records from the 4554 original records. Con2Mix achieves the best recall and TPR (66.67) while using labels for only 10% of the data (1779 of 17798). Although Logit Regression and VIME reach the same recall and TPR, both perform worse on every other metric. In addition, Logit Regression is a supervised method that uses all the labeled data, whereas our method uses only 10% of the labels. Thus, with recall and TPR as the primary criteria, Con2Mix achieves the best performance among all the tested methods.

Table 5 shows the main experimental results of our method on data set 2. Adding our triplet mixup data augmentation raises the recall and TPR of every method except Decision Tree, whose recall stays at 25.00. Although Logit Regression and VIME achieve the highest recall and TPR (96.43 and 100, respectively), their other metrics are very low (accuracy is 13.59 for Logit Regression and 19.04 for VIME). Apart from these two methods, Con2Mix achieves the best recall and TPR (75.00, tied with KNN), with higher accuracy, precision, TNR, and F1 scores than KNN.

Table 5

Main experimental results on Data Set 2. Top two are shaded and best is bold

Model	Accuracy	Precision	Recall	TPR	TNR	Micro F1	Macro F1	Weighted F1
Without Triplet Mixup Data Augmentation
Supervised (3593 labeled data)
XGBoost	96.33	27.27	10.71	10.71	99.08	96.33	56.75	95.54
MLP	96.88	0	0	0	100	96.88	49.21	95.35
Logit Regression	96.77	33.33	3.57	3.57	99.77	96.77	52.40	95.49
SVM	96.77	0	0	0	99.89	96.77	49.18	95.29
Decision Tree	93.43	15.56	25.00	25.00	95.63	93.43	57.88	94.16
KNN	96.44	0	0	0	99.54	96.44	49.09	95.12
Semi-supervised (359 labeled data, 0.1 labeled ratio)
VIME	96.88	0	0	0	100	96.88	49.21	95.35
Contrastive Mixup	96.88	0	0	0	100	96.88	49.21	95.35
With Triplet Mixup Data Augmentation
Supervised (13218 labeled data)
XGBoost	94.54	21.62	28.57	28.57	96.67	94.54	60.89	94.91
MLP	88.53	11.34	39.29	39.29	90.11	88.53	55.72	91.46
Logit Regression	13.59	3.37	96.43	96.43	10.92	13.59	13.09	19.26
SVM	73.61	5.91	50.00	50.00	74.37	73.61	47.54	82.21
Decision Tree	87.97	7.45	25.00	25.00	90.00	87.97	52.51	90.99
KNN	65.70	6.52	75.00	75.00	65.40	65.70	45.35	76.62
Semi-supervised (1322 labeled data and 0.1 labeled ratio)
VIME	19.04	3.71	100	100	16.44	19.04	17.69	27.58
Con2Mix	74.16	8.54	75.00	75.00	74.14	74.16	50.04	82.59

Here, we give more weight to TPR and recall than to the other metrics, because detecting positive (vulnerable) data matters most in the security domain. Sometimes other metrics must be sacrificed to guarantee a higher TPR and recall.

5.1.2. Experimental results with different sampling methods

Table 6 compares our triplet mixup data augmentation method with down-sampling (removing data points from the majority class) and an up-sampling method (SMOTE [5]). Compared with down-sampling and SMOTE, our method achieves the best recall and TPR (66.67) and the best precision (3.95), demonstrating that it is better on the metrics that matter most for security data, at the cost of somewhat lower accuracy and TNR. Down-sampling discards negative samples and thus loses the information they carry. In contrast, up-sampling generates additional positive samples. SMOTE creates each synthetic positive sample by interpolating between a positive sample and one of its nearest positive neighbors, whereas our method draws on more information from the positive samples by mixing three data points.

Table 6
Experimental results with different sampling methods on Data Set 1. Top two are shaded and best is bold

Method	Accuracy	Precision	Recall	TPR	TNR	Micro F1	Macro F1	Weighted F1
Supervised SVM
Down Sampling	92.53	3.66	33.33	33.33	93.00	92.53	51.35	95.40
SMOTE	88.49	3.79	55.56	55.56	88.75	88.49	50.48	93.18
Semi-supervised (0.1 labeled ratio)
Con2Mix	86.91	3.95	66.67	66.67	87.07	86.91	50.20	92.28

5.1.3. Experimental results with different mixup strategies

Table 7 compares different mixup strategies: pairwise, quadruplet, pairwise + original (mixing a pair of data points, including the output of pairwise mixup), pairwise + triplet (mixing a pair of data points followed by triplet mixup, including the output of pairwise mixup), and triplet. Triplet achieves the best recall and TPR (66.67). Although pairwise + original and pairwise + triplet achieve the same recall and TPR, triplet achieves the best accuracy, precision, TNR, and F1 scores among these three. Pairwise does not generate enough augmentations, while quadruplet generates too many. Triplet outperforms the combined strategies because it mixes three data points with balanced weights.

Table 7
Experimental results with different mixup strategies on Data Set 1. Top two are shaded and best is bold

Method	Accuracy	Precision	Recall	TPR	TNR	Micro F1	Macro F1	Weighted F1
Pairwise	93.59	5.56	44.44	44.44	93.98	93.59	53.28	95.99
Quadruplet	84.18	2.23	44.44	44.44	84.50	84.18	47.82	90.69
Pairwise+Original	78.21	2.39	66.67	66.67	78.30	78.21	46.16	87.04
Pairwise+Triplet	85.68	3.61	66.67	66.67	85.83	85.68	49.55	91.57
Triplet	86.91	3.95	66.67	66.67	87.07	86.91	50.20	92.28

5.1.4. Experimental results with different loss functions

Table 8 compares different loss functions designed to deal with the data imbalance problem. We first apply pairwise mixup data augmentation to the original data and run experiments with different loss functions: Focal Loss [35], CB (Class Balanced) Loss [12], and Weighted CE (Cross Entropy) Loss. Weighted CE means cross-entropy loss with weights 1 and 10 for the majority and minority classes, respectively. Our model uses the standard cross-entropy loss. Our method achieves the highest TPR of 66.67, compared with only 44.44 for the others, while keeping accuracy and TNR near 87.
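For reference, the three imbalance-aware baselines can be sketched in their commonly published forms. These are illustrative formulas, not the configurations used in the experiments: the focusing parameter `gamma=2.0` and `beta=0.999` are example values, the weighted cross-entropy is averaged without renormalizing by the weights, and `p_true` denotes the predicted probability of each sample's true class (an assumed input layout).

```python
import numpy as np

def weighted_ce(p_true, y, class_weights=(1.0, 10.0)):
    """Cross-entropy with per-class weights (1 for majority, 10 for minority)."""
    w = np.asarray(class_weights)[np.asarray(y)]
    return float(np.mean(-w * np.log(p_true)))

def focal_loss(p_true, gamma=2.0):
    """Focal loss: down-weights well-classified samples by (1 - p_true)^gamma."""
    p_true = np.asarray(p_true)
    return float(np.mean(-((1.0 - p_true) ** gamma) * np.log(p_true)))

def class_balanced_weights(class_counts, beta=0.999):
    """Class-balanced weights based on the effective number of samples."""
    counts = np.asarray(class_counts, dtype=float)
    return (1.0 - beta) / (1.0 - beta ** counts)

# Class counts of data set 1 (negative, positive) before any train/test split.
cb_w = class_balanced_weights([5639, 53])
```

With `gamma=0`, the focal loss reduces to the standard cross-entropy, and the class-balanced weights assign a larger weight to the minority class.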

Table 8
Experimental results with different loss functions on Data Set 1. Top two are shaded and best is bold

Method	Accuracy	Precision	Recall	TPR	TNR	Micro F1	Macro F1	Weighted F1
With Pairwise Mixup Data Augmentation
Focal Loss	93.23	5.26	44.44	44.44	93.62	93.23	52.95	95.80
CB Loss	93.06	5.13	44.44	44.44	93.45	93.06	52.79	95.70
Weighted CE	93.59	5.56	44.44	44.44	93.98	93.59	53.28	95.99
With Triplet Mixup Data Augmentation
Con2Mix	86.91	3.95	66.67	66.67	87.07	86.91	50.20	92.28

5.2. Ablation study

We perform an ablation study on two important components: (1) Triplet Mixup Data Augmentation (Input Mixup), and (2) Mixup in Hidden Layers (Hidden Mixup). Table 9 shows the results after removing each mixup. Without input mixup, the model cannot detect positive (vulnerable) cases. After adding the triplet mixup data augmentation, even without hidden mixup, the recall and TPR increase from 0 to 66.67. After adding the hidden mixup, the other metrics increase slightly while recall and TPR remain unchanged.

Table 9
Ablation study on Data Set 1. Top two are shaded and best is bold

Method	Accuracy	Precision	Recall	TPR	TNR	Micro F1	Macro F1	Weighted F1
No Mixup	99.03	0	0	0	99.82	99.03	49.76	98.73
No Input Mixup	99.21	0	0	0	100	99.21	49.80	98.82
No Hidden Mixup	86.29	3.77	66.67	66.67	86.45	86.29	49.87	91.92
Con2Mix	86.91	3.95	66.67	66.67	87.07	86.91	50.20	92.28

5.3. Parameter analysis

The most important parameter in our work is the labeled ratio. Table 10 examines ratios from 0.01 to 0.90. A ratio of 0.01 yields recall and TPR of 100 with accuracy of only 2.64 and TNR of only 1.86, which means the model labels almost every sample as positive and performs very poorly on the security data set due to the lack of labeled data. Ratios from 0.02 to 0.10 achieve the best recall and TPR (66.67) among all the labeled ratios except 0.01, and within this range a ratio of 0.10 achieves the best accuracy, precision, TNR, and F1 scores. Increasing the ratio from 0.15 to 0.90 gradually lowers the recall and TPR from 55.56 to 11.11, while accuracy and TNR increase. This suggests that the model overfits the majority class of the data set. Thus, we choose 0.10 as the most appropriate labeled ratio.

Table 10
Experimental results with different labeled ratios on Data Set 1

Labeled ratio	Accuracy	Precision	Recall	TPR	TNR	Micro F1	Macro F1	Weighted F1
0.01	2.64	0.81	100	100	1.86	2.64	2.63	3.64
0.02	85.41	3.55	66.67	66.67	85.56	85.41	49.41	91.41
0.03	83.13	3.08	66.67	66.67	83.26	83.13	48.31	90.06
0.05	85.24	3.51	66.67	66.67	85.39	85.24	49.33	91.31
0.07	83.57	3.16	66.67	66.67	83.70	83.57	48.51	90.32
0.10	86.91	3.95	66.67	66.67	87.07	86.91	50.20	92.28
0.15	87.17	3.40	55.56	55.56	87.42	87.17	49.76	92.43
0.17	88.49	3.79	55.56	55.56	88.75	88.49	50.48	93.18
0.20	88.40	3.76	55.56	55.56	88.66	88.40	50.43	93.13
0.30	91.04	4.85	55.56	55.56	91.32	91.04	52.11	94.60
0.40	92.88	6.10	55.56	55.56	93.18	92.88	53.64	95.62
0.50	93.85	5.80	44.44	44.44	94.24	93.85	53.54	96.13
0.60	93.67	4.35	33.33	33.33	94.15	93.67	52.21	96.02
0.70	94.11	3.23	22.22	22.22	94.69	94.11	51.30	96.24
0.80	94.82	1.92	11.11	11.11	95.48	94.82	50.31	96.59
0.90	95.17	2.08	11.11	11.11	95.84	95.17	50.52	96.78

6. Discussion

Semi-supervised learning techniques are widely used in image and language domains due to the ease of data augmentation. However, in security domains, the data are typically in a tabular format, which makes augmentation challenging as it lacks context and leads to significant information loss. Tabular security datasets also suffer from a high degree of class imbalance, with ratios like 1:100 or 1:50 between positive (vulnerable) and negative (non-vulnerable) cases. Traditional methods, such as down sampling and up sampling (e.g. [5]), struggle to effectively address this issue. To tackle the problem, researchers have proposed various approaches including set convolution [22] and episodic training aiming to extract representative samples for each class to achieve a balanced class distribution. Focal loss [35] reshapes the standard cross-entropy loss to prioritize difficult examples during training. Class-balanced loss [12] employs re-weighting based on the effective number of samples for each class. Alternatively, we present a novel technique called triplet mixup data augmentation that focuses on augmenting minority data. Triplet mixup goes beyond pairwise mixup by combining three data points from the same class, effectively integrating their features.

Experimental results demonstrate the effectiveness of our proposed method compared to previous approaches. Additionally, we apply manifold mixup in the hidden space and leverage unlabeled samples through pre-training the encoder and using label propagation to generate pseudo-labels. The trained encoder and samples with pseudo-labels are then used for a downstream task, where a simple predictor with mixup augmentation is trained. Our method is not limited to the specific data sets we utilize; instead, it can be applied to various tabular data sets since the theory and framework are designed for general tabular data. Moreover, our method is particularly valuable for addressing the challenge of imbalanced tabular data sets, as the triplet mixup technique is specifically tailored to tackle the issue of class imbalance.

7. Conclusion and future work

This paper proposed and evaluated Con2Mix, a novel semi-supervised machine learning method for analyzing highly imbalanced tabular security data sets. The method includes 4 components: (1) Triplet Mixup Data Augmentation, (2) Contrastive and Feature Reconstruction Loss, (3) Pseudo-labeling, and (4) Downstream Tasks. Comparison of Con2Mix with 6 supervised methods and 2 state-of-the-art semi-supervised methods in tabular domains shows that without the triplet mixup data augmentation, all the methods perform poorly (with 0 precision, recall, and TPR) on the data set. After adding our proposed triplet mixup data augmentation, the results improve substantially, with Con2Mix achieving the best recall and TPR (66.67). Future research should explore generalizing our technique to additional data sets in other domains. To enhance our proposed method, our plan is to work on graph data sets, such as CFGs, CPGs, and ASTs extracted from open-source applications to detect software vulnerabilities at the source and binary levels. Our proposed method will help the developers or experts to automatically label the graphs related to each function and component.

Acknowledgment

The research reported herein was supported in part by NSF awards DMS-1737978, DGE-2039542, OAC-1828467, OAC-1931541, and DGE-1906630, ONR awards N00014-17-1-2995 and N00014-20-1-2738, DARPA FA8750-19-C-0006, Army Research Office Contract No. W911NF2110032 and IBM faculty award (Research).

References

Balakrishnan and

Reps, WYSINWYX: What you see is not what you eXecute, ACM Transactions on Programming Languages And Systems (TOPLAS)32(6) (2010). doi:10.1145/1749608.1749612.

Berthelot,

Carlini,

Goodfellow,

Papernot,

Oliver and

C.A.

Raffel, MixMatch: A holistic approach to semi-supervised learning, Advances in Neural Information Processing Systems (NeurIPS)32 (2019).

Boland and

P.E.

Black, Juliet 1.1 C/C++ and Java test suite, Computer45(10) (2012), 88–90. doi:10.1109/MC.2012.345.

Caron,

Misra,

Mairal,

Goyal,

Bojanowski and

Joulin, Unsupervised learning of visual features by contrasting cluster assignments, Advances in Neural Information Processing Systems (NeurIPS)33 (2020), 9912–9924.

N.V.

Chawla,

K.W.

Bowyer,

L.O.

Hall and

W.P.

Kegelmeyer, SMOTE: Synthetic minority over-sampling technique, Journal of Artificial Intelligence Research (JAIR)16(1) (2002), 321–357. doi:10.1613/jair.953.

Chen and

Guestrin, XGBoost: A scalable tree boosting system, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDDM), 2016, pp. 785–794. doi:10.1145/2939672.2939785.

Chen,

Kornblith,

Norouzi and

Hinton, A simple framework for contrastive learning of visual representations, in: Proceedings of the 37th IEEE International Conference on Machine Learning (ICML), 2020, pp. 1597–1607.

Chi,

Dong,

Wei,

Yang,

Singhal,

Wang,

Song,

X.-L.

Mao,

Huang and

Zhou, InfoXLM: An information-theoretic framework for cross-lingual language model pre-training, in: Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAAD-HLT), 2021, pp. 3576–3588.

M.-J.

Choi,

Jeong,

Oh and

Choo, End-to-end prediction of buffer overruns from raw source code via neural memory networks, in: Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI), 2017, pp. 1546–1553.

10.

C.-Y.

Chuang,

R.D.

Hjelm,

Wang,

Vineet,

Joshi,

Torralba,

Jegelka and

Song, Robust contrastive learning against noisy views, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16670–16681.

11.

Cui,

Zhong,

Tian,

Liu,

Yu and

Jia, Generalized parametric contrastive learning, IEEE Transactions on Pattern Analysis and Machine Intelligence (2023).

12.

Cui,

Jia,

T.-Y.

Lin,

Song and

Belongie, Class-balanced loss based on effective number of samples, in: Proceedings of the 37th IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 9268–9277.

13.

Dai,

Liu,

Liao,

Huang,

Cao,

Wu,

Zhao,

Xu,

Liu,

Li,

Zhu,

Cai,

Sun,

Li,

Shen,

Liu and

Li, AugGPT: Leveraging ChatGPT for text data augmentation, 2023, arXiv preprint arXiv:2302.13007.

14.

Darabi,

Fazeli,

Pazoki,

Sankararaman and

Sarrafzadeh, Contrastive mixup: Self-and semi-supervised learning for tabular domain, 2021, arXiv preprint arXiv:2108.12296.

15.

Deng,

Dong,

Socher,

L.-J.

Li,

Li and

Fei-Fei, ImageNet: A large-scale hierarchical image database, in: Proceedings of the 27th IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255.

16.

Devlin,

M.-W.

Chang,

Lee and

Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAAD-HLT), 2019, pp. 4171–4186.

17.

Doyle and

Walden, An empirical study of the evolution of PHP web application security, in: Proceedings of the 3rd International Workshop on Security Measurements and Metrics, 2011, pp. 11–20.

18.

Du,

Chen,

Li,

Guo,

Zhou,

Liu and

Jiang, Leopard: Identifying vulnerable code for vulnerability assessment through program metrics, in: Proceedings of the 41st International Conference on Software Engineering (ICSE), 2019, pp. 60–71.

19.

Fang,

Wang,

Zhou,

Ding and

Xie, Cert: contrastive self-supervised learning for language understanding, 2020, arXiv preprint arXiv:2005.12766.

20.

S.Y.

Feng,

Gangal,

Wei,

Chandar,

Vosoughi,

Mitamura and

Hovy, A survey of data augmentation approaches for NLP, 2021, arXiv preprint arXiv:2105.03075.

21.

Fenton and

Bieman, Software Metrics: A Rigorous and Practical Approach, 3rd edn, CRC Press, Inc., USA, 2014.

22.

Gao,

Y.-F.

Li,

Lin,

Aggarwal and

Khan, SetConv: A new approach for learning from imbalanced data, in: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021, pp. 1284–1294.

23.

Gegick,

Williams,

Osborne and

Vouk, Prioritizing software security fortification through code-level metrics, in: Proceedings of the 4th ACM Workshop on Quality of Protection (QoP), 2008, pp. 31–38. doi:10.1145/1456362.1456370.

24.

Hao,

Zhu,

Appalaraju,

Zhang,

Li and

Li, MixGen: A new multi-modal data augmentation, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023, pp. 379–389.

25.

He,

Fan,

Wu,

Xie and

Girshick, Momentum contrast for unsupervised visual representation learning, in: Proceedings of the 38th IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 9729–9738.

26.

Huang,

Wang,

Liu,

Chen and

Li, Contrastive semi-supervised learning for underwater image restoration via reliable bank, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 18145–18155.

27.

Ioffe and

Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015, pp. 448–456.

28.

Iscen,

Tolias,

Avrithis and

Chum, Label propagation for deep semi-supervised learning, in: Proceedings of the 37th IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 5070–5079.

29.

Kobayashi, Contextual augmentation: Data augmentation by words with paradigmatic relations, in: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Vol. 2, 2018, pp. 452–457.

30.

Krizhevsky,

Sutskever and

G.E.

Hinton, ImageNet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems (NeurIPS)25 (2012).

31. Lavee, Khan and Thuraisingham, A framework for a video analysis tool for suspicious event detection, Multimedia Tools and Applications 35(1) (2007), 109–123. doi:10.1007/s11042-007-0117-8.

32. Li, M.D. Hossain, Ochiai and Khan, 2MiCo: A contrastive semi-supervised method with double mixup for smart meter modbus RS-485 communication security, in: Proceedings of the IEEE 9th International Conference on Big Data Security on Cloud (BigDataSecurity), IEEE International Conference on High Performance and Smart Computing (HPSC) and IEEE International Conference on Intelligent Data and Security (IDS), 2023, pp. 30–39.

33. Li, Khan, Zamani, Wickramasuriya, K.W. Hamlen and Thuraisingham, MCoM: A semi-supervised method for imbalanced tabular security data, in: Proceedings of the 36th Annual IFIP Data and Applications Security and Privacy Conference (DBSec), 2022, pp. 48–67.

34. Li, Wang, Li, Khan and Thuraisingham, LPC: A logits and parameter calibration framework for continual learning, in: Findings of the Association for Computational Linguistics (EMNLP), 2022, pp. 7142–7155.

35. T.-Y. Lin, Goyal, Girshick, He and Dollár, Focal loss for dense object detection, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2980–2988.

36. M.M. Masud, Gao, Khan, Han and Thuraisingham, Classification and novel class detection in data streams with active mining, in: Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2010, pp. 311–324. doi:10.1007/978-3-642-13672-6_31.

37. M.M. Masud, Khan and Thuraisingham, A hybrid model to detect malicious executables, in: Proceedings of the IEEE International Conference on Communications, 2007, pp. 1443–1448.

38. T.J. McCabe, A complexity measure, IEEE Transactions on Software Engineering (TSE) SE-2(4) (1976), 308–320. doi:10.1109/TSE.1976.233837.

39. Meneely and Williams, Strengthening the empirical analysis of the relationship between Linus’ law and software security, in: Proceedings of the 4th ACM International Symposium on Empirical Software Engineering and Measurement (ESEM), 2010.

40. Misra and van der Maaten, Self-supervised learning of pretext-invariant representations, in: Proceedings of the 38th IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 6707–6717.

41. Nagappan, Ball and Zeller, Mining metrics to predict component failures, in: Proceedings of the 28th International Conference on Software Engineering (ICSE), 2006, pp. 452–461.

42. Park, Lee, I.-J. Kim and Sohn, Probabilistic representations for video contrastive learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 14711–14721.

43. Parveen, Z.R. Weger, Thuraisingham, Hamlen and Khan, Supervised learning for insider threat detection using stream mining, in: Proceedings of the IEEE 23rd International Conference on Tools with Artificial Intelligence, 2011, pp. 1032–1039.

44. Piwowarski, A nesting level complexity measure, ACM SIGPLAN Notices 17(9) (1982), 44–50. doi:10.1145/947955.947960.

45. Radford, Narasimhan, Salimans and Sutskever, Improving language understanding by generative pre-training, Technical Report, OpenAI, 2018.

46. Sennrich, Haddow and Birch, Improving neural machine translation models with monolingual data, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), 2016.

47. Shin and Williams, Is complexity really the enemy of software security?, in: Proceedings of the 4th ACM Workshop on Quality of Protection (QoP), 2008, pp. 47–50. doi:10.1145/1456362.1456372.

48. Shin and Williams, Can traditional fault prediction models be used for vulnerability prediction?, Empirical Software Engineering 18(1) (2013), 25–59. doi:10.1007/s10664-011-9190-8.

49. Simonyan and Zisserman, Very deep convolutional networks for large-scale image recognition, in: Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015.

50. Sohn, Berthelot, Carlini, Zhang, C.A. Raffel, E.D. Cubuk, Kurakin and C.-L. Li, FixMatch: Simplifying semi-supervised learning with consistency and confidence, Advances in Neural Information Processing Systems (NeurIPS) 33 (2020), 596–608.

51. Srivastava, Hinton, Krizhevsky, Sutskever and Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, Journal of Machine Learning Research (JMLR) 15(1) (2014), 1929–1958.

52. Tian, Sun, Poole, Krishnan, Schmid and Isola, What makes for good views for contrastive learning?, Advances in Neural Information Processing Systems (NeurIPS) 33 (2020), 6827–6839.

53. T.H. Trinh, M.-T. Luong and Q.V. Le, Selfie: Self-supervised pretraining for image embedding, 2019, arXiv preprint arXiv:1906.02940.

54. Wang, Guo, Z.-H. Deng and Lu, Rethinking minimal sufficient representation in contrastive learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16041–16050.

55. Wang, Zhao, Zhang, Ding, Wang and Shen, ContrastMask: Contrastive learning to segment every thing, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11604–11613.

56. Wang, Dong, Lin, Wang, M.S. Islam and Khan, Co-representation learning framework for the open-set data classification, in: IEEE International Conference on Big Data (BigData), 2019, pp. 239–244.

57. Wickramasuriya, ERLKing dataset, 2022.

58. Wu, Lin, Chen, Bai and Wang, Interpretable graph convolutional network for multi-view semi-supervised learning, IEEE Transactions on Multimedia (2023).

59. Yamaguchi, Golde, Arp and Rieck, Modeling and discovering vulnerabilities with code property graphs, in: Proceedings of the 35th IEEE Symposium on Security & Privacy (S&P), 2014.

60. Yang, Wu, Zhang, Jiang, Liu, Zheng, Zhang, Wang and Zeng, Class-aware contrastive semi-supervised learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 14421–14430.

61. Yang, Duan, Tran, Xu, Chanda, Chen, Zeng, Chilimbi and Huang, Vision-language pre-training with triple contrastive learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 15671–15680.

62. Yang, Li, Zhang, Xiao, Liu, Yuan and Gao, Unified contrastive learning in image-text-label space, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 19163–19173.

63. Yoon, Zhang, Jordon and van der Schaar, VIME: Extending the success of self- and semi-supervised learning to tabular domain, Advances in Neural Information Processing Systems (NeurIPS) 33 (2020), 11033–11043.

64. Younis, Malaiya, Anderson and Ray, To fear or not to fear that is the question: Code characteristics of a vulnerable function with an existing exploit, in: Proceedings of the 6th ACM Conference on Data and Application Security and Privacy (CODASPY), 2016, pp. 97–104.

65. Zeller, Zimmermann and Bird, Failure is a four-letter word: A parody in empirical research, in: Proceedings of the 7th International Conference on Predictive Models in Software Engineering (Promise), 2011.

66. Zhang, Cisse, Y.N. Dauphin and Lopez-Paz, Mixup: Beyond empirical risk minimization, 2017, arXiv preprint arXiv:1710.09412.

67. Zhang, Zhu, Hallinan, Zhang, Makmur, Cai and B.C. Ooi, BoostMIS: Boosting medical image semi-supervised learning with adaptive pseudo labeling and informative active annotation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 20666–20676.

68. Zhang, Liu, Ou, Zeng, Zhuo, Duan, Xiong, Yu, Liu, Liu and Ye, CarveMix: A simple data augmentation method for brain lesion segmentation, NeuroImage 271 (2023), 120041. doi:10.1016/j.neuroimage.2023.120041.

69. Zheng, You, Huang, Wang, Qian and Xu, SimMatch: Semi-supervised learning with similarity matching, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 14471–14481.

70. Zheng, Pujar, Lewis, Buratti, Epstein, Yang, Laredo, Morari and Su, D2A: A dataset built for AI-based vulnerability detection methods using differential analysis, in: Proceedings of the 43rd International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), 2021, pp. 111–120.

71. Zhou, Lu, Liu, Xu, Cheng and Niu, HyperMatch: Noise-tolerant semi-supervised learning via relaxed contrastive constraint, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 24017–24026.

72. Zhou, Bousquet, Lal, Weston and Schölkopf, Learning with local and global consistency, Advances in Neural Information Processing Systems (NeurIPS) 16 (2003).

73. X.J. Zhu, Semi-supervised learning literature survey, Technical Report, University of Wisconsin-Madison, 2008.

74. Zimmermann and Nagappan, Predicting defects using network analysis on dependency graphs, in: Proceedings of the 30th ACM/IEEE International Conference on Software Engineering (ICSE), 2008, pp. 531–540.

² https://zerodium.com