Abstract
The number and variety of IoT devices are increasing rapidly, which expands the attack surface of this class of device. The number of Botnet malware samples targeting IoT devices is also growing, with many new variants. This creates an urgent demand for an effective way to detect new variants of IoT Botnet malware. Many studies have addressed IoT Botnet malware detection using static and dynamic analysis. In particular, combining dynamic analysis with machine learning has shown clear advantages in detecting IoT Botnet variants. However, preprocessing the behavioral data produced by malware is still complicated, and the input vectors of the machine learning models still have a very large number of dimensions. In addition, these models consume a lot of resources and have limited detection capabilities. Moreover, dynamic analysis studies based on system calls mostly use call-frequency features and have not effectively exploited the life-cycle characteristics of IoT Botnet malware. In this paper, we propose the Directed System Call Graph (DSCG) feature to structure system calls sequentially. The DSCG is vectorized and used as input for a malware analysis model built on popular machine learning classifiers such as KNN, SVM, and Decision Tree. Experiments show that the features extracted from this graph have low complexity while still ensuring high accuracy in detecting IoT Botnets, especially newly emerged IoT Botnet families. The proposed model achieved ACC = 98.01%, TPR = 97.93%, FPR = 1.5%, and AUC = 0.9961 on a dataset of 5023 IoT Botnet and 3888 benign samples.
Introduction
With their rapid growth in number and type, IoT devices have brought many benefits to people. However, these devices also raise many information security issues. In particular, a growing number of attacks on IoT devices use malware such as Tsunami [1], Bashlite [2], Mirai [3], Bricker Bot [4], and MrBlack [5]. Protecting IoT devices against these malware families is therefore essential. In Vietnam, many IoT devices (including home wifi routers) have been infected with these IoT Botnets. Vietnam Posts and Telecommunications (VNPT) Group currently provides internet services through about 8 million home wifi routers. To ensure the stability of its Internet service, one of VNPT’s missions is to protect these routers from IoT Botnet malware. In our case study, we develop an IoT Botnet detection solution for 8 million online VNPT iGate routers. The main requirement of this challenge is therefore to deploy a lightweight and accurate model for IoT Botnet detection at the edge.
There are two main approaches to detecting IoT Botnets: static analysis and dynamic analysis. Static analysis parses source code or binary code without executing the software. Dynamic analysis runs the software in a monitored environment and analyzes its behavior in real time [6]. In our case, static analysis is not suitable for detecting IoT Botnets in real time, so we consider dynamic analysis combined with edge computing. With dynamic analysis, the data collected from the iGate router is of two main types: network streams and system call sequences. These data are the critical inputs to network-based and host-based intrusion detection systems (NIDS and HIDS), respectively. A NIDS usually detects malicious behavior only after the device has been successfully infected; by that time, the device has become part of the IoT Botnet and has begun to act as a “Bot”. A NIDS is therefore often ineffective at preventing IoT Botnet intrusion, and a HIDS based on device behavioral analysis is more effective in this case. System calls are one of the most useful features for analyzing system behavior, because any IoT Botnet behavior must issue system calls in order to execute. We therefore collect the system call sequence from the iGate router and process it at the edge to quickly detect and warn of IoT Botnet malware, as shown in Fig. 1. The IoT Agent is integrated into the iGate router to collect the system call data of the device. Once collected, the system call sequence is sent to the local IoT Analyzer device located in each management area. The IoT Analyzer analyzes the sequence and detects signs of IoT Botnet malware. It then sends the detection results to the central monitoring server, which broadcasts the warning to the people responsible for receiving alerts (administrators, users).
To detect and warn as quickly as possible, the edge processing model must combine high detection accuracy with low resource consumption, and it must be deployable on an embedded device. In this paper, we propose a system-call-based method of IoT Botnet malware detection that meets these requirements. Our main contributions are: (1) proposing the Directed System Call Graph (DSCG) feature to structure system calls sequentially; the DSCG is vectorized and used as input for a malware analysis model built on popular machine learning classifiers; (2) evaluating the effectiveness of the proposed method on a dataset of 5023 IoT Botnet and 3888 benign samples. The model based on the proposed method will be integrated into the IoT Analyzer to solve the practical problems of VNPT.

Fig. 1. The proposed edge protection solution for the iGate router.
The structure of this paper is as follows: Section 2 (Related Work and Background) presents the background and a survey of related works, Section 3 (The Proposed Method) describes the proposed approach, Section 4 (Evaluation) presents the experiments and evaluates the results, Section 5 (Discussion) discusses the results, and Section 6 (Conclusions) concludes the paper.
Related Work and Background
In this section, we present the reasons for using sequences of system calls to detect IoT Botnet malware (Section 2.1) and the methods for preprocessing this data (Section 2.2). We also survey related works on IoT Botnet malware detection that use the same dataset (summarized in Table 1).
Using system calls in IoT Botnet detection
With dynamic analysis, data describing the malicious behavior of malware can be collected from many sources, such as network flows, system calls, and device resource usage. Collecting more data, however, also means collecting more redundant and confusing data, which affects the accuracy of malware detection. It is therefore necessary to build preprocessing, analysis, and detection algorithms that work on a large amount of collected data, including noisy data, with a low false-negative rate (FNR). Among behavioral data, system calls provide more detailed and less redundant information about the interaction of an executable with the system. A system call is the only way for an executable to ask the system to perform operations such as initializing a process, creating or accessing files, handling input and output (I/O) devices, opening a port, or connecting to a network. To perform malicious acts, an IoT Botnet needs functions and services from the operating system. Any malicious action, such as creating a new directory, loading an executable file, writing to RAM, or opening a network connection, requires interaction with the operating system through a system call. Therefore, to characterize the behavior of the malware, it is important to track the sequence of system call events during the execution of the IoT Botnet. Different IoT Botnet families have different malicious targets, but all of these targets are revealed by examining the system call traces.
Preprocessing system calls
System calls are a very effective input for detecting various types of attacks on IoT devices, such as DoS, DDoS, theft of private information, password theft, and port scanning [7]. Many studies have described methods of preprocessing system call data for general machine learning/deep learning models of malware detection. There are two main trends: (1) treating the system calls as discrete, independent data items that are unrelated to each other (categorical attributes), in other words, as independent and identically distributed samples; (2) treating the system calls as sequential data whose items are closely related to each other and ordered.
Under the first view, researchers apply processing methods for discrete data to extract features for training the classification model [95]. A discrete attribute takes its values from a set of possible values (its domain). Classifying a new data sample A then amounts to checking whether the discrete attribute values of A belong to the value set D(X) of the discrete attribute of class X: value(A) ∈ D(X). This test is simple logic that requires very little computation. However, because a set is unordered, the order of the attribute values of sample A is discarded when the data is preprocessed in this way. This loss can lead to inaccurate classification results, even though the model consumes fewer computational resources.
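The loss of ordering described above can be illustrated with a minimal sketch. The two traces below are invented for illustration and are not taken from the dataset; `frequency_vector` and `transitions` are hypothetical helper names. The sketch only shows that two traces with identical call counts can differ in their call-to-call transitions.

```python
from collections import Counter


def frequency_vector(trace):
    """Order-free view: how often each system call name occurs."""
    return Counter(trace)


def transitions(trace):
    """Order-aware view: the set of (call, next_call) pairs in the trace."""
    return set(zip(trace, trace[1:]))


# Two hypothetical traces containing the same calls in a different order.
trace_a = ["socket", "connect", "write", "close"]
trace_b = ["write", "close", "socket", "connect"]

same_frequencies = frequency_vector(trace_a) == frequency_vector(trace_b)
same_transitions = transitions(trace_a) == transitions(trace_b)
print(same_frequencies, same_transitions)
```

A classifier that sees only the frequency view cannot tell these two traces apart, while a representation that keeps transitions can.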
In contrast to the first view, other researchers consider system call data to be ordered by occurrence, like the words of a paragraph written in a certain order. If the words of a paragraph are randomly permuted (without changing how often each one appears), its meaning becomes difficult to grasp. As with time-series data such as the audio signal of a conversation or the frames of a video, the data must be processed in a way that preserves its sequential attributes so that a computer can learn them. The most common treatment is to convert each word into a numeric vector that characterizes its meaning in context, a technique known as Word Embedding. This technique ensures that similar words have roughly the same vector value. A typical implementation is Word2vec, a 2-layer neural network with only one hidden layer, which takes a large corpus as input and produces a vector space of up to several hundred dimensions. Every word in the corpus is associated with a corresponding vector in this space, and the vectors are defined so that words sharing the same context in the corpus are placed close together. The meaning of a word is thus predicted from its previous occurrences in the corpus. There are two common ways to build word2vec: using the context to predict the target word (Continuous Bag of Words, CBOW) and using a word to predict its context (Skip-gram). The general model of word2vec (both CBOW and Skip-gram) is based on a fairly simple neural network. Let V be the vocabulary, the set of all n different words. The input layer is a one-hot encoding with n nodes representing the n words of the vocabulary. An activation function (softmax) is applied only at the last layer, and the loss function is the cross-entropy loss.
Between the input and output layers is the hidden layer of size k; its weights form the vectors used to represent words after the model is trained. However, word2vec has some disadvantages: (1) it cannot recognize words that are not in the training data set; to overcome this problem, the dictionary built for the words is often very large and consumes a lot of resources to store and process; (2) the neural network is very large: assuming word vectors with 300 attributes and a vocabulary of 1000 words, the network has a weight matrix of 300 × 1000 = 300,000 entries, so the Gradient Descent algorithm runs very slowly.
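The size estimate above can be checked with a short calculation. This is a sketch under the assumption of a basic word2vec network with one input-to-hidden and one hidden-to-output matrix and no bias terms; the function name `word2vec_weight_count` is hypothetical.

```python
def word2vec_weight_count(vocab_size, hidden_size):
    """Count weights in a basic word2vec network (biases ignored).

    The input-to-hidden matrix is vocab_size x hidden_size and the
    hidden-to-output matrix is hidden_size x vocab_size.
    """
    per_matrix = vocab_size * hidden_size
    return per_matrix, 2 * per_matrix


# Figures from the text: 1000 words, 300 attributes per word vector.
per_matrix, total = word2vec_weight_count(vocab_size=1000, hidden_size=300)
print(per_matrix, total)
```

The count grows linearly with the vocabulary, which is why a dictionary enlarged to cover unseen "words" quickly becomes expensive to train.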
Based on our review of the system call preprocessing methods above, we chose to treat system calls as sequential data. This choice avoids losing an important characteristic, the order of the calls. In theory, the behavior of a program can be modeled from the system calls it issues as it interacts with the operating system. However, a pure sequence representation makes it difficult to rearrange or add system calls, so a more flexible representation that captures the relationships between calls is required. A graph showing the relationships between calls can solve that problem. Such graphs are commonly referred to as System Call Graphs (SCG). Therefore, to make IoT Botnet detection more efficient, we propose the Directed System Call Graph (DSCG) feature to structure system calls sequentially. This feature makes it possible to exploit the information in system calls efficiently and to overcome the limitations of the word2vec model. In the next part, we survey and evaluate related works and then propose a method that addresses their shortcomings.
Related works
The IoT Botnet detection problem has attracted the attention of many researchers, who have published a range of methods. To have a basis for comparison with the proposed model, we survey and evaluate studies that use the dataset from the IoTPOT honeypot [8]. This dataset is widely used in IoT Botnet detection studies. In addition to network flow data, its authors also collected ELF samples of popular IoT Botnets such as Mirai and Bashlite. The related works are presented below, and their contents are summarized in Table 1.
Alhaidari et al. [9] present a method of detecting IoT Botnet malware based on the hidden Markov model (HMM). The authors build an algorithm that automatically re-selects appropriate input features for the HMM to improve its detection accuracy. For the IoTPOT dataset, they selected 31 of the 41 available features and evaluated the HMM with ACC = 94.67%, FPR = 1.88%, TPR = 47.86%, and F-measure = 0.5519. Although the accuracy is over 90%, the TPR below 50% shows the limitation of the proposed HMM: the model fails to detect more than half of the samples that are IoT Botnet malware.
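A high accuracy can coexist with a low TPR when benign samples dominate the test set. The following sketch uses the standard confusion-matrix definitions of ACC, TPR, and FPR; the counts are invented for illustration and are not the results reported in [9].

```python
def detection_rates(tp, fn, fp, tn):
    """Return (accuracy, true-positive rate, false-positive rate)."""
    acc = (tp + tn) / (tp + fn + fp + tn)
    tpr = tp / (tp + fn)  # share of malware samples that are detected
    fpr = fp / (fp + tn)  # share of benign samples that are flagged
    return acc, tpr, fpr


# Hypothetical test set: 100 malware samples and 900 benign samples.
acc, tpr, fpr = detection_rates(tp=45, fn=55, fp=18, tn=882)
print(f"ACC={acc:.3f} TPR={tpr:.3f} FPR={fpr:.3f}")
```

In this invented case the detector misses 55 of 100 malware samples, yet the many correctly classified benign samples keep the accuracy above 90%.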
Alhanahnah et al. [10] present an IoT Botnet malware detection method based on the K-means clustering model applied to the printable strings of the input file. These printable strings are extracted by decompiling the binary with IDA Pro. The N-gram algorithm is then applied to the strings to extract sequential features. Finally, the authors apply K-means to cluster the existing data and classify new data. With K = 100 (parameter of K-means) and N = 4 (parameter of N-gram), the proposed model detects IoT Botnet malware with accuracy ACC = 85.2%.
Karanja et al. [11] present a method of detecting IoT Botnet malware based on popular machine learning algorithms (KNN, Random Forest, Naïve Bayes) with features extracted from multi-level grayscale images of the malicious binary file. Each malicious binary file is converted to an 8-bit vector and stored as a two-dimensional array. This array is equivalent to a multi-level grayscale image with pixel values in the range [0, 255]. From this image, five characteristic values are extracted: Entropy, Contrast, Correlation, ASM, and IDM. Evaluated on the IoTPOT dataset, the method gives its best results with the Random Forest algorithm: ACC = 95.38%, AUC = 0.97, and F-measure = 0.95.
Meidan et al. [12] present an IoT Botnet detection method that uses a Deep Autoencoder to process collected network behavior data. The authors execute each sample directly on 9 types of real devices and capture the generated network data stream. From this network stream, 115 characteristic values are extracted. The authors then use a Deep Autoencoder to “compress” the input feature vector. This “compression” has two purposes: reducing the vector dimension and retaining the features that are most informative for sample classification. Training this deep learning network on the IoTPOT dataset yields FPR = 1.7%.
Shobana et al. [13] propose using a recurrent neural network (RNN) to detect IoT Botnet malware based on system call data. The system calls, collected in a virtual machine environment, are preprocessed with the N-gram algorithm and TF-IDF (Term Frequency-Inverse Document Frequency). The value N = 10 is applied to both the N-gram algorithm and TF-IDF. The resulting feature vector is fed into the RNN for training and evaluation. After training for 5 epochs, the proposed model achieves ACC = 98.71%. However, because of the limitations of the virtual machine environment used to execute the samples and collect system calls, the authors could use only 200 malicious and 200 benign samples from the IoTPOT dataset to evaluate the proposed model.
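The N-gram and TF-IDF preprocessing of system call traces can be sketched as follows. This is an illustration only: the traces are invented, N = 2 is used instead of 10 to keep the example short, and the weighting tf × log(N_docs / df) is one common TF-IDF variant that is assumed here, not necessarily the one used in [13]. The names `ngrams` and `tfidf` are hypothetical.

```python
import math
from collections import Counter


def ngrams(trace, n):
    """All windows of n consecutive system calls in the trace."""
    return [tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)]


def tfidf(traces, n):
    """Per-trace TF-IDF weights of the n-grams (tf * log(N_docs / df))."""
    docs = [Counter(ngrams(trace, n)) for trace in traces]
    # Document frequency: in how many traces each n-gram appears.
    df = Counter(gram for doc in docs for gram in doc)
    weights = []
    for doc in docs:
        total = sum(doc.values())
        weights.append({
            gram: (count / total) * math.log(len(docs) / df[gram])
            for gram, count in doc.items()
        })
    return weights


# Two hypothetical traces that share their first transition.
traces = [
    ["socket", "connect", "write"],
    ["socket", "connect", "close"],
]
weights = tfidf(traces, n=2)
print(weights[0])
```

An n-gram that occurs in every trace receives weight zero, so only the transitions that distinguish one trace from another contribute to the feature vector.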
Nguyen et al. [14] present a method that uses the graph embedding technique Subgraph2vec to preprocess the function call graph of IoT executables for IoT Botnet detection. The authors statically analyze the executable and select the calls containing Printable String Information (PSI) to construct the PSI-graph. They then extract the PSI-rooted subgraph from this PSI-graph to remove noise. Finally, they apply Subgraph2vec to reduce the feature vector dimension of the PSI-rooted graph from 533564 to 140 dimensions. The method has proven effective, with ACC = 97% and AUC = 0.96. However, because the graph construction depends entirely on static analysis, it inherits limitations such as dependence on the ability to decompile the samples and to collect them. In addition, constructing the graph only from functions that contain PSI may omit functions with encoded or non-PSI parameters. Finally, constructing subgraphs from the PSI-graph with an exhaustive algorithm over all vertices and depth d increases the complexity of processing this graph.
Related works for detecting malware in IoT devices
In summary, the main limitations of the models analyzed above are: (1) the training and testing datasets are not extensive enough and sometimes contain data from only one type of device with a fixed microprocessor architecture (ARM, MIPS) or a fixed operating system (most often Android), because of the limitations of the malware execution environment; (2) most studies using system calls consider only the frequency of the calls and have not effectively exploited their sequence characteristics; (3) related works that use graphs to process IoT Botnet system call sequences have shortcomings such as feature vectors with too many dimensions and feature values too complicated for effective classification; (4) the classification accuracy in practice is not high and depends on the processor architecture and operating system platform.
Therefore, we propose extracting features from a directed system call graph (DSCG) with a graph embedding technique and experimenting with simple machine learning classifiers. The purpose is to exploit the characteristics of system calls for effective dynamic analysis of IoT Botnets and to solve the problems above. In the next section, we describe the proposed method in detail.
The Proposed Method
In this section, we present the overall architecture, the main components of the proposed IoT Botnet malware detection model, and the DSCG feature extraction method. We use the IoT Botnet sample “Mirai” (with MD5: b86725ae00e40c21d8b374aa4d550bd9) to illustrate the steps of the proposed model.
Overview
The proposed method includes four main components: System Calls Collector (SCC), Directed System Call Graph Generator (DSCGG), Data Preprocessing (DP), Machine Learning Detector (MLD). First, we define the DSCG as follows:
A DSCG is a directed graph $G = (V, E)$, where $V$ is a set of vertices $v_i$, each representing the system calls with the same name and arguments, and $E$ is a set of edges $e_k$, each connecting a vertex $v_i$ to a vertex $v_j$ of the graph, with $E \subseteq V \times V$. A loop counts as only one edge.
The architecture of the proposed method is illustrated in Fig. 2. The proposed method has 4 main steps corresponding to the above components, including:

Fig. 2. The architecture of the proposed method.

Fig. 3. The system calls of the Mirai malware collected from V-Sandbox.
In the first step, the ELF file is executed in the Sandbox environment to collect the system calls it requests. This Sandbox environment must provide the best conditions for the IoT Botnet malware to exhibit as much behavior as possible. Therefore, we use V-Sandbox [15] to fully collect the interactions of the IoT Botnet malware with the system, including its system calls. Redundant information is then removed from the collected system call data by a simple preprocessing function. The result is a minimized system call sequence of the input ELF file, ready for building the DSCG.
In the second step, the directed system call graph (DSCG) is constructed from the system call sequence obtained in the first step. The structure of the DSCG is described in XML and stored in a file in the “gexf” format (Graph Exchange XML Format). This output format can be processed by different graph embedding algorithms.
In the third step, we preprocess the DSCG data before feeding it into the machine learning classifiers. Simply converting each graph into one vector would make classification on these graphs very complicated and time-consuming: the graphs in the experiment have a large number of nodes and edges, so the vectors converted from them would have a correspondingly large dimension. Therefore, we apply graph embedding to extract the characteristics of the DSCG efficiently and to reduce the dimension of this vector. This improves accuracy and reduces complexity when detecting IoT Botnets.
In the fourth step, the extracted feature set is used to train and evaluate IoT Botnet detection with popular machine learning algorithms such as KNN, SVM, Decision Tree, and Random Forest.
The evaluation results show the efficiency of the proposed method. The main components of the DSCG-based IoT Botnet malware detection model are presented in the following parts of this section.
Malicious behavioral traces of the IoT Botnet are collected by tracking the execution of each input ELF sample in a simulated environment that resembles a resource-limited IoT device fully connected to the Internet. The dynamic analysis applied in the proposed approach aims to collect system call logs. ELF samples are executed in the simulated environment, and their system call behavior is recorded using the built-in tooling of the environment. The recorded system calls reveal the interactions of the IoT Botnet malware, such as attempts to connect to the C&C server and to detect other devices through open service ports such as SSH, Telnet, and FTP. The simulation environment for this task must meet the following requirements: support multiple CPU architectures, provide a C&C server connection, provide dynamic link libraries, and collect the full interaction with the system while the executable runs. Therefore, the V-Sandbox [15] environment was used to collect the malicious behavior of IoT Botnet malware. The output of this step is a log file that records all system calls made by the IoT Botnet malware. Specifically, V-Sandbox monitors not only the system calls of the process started by the IoT Botnet malware but also those of its child processes.

Fig. 4. DSCG graph of the Mirai malware.

Fig. 5. DSCG graphs in gexf format.
The generation of the directed system call graph (DSCG) is described in detail below. In this step, we construct a DSCG for each input executable from the system call sequence obtained from V-Sandbox. Given the system call sequence of the input sample (illustrated in Fig. 3), each vertex of the DSCG defined above is labeled in the format “PID:System call name(Argument)”. The list of edges of the graph, each described as a pair of adjacent vertices, is then generated automatically and stored for conversion to the output file format “gexf” (shown in Fig. 5). A segment of the DSCG generated from the system call sequence of the Mirai malware is shown in Fig. 4.
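The conversion of labeled vertices and edge pairs into a GEXF document can be sketched with the Python standard library. This is a minimal illustration, assuming the GEXF 1.2 draft namespace and a directed default edge type; the function name `dscg_to_gexf` and the short vertex list are hypothetical, with labels written in the “PID:System call name(Argument)” format described above.

```python
import xml.etree.ElementTree as ET


def dscg_to_gexf(vertices, edges):
    """Serialize vertex labels and directed edges as a GEXF 1.2 string."""
    gexf = ET.Element(
        "gexf", {"xmlns": "http://www.gexf.net/1.2draft", "version": "1.2"}
    )
    graph = ET.SubElement(gexf, "graph", {"defaultedgetype": "directed"})
    nodes = ET.SubElement(graph, "nodes")
    ids = {}
    for index, label in enumerate(vertices):
        ids[label] = str(index)  # GEXF edges refer to nodes by id
        ET.SubElement(nodes, "node", {"id": str(index), "label": label})
    edge_list = ET.SubElement(graph, "edges")
    for index, (source, target) in enumerate(edges):
        ET.SubElement(
            edge_list,
            "edge",
            {"id": str(index), "source": ids[source], "target": ids[target]},
        )
    return ET.tostring(gexf, encoding="unicode")


# Hypothetical fragment of a DSCG for a process with PID 4882.
vertices = ["4882:execve()", "4882:chdir(/)", "4882:fork()"]
edges = [
    ("4882:execve()", "4882:chdir(/)"),
    ("4882:chdir(/)", "4882:fork()"),
]
gexf_text = dscg_to_gexf(vertices, edges)
print(gexf_text)
```

Storing the label on the node and a numeric id on the edge endpoints keeps long system call arguments out of the edge list, which makes the file easier to consume by graph embedding tools that read GEXF.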
Fig. 4 shows a partial DSCG of the Mirai malware. The malware starts running with the “execve()” system call; at this time, the PID of Mirai is 4882. After that, different system calls are issued consecutively with their corresponding parameters, such as “rt_sigprocmask(SIG_BLOCK,_[INT],_NULL,_8)”, “open(/dev/watchdog,_O_RDWR)”, “open(/dev/misc/watchdog,_O_RDWR)”, and “chdir(/)”. Child processes (with PIDs 4824, 4825, and 4826) are created when the Mirai malware uses the “fork()” system call. These child processes perform various tasks such as scanning other devices in the network and connecting to the C&C server.
The algorithm for constructing the DSCG is given as pseudo-code in Algorithm 1. Its input arguments are “logf”, the sequence of system calls collected from V-Sandbox and stored in the log file; “pid”, the process identifier of the first process that the ELF executes; and “u”, the name of the vertex used to connect PID branches in the DSCG. First, the algorithm initializes the sets of vertices and edges to the empty set. Next, it reads the lines of “logf” one by one to get the list of system calls collected by V-Sandbox. This list of system calls generates the corresponding vertices and edges of the DSCG. If a “fork” or “clone” system call appears in the list, a recursive call generates the branch of the DSCG for the child process that is created.
Algorithm 1 DSCG generation algorithm (Gen_DSCG)
Input:
    logf: the log file holding the sequence of system calls collected from V-Sandbox;
    pid: the process identifier of the first process that the ELF executes;
    u: the name of the vertex used to connect PID branches in the DSCG (None for the first process).
Output:
    E: the list of edges of the DSCG;
    V: the list of vertices of the DSCG.
Local:
    SC: the list of system calls of the ELF issued by process pid.
Function Gen_DSCG(logf, pid, u)
    E ← ∅
    V ← ∅
    SC ← ∅
    i ← 0
    While not EOF(logf)
        line ← readline(logf)
        If line.pid = pid
            SC[i] ← line
            i ← i + 1
        EndIf
    EndWhile
    j ← 0
    While j < i
        v ← pid + SC[j].name + SC[j].args
        V ← V ∪ {v}
        If u ≠ None
            E ← E ∪ {(u, v)}
        EndIf
        u ← v
        If SC[j].name = fork ∨ SC[j].name = clone
            child_pid ← SC[j].return
            (E', V') ← Gen_DSCG(logf, child_pid, u)
            E ← E ∪ E'
            V ← V ∪ V'
        EndIf
        j ← j + 1
    EndWhile
    Return (E, V)
EndFunction
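Algorithm 1 can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: the V-Sandbox log is assumed to be already parsed into (pid, name, args, return value) tuples, each recursive call is assumed to consider only the calls issued by its own PID, and the function name `gen_dscg` and the sample records are hypothetical.

```python
def gen_dscg(calls, pid, u=None, vertices=None, edges=None):
    """Build the DSCG branch of `pid` and, recursively, of its children.

    `calls` is a list of (pid, name, args, return_value) tuples in log
    order. Vertices are labeled "PID:name(args)"; repeated transitions
    are stored once because edges are kept in a set.
    """
    if vertices is None:
        vertices, edges = set(), set()
    for call_pid, name, args, ret in calls:
        if call_pid != pid:
            continue  # assumption: only follow calls issued by this process
        v = f"{pid}:{name}({args})"
        vertices.add(v)
        if u is not None:
            edges.add((u, v))
        u = v
        if name in ("fork", "clone"):
            # The return value of fork/clone is taken as the child PID.
            gen_dscg(calls, ret, u, vertices, edges)
    return vertices, edges


# Hypothetical parsed log: one parent (4882) and one child (4824).
calls = [
    (4882, "execve", "", 0),
    (4882, "chdir", "/", 0),
    (4882, "fork", "", 4824),
    (4824, "socket", "AF_INET", 3),
    (4882, "chdir", "/", 0),
]
vertices, edges = gen_dscg(calls, pid=4882)
print(sorted(vertices))
print(sorted(edges))
```

The fork vertex becomes the connecting vertex `u` of the child branch, so the graph branches at every process creation instead of forming a single chain.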
As shown in Fig. 4, the DSCG differs from existing system call graphs. Its new feature is that it accurately represents the system calls of the child processes created by the IoT Botnet malware through fork() or clone(). Traditional system call graphs, because of the limitations of the Sandbox environment, collect only the system call data of the main process of the malware, so most traditional system call graphs have no branching. By keeping track of all child processes created from the main process of the malware, V-Sandbox collects the most complete system call data. Combined with Algorithm 1, this gives the DSCG branches that describe the system calls of the child processes. The DSCG therefore contains more complete information about the system call sequence of the IoT Botnet malware than existing system call graphs.
Preprocess the DSCG graph data
To feed DSCG data into machine learning/deep learning classifiers, the entire graph must be represented as a fixed-length feature vector. In practice, however, each graph has a different number of vertices and edges, which carry the specific characteristics of each input sample. A technique is therefore needed to convert highly complex graph data into a uniform feature vector of fixed length. One of the most commonly used techniques for this problem today is graph embedding. Graph embedding maps a graph into Euclidean space, where graphs with similar structures or subgraph elements are located close together. Each graph is then represented by a feature vector, and these feature vectors are fed into machine learning classifiers. To evaluate how efficiently DSCGs represent system call data, we implement three graph embedding techniques in this paper: FEATHER [16], LDP [17], and Graph2vec [18].
Rozemberczki et al. [16] present the FEATHER method, which constructs characteristic functions of a graph based on the distribution of node features in node neighborhoods. The authors prove that isomorphic graphs have the same pooled characteristic function when the mean is used to pool the node characteristic functions. Hence, the FEATHER characteristic function can be used to represent graphs in classification and clustering problems. The graph-level FEATHER representation is calculated as the arithmetic mean of the node-level FEATHER values.
Cai et al. [17] present the Local Degree Profile (LDP) method, which extracts the characteristics of a graph from the degree of each vertex and the degrees of its first-order neighbors (the vertices connected to it by an edge). The authors denote a graph by $G(V, E)$, where $V$ is the set of vertices and $E$ is the set of edges. For each vertex $v \in V$, let $DN(v)$ denote the multiset of degrees of the vertices $u$ adjacent to $v$: $DN(v) = \{\mathrm{degree}(u) \mid (u, v) \in E\}$. For each vertex $v$ of $G$, five characteristic values are extracted: degree(v), min(DN(v)), max(DN(v)), mean(DN(v)), and std(DN(v)). This process is repeated for all vertices of $G$. Each of the five values is then summarized over all vertices as a histogram with 32 bins by default, and the five histograms are concatenated in order into a feature vector of 5 × 32 = 160 dimensions. This is the input feature vector for machine learning algorithms such as SVM.
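The LDP computation can be sketched with the standard library. This is a simplified illustration, assuming that degrees are computed on the undirected version of the graph, that self-loops are ignored, that the population standard deviation is used, that each histogram spans the observed range of its value, and that the graph is not empty; `local_degree_profile`, `histogram`, and `graph_vector` are hypothetical names, and the result may differ in detail from the implementation of [17].

```python
import statistics


def local_degree_profile(vertices, edges):
    """Map each vertex to (degree, min, max, mean, std) of DN(v)."""
    neighbors = {v: set() for v in vertices}
    for u, v in edges:
        if u != v:  # assumption: self-loops do not contribute to degree
            neighbors[u].add(v)
            neighbors[v].add(u)
    degree = {v: len(adjacent) for v, adjacent in neighbors.items()}
    profile = {}
    for v in vertices:
        dn = [degree[u] for u in neighbors[v]] or [0]  # isolated vertex
        profile[v] = (
            degree[v], min(dn), max(dn),
            statistics.mean(dn), statistics.pstdev(dn),
        )
    return profile


def histogram(values, bins, low, high):
    """Count values into `bins` equal-width bins over [low, high]."""
    counts = [0] * bins
    width = (high - low) / bins or 1.0  # avoid zero width if all equal
    for x in values:
        counts[min(int((x - low) / width), bins - 1)] += 1
    return counts


def graph_vector(profile, bins=32):
    """Concatenate one histogram per LDP value into a 5 * bins vector."""
    vector = []
    for k in range(5):
        column = [row[k] for row in profile.values()]
        vector.extend(histogram(column, bins, min(column), max(column)))
    return vector


# Hypothetical path graph a -> b -> c.
profile = local_degree_profile(["a", "b", "c"], [("a", "b"), ("b", "c")])
vector = graph_vector(profile)
print(profile["b"], len(vector))
```

Because the histograms have a fixed number of bins, graphs with different numbers of vertices all map to vectors of the same length.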
Related preprocessing studies for graph-structured data have mainly focused on distributed representations of nodes and subgraphs. However, using these methods to represent an entire graph as a fixed-length feature vector is quite complicated and consumes a lot of processing resources. Graph kernels have also proven effective for feature extraction, but they rely on handcrafted features, which hinders automating the search for a suitable graph kernel. To overcome this limitation, Narayanan et al. proposed a neural graph embedding framework called Graph2vec [18], which learns data-driven distributed representations of graphs of arbitrary size. The authors treat a graph as a document composed of rooted subgraphs and extend the Skip-gram method for documents (Doc2vec) to graph processing. The embeddings of Graph2vec are learned in an unsupervised manner. Hence, they can be used for any task, such as graph classification and clustering, and even as input to supervised machine learning methods. The experiments of Narayanan et al. show that Graph2vec achieves significant improvements in classification and clustering accuracy over substructure representation approaches and is competitive with modern graph analysis methods.
Classification by machine learning algorithms
Popular machine learning classifiers such as SVM, KNN, Decision Tree, and Random Forest were trained and evaluated on the features extracted from the DSCG graph. During training and evaluation, we tuned the main parameters of these models to find the best-fitting model. The parameters that we tuned are described in Table 1; the remaining parameters keep their default values.
Specifically, all of the selected machine learning algorithms have hyperparameters, which are settings that guide the training process on a particular dataset. There is usually no formula that determines the best value of each hyperparameter for a given model and dataset, so these values must be searched for. Among the available search methods, this paper uses the popular grid search with cross-validation (GridSearchCV) [19]. This method builds multiple models from different combinations of hyperparameter values and selects the combination with the best cross-validated performance. The candidate values of these hyperparameters are shown in Table 2.
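A minimal sketch of such a grid search with Scikit-learn is given below. The synthetic data stands in for the graph-embedding feature vectors, and the grid values are hypothetical placeholders rather than the candidate values actually used in the experiments; the choice of Random Forest, five folds, and AUC as the selection score are likewise assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the embedding vectors (assumption, not the paper's data).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Hypothetical candidate values; every combination is tried.
param_grid = {"n_estimators": [10, 50], "max_depth": [None, 5]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,               # 5-fold cross-validation on the training split
    scoring="roc_auc",  # pick the combination with the best mean AUC
)
search.fit(X_train, y_train)

best_params = search.best_params_
# With scoring="roc_auc", score() reports the AUC of the refitted best model.
test_auc = search.score(X_test, y_test)
```

After fitting, `search.best_estimator_` holds the model refitted on the whole training split with the selected combination, so it can be evaluated directly on held-out samples.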
Adjusted machine learning model parameters
In this section, the proposed model is implemented on hardware for evaluation (Section 4.2). The experimental results, based on the metrics in Section 4.3 and the dataset in Section 4.1, are summarized at the end of this section.
Dataset
The dataset plays an important role in evaluating the proposed model. At present, there are few datasets for IoT Botnet malware or for resource-limited IoT devices. In addition, in existing datasets, malicious samples often greatly outnumber benign ones. Therefore, in this paper, we collected and built our own dataset. IoT Botnet malware samples were collected from three main sources: IoTPOT [8], Virustotal [20], and VirusShare [21]. The collected malware samples include variants of the Mirai, Bashlite, Tsunami, Spike, Dofloo, MrBlack, and other malware families. Benign samples were extracted from firmware images of resource-constrained IoT devices such as wireless routers, IP cameras, smart lights, smart-box TVs, etc. We used the FMK (Firmware Modification Kit) tool [22] and Binwalk [23] to extract these benign files. The resulting dataset, used for experimentation, contains 5023 IoT Botnet and 3888 benign samples for multiple microprocessor architectures (including MIPS, ARM, X86, PowerPC, etc.). The labels of the samples were identified with the help of the Virustotal API [20]. The dataset is described in Table 3.
Distribution of dataset
For our evaluation, we used a server with an Intel Xeon E5-2689 2.6 GHz CPU, 16 GB of RAM, and a 2 TB hard drive. The V-Sandbox virtualization environment was installed from the source code shared by its author on GitHub. The graph embedding algorithms (FEATHER, LDP, and Graph2vec) were implemented from source code shared on GitHub. The machine learning algorithms such as SVM, KNN, Decision Tree, and Random Forest were implemented in Python with the Scikit-learn library [24]. The hyperparameters for these ML models defined during the experiment are shown in Table 1. We used three scenarios to divide the dataset, as in Table 2, to train and evaluate the proposed feature. We divided our dataset into sub-datasets for training and evaluation (validation method) for the following reasons: the number of IoT Botnet malware samples for IoT devices is still small; to prevent overfitting; and to evaluate the ability to detect new variants of the IoT Botnet malware.
In each sub-dataset, the sample files (both benign and malware) are executed in the V-Sandbox environment. The sandbox collects the system calls invoked by each input file and writes them to a log file. The corresponding DSCG graph is then built from the list of system calls in this log file according to Algorithm 1 and saved to a GEXF file. To preprocess the DSCG graph, this paper applies the graph embedding techniques Feather, LDP, and Graph2vec. Finally, the extracted feature sets are evaluated with popular machine learning models such as KNN, SVM, Decision Tree, etc.
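The step from a system call log to a directed graph can be illustrated with a short sketch. It is not Algorithm 1: it assumes, for illustration only, that each distinct system call name becomes a vertex and that an edge links each call to the call that immediately follows it, with the number of occurrences kept as an edge weight. The function name and the trace are hypothetical.

```python
from collections import Counter


def build_call_graph(syscalls):
    """Build a directed graph from an ordered list of system call names.

    Returns (vertices, edges): vertices in order of first appearance and
    a Counter mapping (source, target) pairs to how often they occur.
    """
    vertices = list(dict.fromkeys(syscalls))  # de-duplicate, keep order
    # Pair every call with its successor in the trace.
    edges = Counter(zip(syscalls, syscalls[1:]))
    return vertices, edges


# Hypothetical trace, not taken from a real sample.
trace = ["socket", "connect", "write", "write", "close"]
vertices, edges = build_call_graph(trace)
```

The resulting vertex list and weighted edge set can then be loaded into a graph library and serialized, for example to GEXF, before the embedding step.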
The candidate values of hyperparameters for ML models
Dataset split scenarios
In this section, the paper uses four metrics to evaluate the model’s effectiveness: Accuracy (ACC), True Positive Rate (TPR), False Positive Rate (FPR), and Area Under the Curve (AUC). These metrics are built on four basic counts. True-positive (TP) means that the predicted category is malware and the actual category is also malware; false-positive (FP) means that the predicted category is malware but the actual category is benign; false-negative (FN) means that the predicted category is benign but the actual category is malware; and true-negative (TN) means that the predicted category is benign and the actual category is also benign. We used TPR and FPR to evaluate the model more accurately on datasets that are imbalanced between the malicious and benign groups.
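The three count-based metrics follow directly from these definitions: ACC = (TP + TN) / (TP + FP + TN + FN), TPR = TP / (TP + FN), and FPR = FP / (FP + TN). The short sketch below computes them; the function name and the counts in the usage example are illustrative and are not results from the experiments.

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute ACC, TPR, and FPR from the four confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)  # share of all samples classified correctly
    tpr = tp / (tp + fn)                   # share of malware that is detected
    fpr = fp / (fp + tn)                   # share of benign files flagged as malware
    return {"ACC": acc, "TPR": tpr, "FPR": fpr}


# Illustrative counts: 100 malware and 100 benign test samples.
metrics = detection_metrics(tp=95, fp=2, tn=98, fn=5)
```

Because TPR is normalized by the number of malware samples and FPR by the number of benign samples, neither is distorted when one class is much larger than the other, unlike ACC.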
Experimental results
For the first evaluation scenario, the training dataset consists mostly of malware samples from the Bashlite family (in addition to a few other IoT Botnet samples). The testing dataset is the Mirai family (which appeared later than the Bashlite family). The classification results of the proposed feature on this sub-dataset are given in Table 6 (visualized in Fig. 6). The ROC curves of the models in this scenario are shown in Fig. 7. With this sub-dataset, the feature set built from the DSCG graph performs best with the model that combines the Random Forest classification algorithm and the LDP graph embedding algorithm (AUC = 0.9932). In this case, the metric values TPR = 0.9510 and FPR = 0.0076 show the efficiency of detecting new IoT Botnet malware variants based on features from the DSCG graph.
Evaluation results with the first scenario

Experimental results in scenario 1.

Typical ROC of ML models using Graph2vec in (a), Feather in (b), and LDP in (c) for the first scenario.
For the second evaluation scenario, the training dataset consists mostly of malware samples from the Bashlite and Mirai families. The testing dataset contains other IoT Botnet malware such as Tsunami, Spike, Dofloo, MrBlack, etc. The classification results of the proposed feature on this sub-dataset are given in Table 7 (visualized in Fig. 8). The ROC curves of the models in this scenario are shown in Fig. 9. With this sub-dataset, the feature set built from the DSCG graph performs best with the model that combines the Random Forest classification algorithm and the Graph2vec graph embedding algorithm (AUC = 0.9971). In this case, the metric values TPR = 0.9944 and FPR = 0.0294 show the ability to detect polymorphic variants of IoT Botnet malware based on features from the DSCG graph.
Evaluation results with the second scenario

Experimental results in scenario 2.
For the third evaluation scenario, the training dataset consists mostly of malware samples from the Mirai family (in addition to a few other IoT Botnet samples). The testing dataset is the Bashlite family (which appeared earlier than the Mirai family). The classification results of the proposed feature on this sub-dataset are given in Table 8 (visualized in Fig. 10). The ROC curves of the models in this scenario are shown in Fig. 11. With this sub-dataset, the feature set built from the DSCG graph performs best with the model that combines the Random Forest classification algorithm and the LDP graph embedding algorithm (AUC = 0.9981). In this case, the metric values TPR = 0.9924 and FPR = 0.0098 indicate the ability to accurately detect pre-existing IoT Botnet malware based on features from the DSCG graph.

Typical ROC of ML models using features from Graph2vec in (a), Feather in (b), and LDP in (c) for the second scenario.
The above evaluation results show that the features extracted from the DSCG graph achieve good results in detecting IoT Botnet malware (ACC = 98.01 %, TPR = 97.93 %, FPR = 1.5 %, AUC = 0.9961). Across the three evaluation scenarios, the DSCG graph has been shown to have different capabilities in the IoT Botnet malware detection problem. This feature works well with simple and popular machine learning classifiers such as KNN, SVM, Decision Tree, and Random Forest. The feature vector extracted from the DSCG graph has 128 dimensions with Graph2vec, 250 with Feather, and 160 with LDP. These dimensions are also smaller than in published works, which reduces computational complexity when the features are applied to IoT Botnet detection and classification models. A specific comparison with related studies is presented in Table 9. Compared with related works on the same dataset from IoTPOT, the proposed model has demonstrated the ability to detect IoT Botnet malware, with ACC = 98.01 %, TPR = 97.93 %, FPR = 1.5 %, and AUC = 0.9961. The high accuracy together with the low false-positive rate shows the efficiency of the model in detecting IoT Botnet malware. The proposed model also provides a feature vector of 128 dimensions, so that common machine learning algorithms can be applied with low system resource consumption. This makes it possible to deploy the proposed model on edge devices for the practical problem described in Section 1.
Evaluation results with the third scenario

Experimental results in scenario 3.

Typical ROC of ML models using features from Graph2vec in (a), Feather in (b), and LDP in (c) for the third scenario.
Comparison of the proposed model and related works
Conclusions
In this paper, we propose the Directed System Call Graph (DSCG) feature to sequentially structure the system calls. This DSCG graph is vectorized and used as input for building a malware analysis model based on popular machine learning classifiers. We evaluated the effectiveness of the proposed method on our dataset of 5023 IoT Botnet and 3888 benign samples. Across three different evaluation scenarios, the DSCG graph has been shown to have different capabilities in solving IoT Botnet malware detection problems. This feature works well with simple and popular machine learning classifiers such as KNN, SVM, Decision Tree, and Random Forest. The dimensions of the feature vectors extracted from the graph are also smaller than in published works, which reduces computational complexity when the features are applied to IoT Botnet detection and classification models. The model based on the proposed method will therefore be integrated into the IoT Analyzer to address the practical problem of VNPT.
Acknowledgment
This work is supported by a project of a key national science and technology program in Vietnam, grant code KC-4.0-05/19-25.
