Abstract
Over the last decade, the exponential growth in IoT devices, combined with weak security mechanisms, has exposed the IoT to more security challenges than ever before, especially botnet malware. Many security solutions exist for detecting botnet malware on IoT devices. However, detecting IoT botnet malware, particularly multi-architecture botnets, remains challenging. This paper proposes a graph-structured feature extraction mechanism integrated with reinforcement learning for multi-architecture IoT botnet detection. We evaluate the proposed approach on a dataset of 22849 samples, including actual IoT botnet malware, and achieve a detection rate of 98.03% with low time consumption. The proposed approach also achieves reliable results (96.69%) in detecting new IoT botnets, built for a new processor architecture, that do not appear in the training dataset. To promote future research in the field, we share the relevant datasets and source code.
Introduction
The number of IoT devices is still growing rapidly: Statista Inc. [1] forecasts 75.4 billion IoT devices in use worldwide by 2025, which is 2.4 times more than in 2020. Moreover, IoT devices are being used in many fields [2], [3]. Although the number of IoT devices has grown over the years, their security has been criticized [1] because they have little or no protection at both the software and infrastructure levels [4]. Poor security update management, coupled with this lack of protection, has resulted in serious network security issues that have attracted the attention of the security community and have been exploited by cybercriminals (attackers). It opens numerous windows for diverse types of attacks, typical of which is IoT malware. According to the AV-Test Institute’s survey [5], more than 350,000 new malware samples and potentially unwanted applications (PUA) are registered every day in 2020. Popular kinds of malware include, but are not limited to, viruses, worms, trojans, spyware, ransomware, and botnets. Among these malware threats, botnets stand out as the one that can benefit the most from the IoT. The main difference between a botnet and other malware is that a botnet only performs malicious actions when it receives instructions from the attacker’s command and control (C&C) servers. Although botnets are not new, IoT botnets have changed several forms of cyberattacks over the years.
Most botnets are mainly responsible for spam, phishing, and data theft, while IoT botnets have been reported to perform distributed denial-of-service (DDoS) attacks [6]. Two main reasons explain why IoT devices are a suitable target for a botnet. First, the lack of security mechanisms makes it easy for malware to install itself and infect devices. Second, the enormous number of IoT devices connected to the network provides a large pool of resources that cybercriminals can use to carry out large-scale attacks.
IoT botnets are detected using either signature-based or behavior-based techniques. Signature-based malware detection is efficient and fast but can be easily evaded by new and obfuscated malware. Behavior-based techniques, on the other hand, cope better with obfuscated malware, but they are time-consuming and cannot cover all potential behaviors. Besides, most IoT devices are resource-constrained [7] (i.e., in memory, bandwidth, battery, and computation) and heterogeneous (i.e., diverse in CPU architecture); therefore, sophisticated and highly configurable security protocols are not applicable [8]. To detect IoT botnets, methods based on machine learning (ML) are a promising option [9, 10]. Machine learning is an artificial intelligence technique that trains machines using various algorithms and helps devices optimize performance criteria from experience without explicit programming [11]. Machine learning techniques can be classified into three paradigms [12]: supervised learning, unsupervised learning, and reinforcement learning (RL), as shown in Fig. 1. Supervised and unsupervised learning focus primarily on data analysis problems, while RL is better suited to comparison and decision-making problems [13].

Machine learning classes.
For the machine learning approach, the input data matters. Executable samples are analyzed to extract features that can be used for detection. There are two basic malware analysis techniques [14]: static and dynamic analysis. In static analysis, features are obtained without running the samples. Static features that can be extracted include byte sequences, hash values, strings, N-grams, opcodes, and PE header information [15]. In dynamic analysis, in contrast, executable files are run and their runtime activities are collected. The runtime behavior includes registry key changes, file system operations, process execution, and network activities. Static detection of IoT botnets is an important line of defense because it allows IoT botnet samples to be detected before they start running. Therefore, this study follows the static analysis approach. According to [16], a number of studies apply reinforcement learning to IoT security as well as to the detection of IoT malware. However, none of them aims to detect multi-architecture IoT botnets with a static approach. Therefore, to the best of our knowledge, at the time of writing this is the first study on this issue.
The main contributions of this work are as follows: (1) We provide a novel PSI-walk-set-based feature for IoT botnet detection as a representative characteristic of PSI-graphs and ELF files. (2) We present a reinforcement learning model that extracts the PSI-walk set from each PSI-graph, which can represent the most significant behavioral functions of the executable. (3) We describe a data processing and classification pipeline that achieves satisfactory accuracy and prediction time and is also capable of cross-architecture and unseen malware prediction.
The rest of the paper is organized as follows. In Section 2, studies on detecting malware using reinforcement learning are discussed. Details of the proposed method are presented in Section 3. In Section 4, we conduct experiments and analyze and evaluate the results. Conclusions and future work are presented in Section 5.
Traditional botnet versus IoT botnet
Botnets on traditional computers (i.e., PCs and laptops) and on IoT devices are similar in characteristics and infection lifecycle behavior. However, given the limited resources of IoT devices, IoT botnets also have their own characteristics, which are summarized in Table 1.
The difference between traditional botnet and IoT botnet
Through an assessment of the evolution of IoT botnets [17–20], we can see the connection between the current types of IoT botnet. In particular, an IoT botnet typically contains two basic components and four supporting components, as illustrated in Fig. 2.

The infection process of the IoT botnet.
Components in the infection lifecycle include: (1) the IoT botnet that performs DDoS attacks; (2) the C&C server that controls the IoT botnet; (3) the scanner that scans for new IoT devices that can be exploited; (4) the reporting server that collects scan results; and (5) the loader used to log in to compromised devices. Components in the infection lifecycle can be stand-alone or combined.
Step 1: The botnet scans a random range of IP addresses over TCP ports (such as 23 and 2323) to find IoT devices with security holes that it can penetrate and infect to expand the botnet. After detecting such a device, the IoT bot launches a brute-force attack to access it. The dictionary is usually built from a combination of default accounts embedded in the malicious code.
Step 2: After compromising a device and obtaining the information needed to authenticate and escalate privileges on it, the IoT bot sends device-specific information to the report server through other service ports (not the initially scanned port).
Step 3: The IoT bot receives commands from the C&C server to check device-specific information such as the IP address and hardware architecture (MIPS, ARM, PowerPC, ...).
Step 4: After the C&C server receives the device-specific information, it tells the loader server to select the appropriate malicious script files.
Step 5: The loader server sends the appropriate malicious executable files to the targeted devices. As soon as the malicious executable file is downloaded and executed on the device, the IoT bot deletes the executable file and runs only in RAM to avoid detection. The bot also shuts down accessible services such as Telnet and SSH, and searches for other bots to remove them so that they do not consume device resources.
Steps 6 and 7: Through the C&C server, the attacker can order the IoT botnet to conduct distributed denial-of-service attacks against a specific target using techniques such as UDP flood, SYN flood, and GRE IP flood.
Depending on how features are extracted from malicious samples, malware analysis techniques can be classified into static analysis and dynamic analysis. The essential criterion that distinguishes the two is whether the binary sample is executed. Although it does not require any execution, static analysis can represent the structure as well as depict the maliciousness of the binary sample [21]. In contrast, dynamic analysis inspects a malware’s behavior by running the sample in a supervised environment [22]. There is also a combination of the two techniques, called hybrid analysis [23], that inherits the advantages of both.
Since IoT botnets have specific characteristics, including operating on multiple architectures such as x86, MIPS, ARM, and PowerPC [24], it is costly to virtualize and supervise a separate environment for each of them for dynamic analysis. Furthermore, some botnets can easily limit their activities inside a virtual machine. Therefore, in detecting IoT botnets, static analysis is often preferred because it deals with multi-architecture issues and mitigates the drawbacks of dynamic analysis.
In recent decades, the number and complexity of malware have been increasing dramatically. Because signature-based classifiers [25] are almost unable to detect unseen samples, researchers often leverage machine learning algorithms as an effective alternative [24, 26]. In general, machine learning has three basic learning paradigms: supervised learning, unsupervised learning, and reinforcement learning [27]. These paradigms have been applied at different stages of the flow for detecting malware or IoT botnets. Supervised and unsupervised algorithms such as Decision Tree, SVM, and SAE are often used as classifiers [28]. In contrast, reinforcement learning algorithms such as Q-learning and its modifications [29] have been applied at different steps of several detection flows [29, 30].
Mohammadkhani et al. [30] proposed integrating reinforcement learning into malware detection with the value set analysis technique [31], which aims to discover the most influential features based on information gain values. In this research, reinforcement learning determined the threshold of EAX stability, an essential component of the Dynamic Value Set Analysis detection framework [31]. Nevertheless, the paper’s contribution was limited to outlining a method for extracting behavior features related to API contact and presenting a statistical analysis of API contact from the files generated by entering the present running files system in a virtual environment. The ratio of malware samples in the examined dataset was not balanced, and the authors did not propose any augmentation or balancing methods to overcome this issue.
In the malware detection flow published by Zhiyang Fang et al. [32], reinforcement learning was used for feature selection. The paper presented a Deep Q-learning-based Feature Selection Architecture that automatically selects a small set of highly differentiated features. The reward for the agent at each step was the accuracy of the selected supervised algorithms. This method proved more efficient than others (such as Raman et al., Bai et al., and Kim et al.) by considerably reducing the number of selected features while maintaining detection accuracy.
Another study by Ngo et al. [33] demonstrated the advantages of reinforcement learning in the feature engineering step of the IoT botnet detection flow. In this study, each sample, whether IoT botnet or benign, was represented as a PSI-graph. To reduce complexity without losing the representative characteristics of the PSI-graph, the authors leveraged reinforcement learning to extract a novel feature from the PSI-graph called the PSI-walk. After the extracted PSI-walks were fed to an LSTM classifier, the proposed model’s accuracy was close to that of the classifier running on the original PSI-graph (97.1% versus 98.7%). Another strength of this research is its proposal for augmenting the generated PSI-walk dataset.
Mohammad Alauthman et al. developed an effective reinforcement learning-based detection system [34] designed to detect and identify infected hosts in a P2P botnet, including new bots with previously unknown behavior and payload. Besides extracting and reducing connection-level and packet-level features from network traffic, the authors used reinforcement learning as one of the two essential modules in the online detection phase. In more detail, the reinforcement learning module augmented the dataset for the offline training stage by filtering and gathering novel samples in the real-world operating context, which dynamically improved the detection system over time. By combining reinforcement learning in the online detection phase with a proper network traffic reduction technique, the model reached a detection rate of 98.3% while keeping the false positive rate as low as 0.01%.
Furthermore, Xiao et al. proposed a malware detection scheme based on Q-learning that demonstrated the role of reinforcement learning in solving engineering issues of the malware detection problem [29]. Several modified reinforcement learning algorithms were applied in this research. Firstly, Q-learning was used to achieve the optimal offloading rate without knowing the trace generation and radio bandwidth model of the neighboring IoT devices. Compared with the benchmark offloading strategy in [24], the detection accuracy increased by 40%, the detection latency was reduced by 15%, and the utility of the mobile devices increased by 47%. Secondly, by using both real defense experiences and virtual experiences generated by the Dyna architecture, the Dyna-Q-based malware detection scheme proposed in [29] not only improves learning performance but also reduces the detection latency by 30% and increases the accuracy by up to 18% compared with detection using Q-learning. The PDS-based malware detection scheme developed in [29] leverages the known radio channel model to accelerate learning and uses Q-learning to study the remaining unknown state space. As a result, the detection accuracy increased by 25% compared with the Dyna-Q-based scheme in a network consisting of 200 mobile devices. However, this scheme and its modifications only leverage reinforcement learning during the data transfer process, not directly in the classification task.
We propose an approach that generates PSI-walks from PSI-graphs via reinforcement learning and feeds the augmented PSI-walk dataset into classic machine learning classifiers for IoT botnet detection. In this section, we present the details of our proposed method, which uses PSI-walks as features in all data processing components.
Overview
In this section, we present the structure of our proposed method, including all main components and the preprocessing phases. The method consists of three main steps, as described in Fig. 3. Firstly, we generate the PSI-graph dataset from our executable dataset and feed it into a reinforcement learning block. The reinforcement learning algorithm treats each PSI-graph as the environment and generates a PSI-walk set representing the behavioral characteristics of the IoT executable. Next, the PSI-walk dataset is cleaned to remove duplicate samples and augmented to increase the total number of training samples. A data augmentation algorithm is proposed, which processes PSI-walks by random scrambling, deletion, and insertion. Finally, we apply the doc2vec [35] model to vectorize the PSI-walk data and feed the dataset to multiple machine learning classifiers for evaluation.

An overview of the proposed method.
We adopt the approach of Nguyen et al. [21] to represent IoT executables with PSI-graphs. A PSI-graph is a graphical representation of the execution pipeline of a program. Each vertex of the PSI-graph represents a hardcoded or imported function, while each directed edge represents a function call. Hence, a PSI-graph provides useful information about the runtime behavior of an executable.
Previous approaches focus on the structure of the whole PSI-graph or of its subsets, called PSI-rooted subgraphs [36]. These approaches achieve good classification results, but they require considerably more computing resources in terms of time or storage space. The resulting features are also difficult to understand or explain. This paper focuses on a novel way to process PSI-graphs by extracting the most informative walks in a graph. The PSI-graph of an IoT botnet contains both malicious and benign components. For example, when a Mirai botnet connects to its command and control (C&C) servers, this behavior is equivalent to a benign application connecting to a remote server. The DoS attack triggered by the attacking functions, however, is malicious behavior. We consider a walk in the PSI-graph to represent a sequence of functions that the executable may run. These walks are ranked, and the top-ranked walks are extracted as representative features of the executable.
Figure 4 illustrates a PSI-walk of a Gafgyt botnet. The PSI-walk makes it easy to understand how this botnet runs. First, the function “main” calls the function “processCmd” with a parameter that is a command received from the C&C server. “processCmd” activates a Telnet scanner via the function “TelnetScanner” to spread the bot. This function calls two imported functions, “sockprintf” and “send”, to return the scan result to the C&C server. As previously noted, we use a set of PSI-walks to represent an executable. Because we do not know precisely the real quality of the PSI-walks generated by the reinforcement learning model, we keep several of the best walks so as not to leave out any valuable PSI-walk.

An example of PSI-walk.
Figure 5 shows that the best PSI-walk set can provide much more useful information. Moreover, our classification method is based on natural language processing, so the PSI-walks have to be in exact form. That means we have to eliminate every PSI-walk whose vertices are all encoded or obfuscated. These thoroughly obfuscated PSI-walks can cause data mismatch because an encoded vertex only appears in that sample. Encoding and obfuscation techniques are a drawback of our approach and of static analysis-based methods in general.

An example of a PSI-walk set. If we only consider the best PSI-walk extracted by the reinforcement learning model, we only capture the spreading behavior of the botnet. The third PSI-walk represents a TCP flooding attack, so it is highly characteristic and necessary for classification algorithms. Hence, using the best PSI-walk set of each PSI-graph may outperform using the single best PSI-walk.
To find the expected PSI-walk set of a PSI-graph, we apply double Q-learning [37]. We use double Q-learning instead of the well-known vanilla Q-learning to prevent overestimation of action values in an environment such as a PSI-graph. Experiments show that our double Q-learning algorithm can converge to the optimal policy and perform well, as confirmed by analyzing the final results.
To extract PSI-walk sets, we frame the task as finding the walk that achieves as high a reward as possible. The main components of the double Q-learning model are defined as follows.
Agent is a robot standing on one vertex of the PSI-graph. The agent interacts with the environment to update its two Q-tables. When the training phase ends, the agent generates the PSI-walk set by listing all the vertices it visits under the optimal and near-optimal policies.
Environment is a single PSI-graph, in which the agent stands and moves. Furthermore, we let the model decide the agent’s starting vertex by connecting a new root vertex to all non-leaf vertices. The agent starts on this root vertex and has to decide the next vertex. In the final step, the root vertex is removed from the PSI-walk sequence.
Action is the set of all possible moves the agent can make. Thus, the action space is all vertices of a PSI-graph. However, the agent can only take valid actions, that is, moves to a neighbour vertex of the current vertex. To prevent the agent from getting stuck in a vertex’s self-loop, the valid actions do not include following a self-loop.
State is defined as a dictionary containing the current vertex the agent is standing on and a list of visited edges. The environment uses this list to calculate the reward, which is explained below.
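The environment, action, and state definitions above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the released implementation: the edge-list representation, the `ROOT` label, and the helper names `build_adjacency`, `add_root`, `valid_actions`, and `initial_state` are all introduced here for the example.

```python
from collections import defaultdict

ROOT = "<root>"  # hypothetical label for the artificial root vertex


def build_adjacency(edges):
    """Map each vertex to the list of its successors (duplicates kept: multigraph)."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj.setdefault(v, [])  # make sure sinks also appear as keys
    return dict(adj)


def add_root(adj):
    """Connect a new root vertex to every non-leaf vertex (outdegree > 0)."""
    rooted = {u: list(vs) for u, vs in adj.items()}
    rooted[ROOT] = [u for u, vs in adj.items() if vs]
    return rooted


def valid_actions(adj, current):
    """Distinct neighbours of `current`; following a self-loop is not allowed."""
    return sorted({v for v in adj[current] if v != current})


def initial_state():
    """State = current vertex plus the list of edges visited so far."""
    return {"vertex": ROOT, "visited_edges": []}


# Toy call graph loosely modelled on the Gafgyt example (illustrative only).
edges = [("main", "processCmd"), ("processCmd", "TelnetScanner"),
         ("TelnetScanner", "TelnetScanner"), ("TelnetScanner", "send")]
graph = add_root(build_adjacency(edges))
print(graph[ROOT])
print(valid_actions(graph, "TelnetScanner"))
```

In this toy graph the root is linked to every function that calls something, and the only valid move from “TelnetScanner” is to “send”, because its self-loop is excluded.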
Reward is how we measure the success or failure of an agent’s action in a given state. To some extent, we can consider the returned reward as the importance weight of the vertex the agent is moving to. The rules for calculating the reward in all situations are given below.
An informative function that plays an essential role in a program may call many other functions. For example, the function “processCmd” in the BASHLITE botnet source code receives the command input from the main function and, depending on that command, triggers an attack by calling attacking functions. There are many attacking functions for diverse types of attacks, such as TCP flooding, UDP flooding, and HTTP flooding, so the vertex “processCmd” has many edges to other vertices in the PSI-graph. Moreover, the more critical a function is, the more external library functions it may call. Therefore, we assume that the weight of each vertex depends on its outdegree.
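For a vertex v, the outdegree $\deg^{+}(v)$ is the number of its outgoing edges and the indegree $\deg^{-}(v)$ is the number of its incoming edges. A vertex with $\deg^{+}(v) = 0$ is called a sink, since it is the end of each of its incoming edges. Some vertices, such as “printf”, are sinks. For a directed graph with vertex set V and edge set E, the degree sum formula states that:
$$\sum_{v \in V} \deg^{-}(v) = \sum_{v \in V} \deg^{+}(v) = |E|$$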
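As a small worked check, the degree notions used here (outdegree, sinks, self-loop counts) and the degree sum formula can be computed on a toy multigraph. The edge list and the function names `sendTCP` and `sendUDP` are made up for illustration.

```python
from collections import Counter

# Toy directed multigraph given as (caller, callee) pairs; names are illustrative.
edges = [("main", "processCmd"), ("processCmd", "sendUDP"),
         ("processCmd", "sendTCP"), ("sendTCP", "sendTCP"),
         ("sendTCP", "printf"), ("sendUDP", "printf")]

out_degree = Counter(u for u, _ in edges)            # deg+(v)
in_degree = Counter(v for _, v in edges)             # deg-(v)
self_loops = Counter(u for u, v in edges if u == v)  # loop(v)
vertices = {x for edge in edges for x in edge}

# A sink has no outgoing edge.
sinks = sorted(v for v in vertices if out_degree[v] == 0)

# Degree sum formula: both sums equal the number of edges.
total_out = sum(out_degree.values())
total_in = sum(in_degree.values())
print(sinks, total_out, total_in, len(edges))
```

Note that a self-loop contributes to both the outdegree and the indegree of its vertex, which is why “sendTCP” has outdegree 2 here.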
A PSI-graph is a directed multigraph, so a vertex may have a majority of its outdegree coming from its self-loops. Self-loops are considered less informative than connections to other vertices. Therefore, a vertex with many self-loops receives a smaller reward than one with fewer self-loops. A vertex whose number of self-loops equals its outdegree gets a reward of 1, so the agent still moves to it if there is no better path. In summary, given the number of self-loops of a vertex v, loop(v), moving from vertex u to vertex v in the PSI-graph receives a reward of R, where:
However, we encourage the agent to move to a vertex having self-loops rather than moving to one with no self-loops. Therefore, a vertex with self-loops receives a double reward, which is R′ = 2 * R.
In addition, due to the naming conventions of some programming languages, especially C, which is widely used in embedded programming, a vertex whose name starts with “_” or “__” is unlikely to be one of the main functions of the program, so such vertices only receive a reward of 1.
Finally, when travelling in PSI-graphs, the agent may enter cycles. To break a cycle at some point and try other walks, we decrease the reward of a visited edge. This means that moving along a visited edge receives only a reward of
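The reward rules can be collected into one function, sketched below under explicit assumptions. The base formula for R and the reduced reward for a visited edge are not reproduced here, so the sketch uses a hypothetical stand-in, R = deg+(v) - loop(v) + 1, and exposes the visited-edge reward as a parameter `visited_reward`. The order in which the rules take precedence is also an assumption.

```python
def reward(out_degree, loops, name, edge_visited, visited_reward=1):
    """Hypothetical reward for moving to a vertex v.

    out_degree     -- deg+(v), counting self-loops
    loops          -- loop(v), the number of self-loops of v
    name           -- function name attached to v
    edge_visited   -- True if the edge (u, v) was already traversed
    visited_reward -- placeholder for the reduced reward of a visited edge
    """
    if edge_visited:
        return visited_reward          # discourage going round a cycle again
    if name.startswith("_"):
        return 1                       # covers both "_" and "__" prefixes
    if loops == out_degree:
        return 1                       # only self-loops (or a sink)
    base = out_degree - loops + 1      # assumed stand-in for R
    return 2 * base if loops > 0 else base  # R' = 2 * R when v has self-loops


print(reward(5, 0, "processCmd", False))
print(reward(5, 2, "TelnetScanner", False))
print(reward(4, 0, "__libc_start_main", False))
```

With this stand-in, a vertex with more self-loops still scores lower than one with fewer self-loops and the same outdegree, while any vertex with at least one self-loop has its base reward doubled.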
Termination. The model runs for 1000 episodes. In each episode, the agent stops when either it has no valid move left or it has completed 50 steps.
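The training loop over these components can be sketched as tabular double Q-learning. This is a simplified sketch, not the authors' Algorithm 1: the Q-tables are keyed by (current vertex, next vertex) only, the exploration rate `EPSILON` is an assumed value, and `outdegree_reward` is a stand-in reward that returns the outdegree of the target vertex, or 0 for an edge already visited in the episode.

```python
import random
from collections import defaultdict

ALPHA, GAMMA = 0.9, 0.9          # learning rate and discount factor
EPISODES, MAX_STEPS = 1000, 50   # termination settings
EPSILON = 0.1                    # assumed exploration rate


def neighbours(adj, u):
    """Valid actions: distinct successors of u, self-loops excluded."""
    return sorted({v for v in adj[u] if v != u})


def outdegree_reward(adj, u, v, edge_visited):
    """Stand-in reward (assumption): outdegree of v, or 0 on a revisited edge."""
    return 0 if edge_visited else len(adj[v])


def train(adj, start, reward_fn, seed=0):
    rng = random.Random(seed)
    q1, q2 = defaultdict(float), defaultdict(float)
    for _ in range(EPISODES):
        u, visited = start, set()
        for _ in range(MAX_STEPS):
            actions = neighbours(adj, u)
            if not actions:
                break  # no way to move: the episode ends
            if rng.random() < EPSILON:
                v = rng.choice(actions)
            else:
                v = max(actions, key=lambda a: q1[(u, a)] + q2[(u, a)])
            r = reward_fn(adj, u, v, (u, v) in visited)
            visited.add((u, v))
            # Pick one table to update; the other one evaluates the chosen action.
            upd, ev = (q1, q2) if rng.random() < 0.5 else (q2, q1)
            nxt = neighbours(adj, v)
            if nxt:
                best = max(nxt, key=lambda a: upd[(v, a)])
                target = r + GAMMA * ev[(v, best)]
            else:
                target = r
            upd[(u, v)] += ALPHA * (target - upd[(u, v)])
            u = v
    return q1, q2


adj = {"root": ["main"], "main": ["processCmd", "printf"],
       "processCmd": ["sendTCP", "sendUDP", "printf"],
       "sendTCP": ["printf"], "sendUDP": [], "printf": []}
q1, q2 = train(adj, "root", outdegree_reward)
```

Selecting the next action with one table and evaluating it with the other is what distinguishes double Q-learning from vanilla Q-learning and is what limits overestimation of action values.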
Algorithm 1 shows the pseudo-code of the training phase. After training, the learned policy is stored in the two Q-tables. Below is the algorithm used to extract the PSI-walk set from the trained Q-tables.
We use Algorithm 2 to extract the best PSI-walk set of a PSI-graph. While following the policy, at some point we take the second-best action instead of the best action. If the generated walk has a total reward greater than half of the maximum reward the agent can get, we append this walk to the final PSI-walk set.
As stated earlier, the agent does not travel along self-loops. Therefore, to reflect these self-loops, we duplicate the vertices that have self-loops in the final PSI-walks.
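The extraction step can be sketched as follows. This is one possible reading of the procedure, with several assumptions: `q` is a single dictionary holding the combined action values (for example the sum of the two Q-tables), the "maximum reward" is taken to be the total reward of the greedy walk, the deviation is tried once at every position of that walk, and `outdegree_reward` is a stand-in reward function.

```python
def outdegree_reward(adj, u, v, edge_visited):
    """Stand-in reward (assumption): at least 1, growing with the outdegree of v."""
    return 0 if edge_visited else max(1, len(adj[v]))


def greedy_walk(adj, q, start, reward_fn, deviate_at=None, max_steps=50):
    """Follow the greedy policy; optionally take the second-best action once."""
    walk, total, u, visited = [start], 0, start, set()
    for step in range(max_steps):
        actions = sorted({v for v in adj[u] if v != u},
                         key=lambda a: (-q.get((u, a), 0.0), a))
        if not actions:
            break
        if step == deviate_at:
            if len(actions) < 2:
                return None, 0  # no second-best action at this position
            v = actions[1]
        else:
            v = actions[0]
        total += reward_fn(adj, u, v, (u, v) in visited)
        visited.add((u, v))
        walk.append(v)
        u = v
    return walk, total


def expand_self_loops(adj, walk):
    """Duplicate every vertex that has a self-loop."""
    out = []
    for v in walk:
        out.append(v)
        if v in adj[v]:
            out.append(v)
    return out


def extract_walk_set(adj, q, root, reward_fn):
    best, best_total = greedy_walk(adj, q, root, reward_fn)
    walks = [best]
    for i in range(len(best) - 1):
        walk, total = greedy_walk(adj, q, root, reward_fn, deviate_at=i)
        if walk and walk not in walks and total > best_total / 2:
            walks.append(walk)
    # Drop the artificial root, then show self-loops by duplication.
    return [expand_self_loops(adj, w[1:]) for w in walks]


adj = {"root": ["main"], "main": ["processCmd"],
       "processCmd": ["TelnetScanner", "sendTCP"],
       "TelnetScanner": ["TelnetScanner", "send"],
       "sendTCP": ["send"], "send": []}
q = {("root", "main"): 9.0, ("main", "processCmd"): 8.0,
     ("processCmd", "TelnetScanner"): 5.0, ("processCmd", "sendTCP"): 4.0,
     ("TelnetScanner", "send"): 1.0, ("sendTCP", "send"): 1.0}
walk_set = extract_walk_set(adj, q, "root", outdegree_reward)
```

On this toy input the greedy walk captures the scanning path, and the single deviation at “processCmd” adds the flooding path, which mirrors why a set of walks can carry more information than the single best walk.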
After generating PSI-walk sets, we observe that many botnet PSI-graph samples have the same PSI-walk sets. This is because the PSI-graphs belong to botnet samples of the same malware families, so they have similar source code. Notably, some botnets are compiled for different instruction set architectures, so their PSI-graphs are similar too. Therefore, for the classification task, we remove duplicate PSI-walk sets. As previously noted, we only use PSI-walk sets with at least one exact vertex, so we also remove all fully encoded samples.
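A minimal sketch of this cleaning step is given below. How an encoded vertex is recognised is an assumption: the sketch treats auto-generated names of the form `sub_<hex address>` as encoded, and the pattern `ENCODED` as well as the helper names are illustrative only.

```python
import re

# Assumption: encoded or stripped functions appear under auto-generated
# names such as "sub_8048A10"; the pattern is illustrative only.
ENCODED = re.compile(r"^sub_[0-9A-Fa-f]+$")


def is_fully_encoded(walk_set):
    """True if no vertex in any walk carries an exact (readable) name."""
    return all(ENCODED.match(v) for walk in walk_set for v in walk)


def clean(samples):
    """Drop fully encoded samples and duplicate PSI-walk sets (first one kept)."""
    seen, kept = set(), []
    for walk_set in samples:
        key = frozenset(tuple(walk) for walk in walk_set)
        if is_fully_encoded(walk_set) or key in seen:
            continue
        seen.add(key)
        kept.append(walk_set)
    return kept


samples = [
    [["main", "processCmd"]],
    [["main", "processCmd"]],            # duplicate of the first sample
    [["sub_8048A10", "sub_8048B20"]],    # fully encoded
    [["main", "sub_8048A10"]],           # has at least one exact vertex
]
cleaned = clean(samples)
```

Using a `frozenset` of walks as the key makes two PSI-walk sets equal regardless of the order in which their walks are listed.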
After cleaning the data, our dataset shrinks. To have enough samples for the classification task, we propose a data augmentation algorithm to enrich the dataset. We consider each PSI-walk set as a document and each PSI-walk as a sentence. Hence, we apply a natural language processing-based data augmentation algorithm described in Algorithm 3 below. Of course, we only apply data augmentation to the training set.
Our proposed data augmentation algorithm implements three different techniques. First, each PSI-walk set is shuffled randomly. Next, if the PSI-walk set contains more than one PSI-walk, we randomly delete one PSI-walk from the set. Finally, we randomly choose half of the current PSI-walk set and concatenate it with a random half of the previous PSI-walk set. By applying this algorithm, we can make the augmented PSI-walk dataset three times bigger than the original one.
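The three techniques can be sketched as follows. This is an interpretation rather than the authors' Algorithm 3: it assumes every PSI-walk set is non-empty, that all input samples share the same label, that "previous" means the preceding sample in the list, and that each technique yields one new sample.

```python
import random


def augment(walk_sets, seed=0):
    """Return new samples derived from `walk_sets` (training data of one class)."""
    rng = random.Random(seed)
    new_samples = []
    previous = None
    for walk_set in walk_sets:
        # 1. Random scrambling: same walks, different order.
        shuffled = walk_set[:]
        rng.shuffle(shuffled)
        new_samples.append(shuffled)
        # 2. Random deletion: only when more than one walk is available.
        if len(walk_set) > 1:
            reduced = walk_set[:]
            del reduced[rng.randrange(len(reduced))]
            new_samples.append(reduced)
        # 3. Random insertion: half of this set plus half of the previous one.
        if previous is not None:
            half_cur = rng.sample(walk_set, max(1, len(walk_set) // 2))
            half_prev = rng.sample(previous, max(1, len(previous) // 2))
            new_samples.append(half_cur + half_prev)
        previous = walk_set
    return new_samples


sets = [[["a", "b"], ["a", "c"]], [["a", "d"]], [["e"], ["f"], ["g"], ["h"]]]
augmented = augment(sets)
```

The input list is never modified, so the original samples can be kept alongside the generated ones.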
Classification
As noted previously, we use a set of PSI-walks as representative features of an executable. A PSI-walk set is treated as a document whose sentences are PSI-walks. Therefore, we have to convert this text data to vector form. We use the doc2vec model, a well-known document embedding model in natural language processing, to vectorize the dataset. In particular, distributed memory doc2vec (PV-DM) is used to generate feature vectors for PSI-walk documents. The dimension of the feature vectors is 64.
Afterwards, the vector dataset is preprocessed with feature selection and scaling before being fed to the machine learning classifiers. A wrapper method that uses a linear support vector machine to evaluate each feature’s importance weight is applied to select the most important features. Next, we standardize each feature of the PSI-walk dataset so that it has a mean of zero and a standard deviation of 1. The formula is the following, where x is a feature value, μ is the mean of that feature, and σ is its standard deviation:
$$z = \frac{x - \mu}{\sigma}$$
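The scaling step (zero mean, unit standard deviation per feature) can be sketched with the Python standard library. The column-wise data layout and the handling of constant features are assumptions made for this example.

```python
from statistics import mean, pstdev


def standardize(columns):
    """Scale each feature column to zero mean and unit standard deviation."""
    scaled = []
    for col in columns:
        mu, sigma = mean(col), pstdev(col)
        # A constant column has sigma == 0; map it to zeros instead of dividing.
        scaled.append([(x - mu) / sigma if sigma else 0.0 for x in col])
    return scaled


features = [[1.0, 2.0, 3.0], [5.0, 5.0, 5.0]]
scaled = standardize(features)
print(scaled)
```

In practice the mean and standard deviation would be estimated on the training set and reused unchanged for the testing set.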
Finally, the dataset is fed to popular classical machine learning classifiers for the classification task, including Naïve Bayes (NB), Decision Tree (DT), k-Nearest Neighbor (kNN), Support Vector Machine (SVM) and Random Forest (RF). We apply 5-fold cross-validation and hyperparameter tuning with the parameter grid to the training set to achieve the best model of each machine learning algorithm. The final models are evaluated on the testing set, and the best model is chosen based on the testing result.
In this section, we evaluate the proposed method by experiment.
Experimental Design
Research Questions
In this section, we evaluate our approach by answering the following research questions (RQs):
RQ1: How effective is the proposed method on different datasets?
RQ2: How effective is the proposed method in multi-architecture IoT botnet detection?
RQ3: How effective is the proposed method in zero-day IoT botnet detection?
Datasets
In this paper, we evaluate the proposed method on four datasets: (1) 3955 IoT botnet samples from the IoTPOT [38] dataset, obtained between October 2016 and October 2017; (2) 3807 IoT botnet samples from the VirusShare [39] dataset; (3) 11998 IoT botnet samples from the University of Central Florida [40] dataset, obtained between January 2018 and late January 2019; and (4) 3089 benign samples collected from online repositories and IoT SOHO devices. The benign samples were collected from (https://pkgs.org/) and (https://github.com/azmoodeh/).
All the files in the four datasets are Executable and Linkable Format (ELF) binaries and were double-checked with VirusTotal [41] engines to confirm the true label of each sample.
We follow the method proposed by H. Trung et al. to generate PSI-graphs from ELF executables. Corrupted executables that cannot be disassembled by IDA Pro [42] are removed from the dataset. Hence, the generated PSI-graph dataset has fewer samples than the original executable dataset.
The PSI-graph dataset is fed to the reinforcement learning model to generate the PSI-walk dataset. Our final PSI-walk dataset has 17819 malicious samples (out of a total of 19760 malicious samples) and 2803 benign samples (out of a total of 3089 benign samples). To evaluate the ability to detect unseen malware, we split the malware dataset to obtain a small unknown malware dataset. This unknown malware dataset only contains samples of botnet families that appeared after Mirai. In short, we have three PSI-walk datasets: malware, benign, and unknown malware. The malware dataset contains samples from 6 different IoT botnet families: LightAidra/Aidra, Darlloz, Tsunami/Kaiten, Spike/Dofloo/MrBlack, Mirai, and Gafgyt/BASHLITE. The unknown malware dataset contains samples from 5 malware classes: Hajime, Brickbot, Reaper, Persirai, and other botnet families. Our samples come from multiple instruction set architectures, including ARM, MIPS, Intel 80386 (i386), x86-64, PowerPC, SPARC, Motorola 68000, and SuperH.
Implementation and Environment
In our experiments, the reinforcement learning model uses a learning rate α = 0.9 and a discount factor γ = 0.9. In the doc2vec model, distributed memory (PV-DM) is used. The model is trained with an initial learning rate α = 0.025 for 100 epochs, using negative sampling. The output feature vectors have a dimension of 64. These feature vectors are fed into machine learning classifiers that use class weights to handle the imbalanced dataset problem.
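To illustrate how class weights can offset the imbalance, the sketch below computes inverse-frequency weights. The weighting scheme is an assumption (the common "balanced" heuristic, weight = n_samples / (n_classes * class_count)), and the counts are the overall PSI-walk dataset sizes, used here only as an example rather than the exact training split.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * m) for c, m in counts.items()}


# Illustrative label vector built from the overall PSI-walk dataset sizes.
labels = ["malware"] * 17819 + ["benign"] * 2803
weights = balanced_class_weights(labels)
print(weights)
```

With these weights each class contributes the same total weight to the training loss, so errors on the rare benign class are penalised more heavily per sample.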
The experiment is conducted in a virtual machine with the following configuration: Ubuntu 16.04 64-bit, Intel Xeon CPU, 8 GB RAM. The source code is written in Python.
We also run the experiment on a more powerful machine to compare the prediction time of our model. Table 2 shows the processing time of the testing phase. The reinforcement learning model converges in an acceptable time that depends on the power of the machine’s CPU. With 6 cores, the PSI-walk generation time is approximately 3 seconds. Prediction time, which covers applying doc2vec, preprocessing, and classification by the machine learning model, is as short as 0.11 seconds on a 4-core machine.
Comparison of prediction time on different machines
The following terms are used to evaluate the precision and efficiency of the proposed method: Condition positive (P) is the number of malicious samples in the dataset. Condition negative (N) is the number of benign samples in the dataset. True positive (TP) is the number of malware samples correctly classified as malware. True negative (TN) is the number of benign samples correctly classified as benign. False positive (FP) is the number of benign samples incorrectly tagged as malware. False negative (FN) is the number of malware samples incorrectly tagged as benign.
The following metrics are used to evaluate the precision and efficiency of the proposed method: True positive rate (TPR), also called Sensitivity or Recall, is the number of malware samples correctly classified as malicious divided by the total number of malware samples, $TPR = TP / P$. This metric shows the probability of detecting malware samples.
False positive rate (FPR) or Fall-out is the number of benign samples falsely marked as malicious divided by the total number of benign samples, $FPR = FP / N$. This metric shows the probability of a false alarm.
Accuracy (ACC) is the ratio of the number of correctly classified samples to the total number of malware and benign samples, $ACC = (TP + TN) / (P + N)$. However, accuracy is not trustworthy on an imbalanced dataset.
Precision or positive predictive value (PPV) is the fraction of samples predicted as malware that are actually malware, $PPV = TP / (TP + FP)$.
F1-score is the harmonic mean of Precision and Recall (TPR). It is a combined metric that estimates overall model performance and is defined as follows: $F_1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$
Area Under the Curve (AUC): This metric is the area under the Receiver Operating Characteristic (ROC) curve and is used to evaluate a model. AUC is often used in binary classification problems with imbalanced datasets.
Accuracy is an appropriate metric when the dataset is balanced. Our dataset is imbalanced because the number of benign samples is small. Therefore, to evaluate performance on the imbalanced data, Recall, Precision and F1-score are used [43].
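The metrics above can be computed directly from confusion-matrix counts. The following is a minimal sketch using the standard definitions, with malware as the positive class. The function name `classification_metrics` and the example counts are hypothetical and are not results from our experiments. The sketch assumes every denominator is non-zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute TPR, FPR, ACC, PPV and F1 from confusion-matrix counts."""
    p = tp + fn  # condition positive: all malware samples
    n = tn + fp  # condition negative: all benign samples
    tpr = tp / p                      # recall / sensitivity
    fpr = fp / n                      # fall-out
    acc = (tp + tn) / (p + n)         # accuracy
    ppv = tp / (tp + fp)              # precision
    f1 = 2 * ppv * tpr / (ppv + tpr)  # harmonic mean of precision and recall
    return {"TPR": tpr, "FPR": fpr, "ACC": acc, "PPV": ppv, "F1": f1}


# Hypothetical imbalanced example: 100 malware samples, 50 benign samples.
metrics = classification_metrics(tp=90, tn=40, fp=10, fn=10)
```

In this illustrative example the accuracy is about 0.87 even though one in five benign samples raises a false alarm, which is why the rate-based metrics are reported alongside accuracy.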
In this section, we evaluate the proposed approach on each benchmark malware dataset and the benign dataset. Afterward, we merge all of the smaller datasets to form the full dataset. The classification results are shown in Table 3.
Evaluating effectiveness with various datasets
We observe that our proposed method achieves good classification results on all datasets, with high accuracy and a low false positive rate. The result is promising: although the dataset is imbalanced when we merge the large malware dataset with the small benign dataset, the proposed method achieves high ROC AUC scores of 97.18% and 97.23%, which means the model is not overly biased toward the majority class. Furthermore, the larger the dataset, the better the classification result, since the chance of overfitting and the generalization error decrease. Even with the smallest dataset, VirusShare, the proposed method still achieves an acceptable accuracy of 94.68%.
The full dataset yields the best classification result, with 98.03% accuracy, 97.23% ROC AUC and a low FPR of 3.99%. Table 4 and Fig. 6 show the detailed classification results of several machine learning algorithms on the full dataset.
Classification result on the full dataset

ROC curve on the full dataset.
We can see that not all machine learning algorithms perform well on the full dataset. K-Nearest Neighbors (kNN) and Support Vector Machine outperform the other algorithms. A likely reason is that doc2vec places the vectors of similar documents close together in the vector space. Therefore kNN, which assumes that similar things exist in close proximity, is well suited to classifying vectors in the doc2vec output space. The classification results show that the proposed method can achieve high precision and efficiency when paired with a suitable machine learning algorithm.
In short, our proposed method is effective on various datasets. Moreover, classification results improve when our method is applied to larger datasets. The result achieved on the full dataset shows that our proposed method performs reliably in IoT botnet detection.
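The proximity argument can be illustrated with a small nearest-neighbor vote over embedding vectors. This is a minimal sketch, not our implementation: the choice of cosine similarity, the value k = 3, the function names and the 3-dimensional toy vectors (the real feature vectors have 64 dimensions) are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def knn_predict(train_vectors, train_labels, query, k=3):
    """Return the majority label among the k training vectors most similar to query."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: cosine_similarity(pair[0], query),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy embeddings: similar documents lie close together.
train_vectors = [
    [0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.7, 0.3, 0.0],  # malware cluster
    [0.1, 0.9, 0.2], [0.0, 0.8, 0.3],                   # benign cluster
]
train_labels = ["malware", "malware", "malware", "benign", "benign"]
near_malware = knn_predict(train_vectors, train_labels, [0.85, 0.15, 0.05])
near_benign = knn_predict(train_vectors, train_labels, [0.05, 0.85, 0.25])
```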
We evaluate the proposed method on datasets of different architectures. As noted above, our full dataset (which includes four datasets) covers multiple CPU architectures such as ARM, MIPS (including MIPSEL, which is little-endian MIPS), Intel 80386 (i386), x86-64, PowerPC, SPARC, Motorola 68000 and SuperH. We conduct experiments by mixing these architectures to create train/test datasets. This experiment is called cross-architecture evaluation. However, because some less popular architectures do not have enough samples, we only conduct experiments with ARM, MIPS, i386 and PowerPC. We train our model on the dataset of one architecture and test it on the dataset of another architecture.
We observe that our model trained on the MIPS dataset performs reliably on other architectures. One reason is that many IoT botnet samples are simply the attackers' original source code recompiled for another architecture. Another reason may be that our proposed method is PSI-graph based, and the PSI-graph is extracted from information in the ELF file's source code rather than information tied to its CPU architecture.
This experiment is important because it shows that our proposed method is effective across multiple CPU architectures. Moreover, we do not need to train a separate model for each architecture; a single model can predict well across architectures, including architectures that do not appear in the training dataset.
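The pairing of training and test architectures can be sketched as follows. This is an illustrative outline under stated assumptions: the function `cross_architecture_splits`, the `(file name, architecture)` sample format and the file names are hypothetical, and the model training and scoring steps are omitted.

```python
from itertools import permutations

# The four architectures with enough samples, as listed in the text.
ARCHITECTURES = ["ARM", "MIPS", "i386", "PowerPC"]


def cross_architecture_splits(samples):
    """Yield (train_arch, test_arch, train_set, test_set) for every ordered pair."""
    by_arch = {arch: [] for arch in ARCHITECTURES}
    for name, arch in samples:
        if arch in by_arch:  # skip architectures with too few samples
            by_arch[arch].append(name)
    for train_arch, test_arch in permutations(ARCHITECTURES, 2):
        yield train_arch, test_arch, by_arch[train_arch], by_arch[test_arch]


# Hypothetical sample list; real samples would be ELF files with extracted features.
samples = [
    ("a.elf", "ARM"), ("b.elf", "MIPS"), ("c.elf", "i386"),
    ("d.elf", "PowerPC"), ("e.elf", "SPARC"),
]
splits = list(cross_architecture_splits(samples))
```

Four architectures give twelve ordered train/test pairs, and no pair tests a model on the architecture it was trained on.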
RQ3: How effective is the proposed method in zero-day IoT botnet detection?
This experiment evaluates our proposed method's ability to detect unseen IoT botnet families. As noted above, a small part of the malware dataset, containing IoT botnet samples that appeared after the Mirai botnet in 2016, is split off to create the unknown dataset. This dataset has 51 samples of Hajime (2016), BrickerBot (2017), Reaper (2017), Persirai (2017) and some less well-known botnet families.
The classification results show that our model correctly predicts 50 of 51 samples (98.04%). The only misclassified sample is a Ganiw botnet malware sample (first seen in June 2016) with the MD5 hash value 33bf49f9fdc1901a36a079f241a87ad0.
Although our unknown dataset is not new enough to be called zero-day, the result suggests that our proposed method can effectively detect newer IoT botnet malware. This is because many current botnet malware samples are variants of older malware or are developed from well-known IoT botnet malware such as Mirai. These newer samples may therefore share characteristics with the older ones that our proposed method has analyzed and been trained on. Conversely, our proposed method may not be able to detect completely new malware families. This limitation can be mitigated by collecting more IoT botnet malware samples and adding them to the training dataset.
Comparison with other approaches
In this section, we compare the performance of our proposed model with other approaches. At the time of writing, few studies on the application of reinforcement learning to IoT botnet detection have been published. Moreover, the related studies (see Section 2) do not publish the source code and datasets needed to conduct experiments and comparisons, so other reinforcement learning-based approaches are hard to reproduce and difficult to compare with our proposed method. Therefore, we compare the proposed method with other machine learning models, namely LSTM and CNN.
First, we train a Long Short-Term Memory (LSTM) model using a single sequence of the best PSI-walk [33]. Second, we train a Convolutional Neural Network (CNN) to detect IoT botnet malware based on gray-scale images of executables. The binaries are converted to gray-scale images and then classified by transfer learning with EfficientNet (a state-of-the-art neural network for image classification). The comparison result is shown in Table 6.
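The conversion of a binary to a gray-scale image can be sketched as below, where each byte becomes one pixel intensity. This is a minimal sketch under assumptions: the function `bytes_to_grayscale_rows`, the fixed row width and the zero padding of the last row are illustrative choices that the text does not specify, and the resizing and EfficientNet transfer learning steps are not shown.

```python
def bytes_to_grayscale_rows(data, width=16):
    """Map each byte to a pixel intensity (0-255) and wrap into rows of `width`."""
    pixels = list(data)  # iterating over bytes yields integers in 0..255
    remainder = len(pixels) % width
    if remainder:
        pixels.extend([0] * (width - remainder))  # zero-pad the last row (assumption)
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]


# Hypothetical 10-byte input starting with the ELF magic number.
sample = b"\x7fELF" + bytes(range(6))
image = bytes_to_grayscale_rows(sample, width=4)
```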
Performance on cross-architecture model
Comparison of the proposed model's performance with other approaches
In this paper, we have presented a novel PSI-walk based approach for IoT botnet detection. We propose a reinforcement learning model to extract PSI-walks from PSI-graphs and use PSI-walk sets as features for the classification task. We also present the data processing techniques used to achieve high classification performance. In summary, our main contributions are the following: (1) we present a novel PSI-walk set based feature for IoT botnet detection as a representative characteristic of PSI-graphs and ELF files; (2) we propose a reinforcement learning model to extract from each PSI-graph the PSI-walk set that represents the most significant behavioral functions of the executable; (3) we propose a data processing and classification pipeline that achieves good performance. The model achieves a best accuracy of 98.03% and is capable of cross-architecture and unseen malware prediction. In addition, the prediction time is relatively short. However, the approach still has some limitations. Because the approach is based on the PSI-graph, it cannot be applied to heavily obfuscated IoT botnet samples. The approach also relies on natural language processing of vertex names, so it can be fooled by adversarial attacks. Therefore, in future work, we will experiment with hybrid methods that may overcome these limitations of the current method.
Acknowledgments
This work has been supported by the CyberSecurity Lab, Posts and Telecommunications Institute of Technology, Hanoi, Vietnam and funded by the Ministry of Science and Technology, Vietnam grant number KC-4.0-05/19-25.
