Abstract
With the increased assimilation of technology into all aspects of everyday life, rootkits pose a credible threat to individuals, corporations, and governments. Using various techniques, rootkits can infect systems and remain undetected for extended periods of time. This threat necessitates the careful consideration of real-time detection solutions. Behavioral detection techniques allow for the identification of rootkits with no previously recorded signatures. This research examines a variety of machine learning algorithms, including Nearest Neighbor, Decision Trees, Neural Networks, and Support Vector Machines, and proposes a behavioral detection method based on low yield CPU power consumption. The method is evaluated on Windows 7, Windows 10, Ubuntu Desktop, and Ubuntu Server operating systems along with employing four different rootkits. Relevant features within the data are calculated and the overall best performing algorithms are identified. A nested neural network is then applied that enables highly accurate data classification. Our results present a viable method of rootkit detection that can operate in real-time with minimal computational and space complexity.
Introduction
Rootkits are malicious programs that acquire root privileges on a computer. Hooking techniques allow rootkits to hide within a system without detection from traditional security mechanisms [6]. As defined by the Microsoft developers network, a hook is “a point in the system message-handling mechanism where an application can install a subroutine to monitor the message traffic in the system and process certain types of messages before they reach the target window procedure” [38]. There are, generally, two reasons for deploying a rootkit. The first is establishing remote command and control of a system and the second is eavesdropping on a system [28]. Over time rootkits have evolved, and are still used in modern attacks. In 2015 it was revealed the Chinese cybercriminal group Winnti was using rootkit technology in their attacks [45]. In 2016, Check Point Mobile Threat Prevention identified the mobile rootkit HummingBad, which could install more than 50,000 fraudulent apps each day, and display 20 million malicious advertisements [9]. According to Kaspersky Security, in 2016 the third most prominent banking malware family (Trojan.Win32.Neurevt) employed rootkit technology [56]. Rootkits are also employed in advanced persistent threat scenarios, as demonstrated by Stuxnet. Stuxnet used a Windows rootkit, the first ever PLC rootkit, process injection, and code hooking to damage centrifuges at the Natanz uranium enrichment plant in Iran [20,60]. When coupling this information with the increasing integration of technology into the automotive world [8], the aviation industry [42], critical infrastructure and legal environments [4,7,40,62], it stresses the need to identify methods that will discover rootkits before they can cause damage.
Methods proposed for identifying rootkits generally fall into three categories [55]. The first category is signature based detection, which looks for previously recorded byte patterns of known rootkits [55]. While this method is very effective, it cannot detect rootkits that have not yet been identified and recorded. The second category is integrity detection which involves periodically checking the system for unauthorized changes to file systems and operating system components [55]. The last category is behavioral detection and the focus area for this research. Behavioral detection involves identifying the presence of a rootkit by collecting data on the behavior of the system under normal conditions, and then uses the data as a baseline to identify deviations [10]. This process typically involves the use of machine learning or statistical models. In fact, a recent article noted that machine learning is the primary method for almost all non-signature based detection techniques, including network intrusion, spam identification, fraudulent activity, and malicious activity in general [39].
The contributions of this research are as follows. First We demonstrate the performance of a rootkit detection solution using power measurements from the host CPU. The experimental results indicate that this method is on par or supersedes the current work in several aspects, including the accuracy of the model and the data requirements. The algorithm is effective using data measured at low frequency random time frames with only information on the initial, minimum, maximum, intervals, and average power measurement during that time frame. Further, we demonstrate that this method is capable of detecting both user and kernel level rootkits during normal operating conditions and under stress tests. Our method is tested on four different operating systems: Ubuntu Server, Ubuntu Desktop, Windows 7, and Windows 10. The data is analyzed and results are reported on a variety of different machine learning algorithms, including decision trees, nearest neighbor, support vector machines, neural networks, regression, and ensemble methods. Lastly, a model is created using nested neural networks and discrete representations of projected data that outperforms other methods. On average, this model achieved 9.4% higher area under the curve (AUC) and 17.2% higher accuracy than the traditional machine learning algorithms tested.
The remainder of the paper is structured as follows. Section 2 presents related work in rootkit detection and learning algorithms. Section 3 discusses the algorithms used in our analysis and Section 4 describes our data collection environment, as well as any preprocessing techniques used on the data once collected. Section 5 presents an analysis of the results and Section 6 discusses the implications and future work. The last section of the paper draws conclusions.
Related work
Machine learning techniques have become common place in many applications, including malware detection. The following subsections describe related work in rootkit detection given specific features.
System call behavior
Das et al. [13] proposed an online malware detection method which they named GuardOL. GuardOL uses system call patterns to classify malware. The frequency-centralized model considers how often critical system calls are used and builds features by clustering the calls based on a set of rules that capture the behavior of malicious code. Based on these features, a multilayer perceptron is trained and used to identify malicious activity at runtime. The data set used contained 472 samples, of which 56 were rootkits. Their model, which was implemented on an FPGA, could detect 46% of malware within 30% of its execution time, and 97% of malware all total.
Dini et al. [18] designed a method for detecting malware on smartphones they called Multi-level Anomaly Detector for Android Malware (MADAM). Their behavior based method intercepts critical system calls at the kernel level and records the number of occurrences of the calls over a period of time. Their one nearest neighbor classifier could detect 100% of the rootkits tested.
Hernandez et al. [27] note cyber-attacks pose a credible threat, citing Stuxnet, which used rootkit technology. They evaluated a method of event detection based on reconstructing the phase-space formed by serialized system call timing data. Their algorithm evaluates the dissimilarity among phase-space graphs over time and successfully indicated anomalous activity.
In our recent work [37], we also evaluated system call timing data for rootkit detection. Our analysis tested two neural network architectures: feed forward and recurrent. The recurrent neural network could correctly classify 97% of system calls as either infected or not infected with a rootkit (KBeast). This accuracy was obtained using only one feature (system call time) and no preprocessing of the data. In another work [14] we evaluated the same system calls using phase-space analysis.
API behavior
Pirscoveanu et al. [50] note that attackers have a strong incentive to increase the complexity of malicious code to improve obfuscation and decrease the likelihood of being detected by anti-virus software. They suggest a method of detection based on dynamic analysis of system behavior. The behaviors used in their research included DNS requests, accessed files, mutexes, registry keys and Windows API’s. Using a Random Forest classifier, they could detect trojan malware at .98, adware at .95, and rootkits at .97 areas under the curve (AUC). They also note their classifier had a higher number of false negatives when classifying rootkits as opposed to other forms of malware.
Alazab et al. [1] compared eight classifiers for detecting rootkits based on API features. They tested the performance of Naive Bayes, K-Nearest Neighbor, Sequential Minimal Optimization with a Normalized PolyKernel, Sequential Minimal Optimization with a PolyKernel, Sequential Minimal Optimization with Puk, Sequential Minimal Optimization with a Radial Basis, Backpropagation Neural Networks, and J48 decision trees. Their results indicate Support Vector Machines with a normalized PolyKernel performed best (98.5% true rate), and the Neural Network using backpropagation performed the poorest.
Hooking behavior
Ramani et al. [52] proposed rootkit detection based on hooking. They evaluated nine classifiers to identify which performed the best. Of the nine classifiers, the most accurate was one proposed by Ramani and Jacob called the Correlation Bayes Algorithm. The correlation Bayes Algorithm yielded a classification accuracy of 87.4% using 10-fold cross validation. A Random Committee algorithm came in second with an accuracy of 86.2%, and third was Logistic Regression with 85.1%.
Lobo et al. [36] suggested a method for rootkit detection called Rootkit Behavioral Analysis and Classification System (RBACS). This method analyzes Windows process affected by rootkit hooking techniques and the API functions that were hooked. Their analysis used 78 rootkit samples obtained from the Offensive Computing website [46] that produced 11,159 inline function hooks. Using the Expectation Maximization (EM) and 10-Fold Cross Validation algorithms, their method could categorize the different rootkits into one of five categories based on labels assigned by antivirus software.
Other characteristics
Demme et al. [17] define malware as any software designed with the intention of compromising a system or the privacy of an individual or business. Their research uses behavioral characteristics from performance counters and machine learning classifiers such as K-Nearest Neighbor and Decision Trees to identify the presence of Android malware and Linux rootkits. Their method, which uses hardware modifications to support dynamic analysis, yielded an accuracy of nearly 90%.
Arslan et al. [3] performed a thorough review of machine learning methods used in the detection of mobile attacks, including rootkits. They cited over 20 published works that employed machine learning classifiers, including Support Vector Machines, Artificial Neural Networks, Random Forrest, Naive Bayes, K Nearest Neighbors, and others to identify the presence of a threat. They note these methods are most advantageous when employed against malicious code that has not yet been identified and has no recorded signature. They also identified three phases of malware detection using machine learning: feature extraction, feature selection, and classification.
Dubey et al. [19] proposed a three-step learning algorithm using K Means clustering, Bayesian classifiers, and Neural Networks with backpropagation training as a method for intrusion detection systems. Of the attacks they studied, some of them included User-To-Root attacks (U2R), where an attacker exploits vulnerabilities within a system to gain root access, which is similar to a rootkit. Using data acquired from the 1998 DARPA Intrusion Detection Evaluation program, they could correctly classify 98% of rootkit style attacks. Lakshmi et al. [35] also investigated the application of machine learning to intrusion detection systems based on User-To-Root attacks. Their analysis used a subset of the KDD CUP 99 data set, which contained 52 User-To-Root attacks, each with 41 attributes. After various preprocessing methods were employed, such as normalization and Principal Component Analysis, the data was tested on several different classifiers. Of the classifiers tested, the best results came from Rule Induction, Decision trees, and Naive Bayes.
Power characteristics
Shropshire [54] used statistical analysis of power measurements to identify hypervisors that had been compromised. He notes that while attackers can hide unauthorized activity by modifying specific components within a system, they cannot manipulate external measures of energy consumption. The proposed system, named PowerCheck, measures power consumed by a server using an external sensor, as well as the hypervisor’s reports of CPU load, network activity, and memory use. It then identifies anomalous activity by comparing predicted consumption against the actual usage. The model could detect small discrepancies in energy consumption and had minimal false positives.
In our recent [15] work we utilized power supply voltage measurements collected using a multimeter and current clamp, then extract time-serial system dynamics through the application of a phase-space algorithm. Examination of phase-space graph features between nominal and infected data during stress test on an Ubuntu Server OS were used to classify the state of the system. Our results indicate that the algorithm could successfully detect a rootkit through power measurement analysis. The current manuscript extends this work by evaluating different operating systems, different rootkits, different hardware, and a variety of learning algorithms.
Algorithms
The learning algorithms used in this research can be broken down into seven categories: tree based methods, regression, support vector machines, nearest neighbor, discriminant analysis, neural networks, and ensemble classifiers. A general description of these methods is provided in the following subsections. During training, all algorithms used 5-Fold cross validation, which is a method used to estimate how well a predictive model generalizes to different data sets [33]. This section provides a general background for readers not familiar with the above algorithms. Readers familiar with statistical and machine learning prediction and classification algorithms may skip this section.
Decision trees
Decision trees are hierarchical models that split data into homogenous groups based on recursively applied rules [51]. The training objective is to find a set of decision rules that can be used to predict a class label. Our research uses the Gini-Simpson index (1) of diversity [24]:
Regression
Logistic regression is a special case of regression where the dependent variable is dichotomous. There are different methods of logistic regression for classification, including ordinary least square, weighted least square and maximum likelihood estimations [2]. Our analysis focuses on maximum likelihood estimation, which attempts to find a value for an unknown parameter to maximize a function. In this case the function corresponds to the likelihood function, which is the function produced by reversing the roles of parameters in the Probability Density Function (PDF). For binary classification
Support vector machines
Support Vector Machines, or SVMs, are a technique for classification and outlier detection. SVMs seek to find an optimal hyperplane to separate classes by maximizing the margin between the closest points among classes [41]. For a feature vector
Nearest neighbor
The nearest neighbor classification method is one of the most well-known techniques in machine learning. Despite its simplicity, it is one of the most widely used algorithms in many aspects of data mining [34]. Data is divided into samples and is represented by an attribute vector. This vector contains features that describe the sample [25]. Classification is achieved based on the similarity or distance function used in the algorithm. The accuracy of the K-Nearest Neighbor algorithm is strongly dependent on the choice of similarity or distance function [58]. In this research, we evaluated 3 distance functions: Euclidean (9), Minkowski (10), and Cosine (11).
Discriminant analysis
Discriminant analysis seeks to determine which variables or features discriminate between two or more groups. It generally assumes the observations in each class are generated by a probability distribution specific to said class [22]. Our analysis considers two types of discriminant analysis: linear and quadratic. Linear discriminant analysis assumes a multivariate normal distribution and is used when the variance–covariance matrix does not depend on the sampled population. Classification rules are based on a Linear Score Function of the population means for each population. Quadratic discriminant analysis follows a similar method, but is used for heterogeneous variance–covariance matrices [22]. The score functions for Linear (12) and Quadratic (13) Discriminant analysis are as follows:
Ensemble classifiers
Ensemble classification methods evaluated in our analysis include Boosted Trees and Bagged Trees. The method of Bagged Trees takes small bootstrap samples, which are data sets consisting of randomly drawn instances of the original data set [24], and builds a decision tree. Then, the process is repeated several times, building many different trees from many different sets of data. Classification is based on the average predictions from all trees. Boosting originated from the AdaBoost algorithm [23]. Boosting uses a collection of weak classifiers to yield one strong classifier [16]. The ensemble tree methods have a maximum branching factor of 20 and use 100 learners.
Neural networks
Neural networks have been widely used for function approximation, classification, clustering, and dynamic system modeling [61]. There are generally three aspects of a neural network: the architecture, training method, and transfer function. Network architectures include feed forward, cascade, recurrent, and many others. Some examples of training methods are scaled conjugate gradient descent, Levenberg–Marquardt, and Bayesian regularization backpropagation. There are also many types of transfer functions. Of those, the most common are the sigmoid functions. The general layout for a sigmoid transfer function is:
Methodology
The following subsections describe the equipment, hardware, and software used in our analysis. It also describes the steps involved in data collection.
Data collection equipment
Power data was collected and recorded using a Fluke 289 Multimeter with a Fluke i30 AC/DC current clamp. The multimeter records changes in power readings given a certain threshold. The default threshold of the Multimeter is 4% and was not adjusted.
Software
Our analysis focused on three operating systems. The first operating system was an Ubuntu 14.04 Server. Data was collected from the operating system prior to infection and after infection. The rootkit used on the Ubuntu operating system was KBeast [29]. KBeast is a Loadable Kernel Module rootkit capable of hiding itself, files, directories, and processes. It is also able to log key strokes and prevent removal of files and deletion of itself. KBeast is compiled using the setup script and then loaded into kernel space.
The second operating system was Windows 10 Pro. The rootkit used on the Windows 10 operating system was FUTo [46]. According to the authors, FUTo can manipulate the PspCidTable without using any function calls, and uses Direct Kernel Object Manipulation (DKOM) to hide certain objects. The FUTo file came with a prebuilt executable, and the system was infected by running the executable as an administrator.
The third operating system was Ubuntu Desktop version 16.04.1. Ubuntu Desktop was infected with the Azazel rootkit [46]. Azazel is a user level rootkit capable of hiding files, directories, processes, creating backdoors, and avoiding detection. Azazel was compiled using the scripts provided at Packet Storm [47].
The final operating system used for out of sample testing (Data Collection 3) was Windows 7 Professional with Service Pack 1. The rootkit was a version of Windows NT rootkit found at [48].
Data analysis was performed using Matlab version R2016A. Specific tool boxes used were the Neural Network Toolbox, Classification Toolbox, and the Statistics and Machine Learning Toolbox.
Computer hardware

Collection environment for power analysis.
Figure 1 depicts the collection environment for the power analysis. The current clamps were placed around the two yellow power cords going from the motherboard to the CPU. The first computer used in our analysis was a Dell Optiplex 755 with an Intel Core 2 Duo CPU. The hard drive running the Ubuntu 14.04 Server operating system was a Western Digital WD1600AAJS with 160 gigabytes of memory. The hard drive running Windows 10 and Ubuntu Desktop was a western digital WD2500LPLX with 250 gigabytes of memory. Hard drives were erased using a Diskology Disk Jockey Pro Forensic Edition. The Disk Jockey was also used to copy the clean version of Ubuntu Server back to the hard drive. A clean version of both Windows 10 Pro and Ubuntu Desktop were reinstalled using a bootable image stored on a SanDisk Ultra USB 3.0.
For out of sample testing (Data Collection 3), we used a Dell Optiplex 790 with an Intel Core i3-2130 CPU with 3.40 GHz. The hard drive used was a Western Digital WD2500LPLX 250GB.
This section describes the different data sets collected for analysis. For clarity, the data sets are separated and labeled. Each data set is composed of eight collections. First, two collections are performed on the uninfected system. Then the system is infected with the appropriate rootkit and two more collections are performed. Then the hard drives are erased, the appropriate operating system is reinstalled, and the process is repeated.
Data collection 1
Data collection 1 was collected using the Fluke 289 Multimeter and Fluke i30 AC/DC current clamp. Again, the data instances were vectors of the following features: initial reading above/below threshold, duration of reading, minimum reading in the interval, maximum reading in the interval, average reading in the interval. The number of instances produced by a single collection is based on the number of times the values go above or below the threshold. This data collection process followed the following steps:
Place the power clamp around the CPU power cords; Start recording on the Multimeter; Power on the Computer and log-in; Create a directory in C; Create a text file in the new directory; Open the text file and type “This is a test for rootkit detection!”; Save the file; Return to the C directory; Delete the created directory; Power off the computer and stop the multimeter from recording.
This process was performed on the Windows 10, Ubuntu server, and Ubuntu desktop operating systems. This resulted in three data sets from collection type 1. After logging on-to the computer, we waited until the multimeter showed little variability before conducting the above steps. This was to ensure that the initial start-up programs did not contaminate the readings. On average, the recording would begin within 60 to 120 seconds after logging in.
Data collection 2
Rather than a manual process, a stress tested was performed on the computer while the multimeter was collecting data. This data collection process followed the following steps:
Place the power clamp around the CPU power cords; Power on the Computer and log in; Begin recording on the multimeter; Begin the stress test; After a given time frame stop the stress test and multimeter; Power off the computer.
The stress test lasted approximately five minutes. The stress test used on the Windows 10 operating system was HeavyLoad [30], and the stress test used on the Ubuntu operating systems was Stress-ng [32]. This process was performed on the Windows 10 and Ubuntu desktop operating systems. For this collection we were only concerned with the behavior of the system during the stress test. Therefore, the multimeter was not activated until after the boot sequence and at the same time the stress test was initiated. This is because the boot and shut down sequence would be approximately the same as the collection without the stress test (collection 1). This resulted in two data sets from collection type 2. Together, collection types 1 and 2 produced five data sets. The details of the collections are described in Table 1. We note the actual data set sizes may slightly vary due to some samples being unusable.
Description of individual data collections
Description of individual data collections
Data collection 3 was utilized for out of sample testing. This data collection was the only collection which used the Windows 7 OS on the Dell Optiplex 790. The following procedure was followed when collecting the data:
Place the power clamp around the CPU power cords; Start recording on the Multimeter; Power on the Computer and log-in; Open Notepad and type for 5 minutes, then close Notepad; Open MS Paint and draw/type for 5 minutes, then close; Open calculator and perform random calculations for 5 minutes, then close; Open Control Panel and navigate through various files for 5 minutes, then close; Open MS Paint and draw/type for 5 minutes, then close; Open Notepad and type for 5 minutes, then close; Open Word Pad and type for 5 minutes, then close; Open MS Paint and draw/type for 5 minutes, then close; Open Control Panel and navigate through various files for 5 minutes, then close; Open Word Pad and type for 5 minutes, then close; Power off the computer and stop the multimeter from recording.
A single clean and a single infected sample were gathered. The recordings lasted approximately 1 hour.
Labeling and normalization
Each data instance (feature vector) was assigned a class label of 1 for uninfected data and
Much of the time and computational complexity involved in some machine learning algorithms, including those used in our analysis, is in the preprocessing of data. This process can become exponentially more difficult in real time and ‘Big Data’ scenarios. Therefore, one emphasis of this research is to identify features and algorithms that need as little preprocessing as possible. To this end, the only preprocessing performed was standardization and normalization using the following formulas:
Descriptive statistics of clean versus infected average power measurements on Windows 10 during stress test
Descriptive statistics of clean versus infected average power measurements on Windows 10 during stress test
A critical aspect of machine learning is estimating the importance and need of specific features. Including irrelevant features can, at best, slow training time and, at worst, reduce accuracy. In order to evaluate the features used in our analysis, we employed the RelieFF algorithm, which evaluates and weights the importance of features using a K Nearest Neighbor method [53]. In our case, we chose K to be 10. Figure 2 shows the results of RelieFF on a Windows 10 data set during a stress test. The fifth feature (minimum value during the interval) is assigned the highest weight, and is, therefore, considered the most important. The remaining features are all within a small range of importance. Because all weights were greater than zero, we choose to include all features in our analysis. The remaining data sets followed similar patterns.

Evaluation of predictor importance. 1 = Initial Reading, 2 = Interval Duration, 3 = Maximum, 4 = Average, 5 = Minimum.
The following subsections describe the in sample and out of sample testing performed on the data.
In sample test
Table 3 presents the results of the analysis of data collection 1 from an Ubuntu Server operating system. The best performing algorithms were the neural networks and tree methods, and of the tree methods, the best were the ensemble tree methods. The nearest neighbor using Euclidean distance also performed well.
Analysis of Ubuntu server
Analysis of Ubuntu server
Analysis of Windows 10
Table 4 illustrates the results of the analysis of data collection 2 from the Windows 10 operating system. Like the Ubuntu Server, the neural network, tree based methods, and ensemble tree methods performed the best. The support vector machines with linear and quadratic kernels also performed well, although the quadratic SVM required significantly more training time. The nearest neighbor algorithm using Euclidean distance also performed well.
Table 5 shows the results of the analysis on the Windows 10 operating system using the HeavyLoad stress test (data collection 2). The stress test was given one of the two cores and run as an administrator to allow reads and writes to the C drive. Again, the neural network and tree based methods performed the best. The support vector machine with a Gaussian kernel also performed well, but required considerably more training time. The nearest neighbor methods performed well, with the Euclidean method performing best.
Analysis of Windows 10 with stress test
Table 6 list the results of the analysis on Ubuntu Desktop for data collection 1. The neural network, boosted and bagged trees, and complex tree performed the best, with an average accuracy of approximately 82%. The remaining two tree-based methods performed reasonably well, as well as the nearest neighbor method with Euclidean distance. Surprisingly, all methods apart from linear and quadratic discriminant analysis and the support vector machine with a cubic kernel performed well on the Ubuntu Desktop using the stress test (data collection 2), and the results are listed in Table 7.
Analysis of Ubuntu desktop
Analysis of Ubuntu desktop with stress test

Difference in clean and infected data.
Figure 3(a) is an example of the average power measurement reading per interval in one collection observation on Windows 10 Pro during a stress test after the previously discussed preprocessing techniques used in the analysis. The rootkit influenced the boot and shutdown procedures as well as caused a higher average reading throughout the stress test.
Figure 3(b) shows the duration of the random time intervals for clean and infected Ubuntu Desktop. The readings for the infected operating system were higher than those of the uninfected system. The average interval for the uninfected reading was approximately 4.3, and 6.2 for the infected reading. The largest variation occurred during the boot up and shutdown procedure.
To identify the best performing algorithms for each test, we calculate an average score based on the order of performance of accuracy, AUC, and training time. As an example, given two algorithms, algorithm one has the best accuracy, the second-best AUC, and the second best training time. Therefore, the first algorithm would be
Intuitively, discriminant analysis methods did not perform well because they are somewhat dependent on data being approximately normal and covariance homogeneity [21]. We performed the Kolmogorov–Smirnov test for the null hypothesis that the data comes from a standard normal distribution and found said hypothesis to be rejected at a 5% significance level, indicating the data was not normally distributed. Decision trees and neural networks overcome these limitations by making no assumptions about the data. The intuition behind the nearest neighbor methods performing reasonably well is based on the clustering of the data. Figure 4 depicts the clustering of the average power values based on interval duration. The tight clusters with few outliers allow nearest neighbor methods to perform well.
While the accuracy and AUC of the five best performing algorithms showed promising results, it was anticipated that a method yielding an accuracy greater than 95% would be found. After the initial test were conducted, we attempted to improve the results using Principal Component Analysis (PCA). However, after PCA was performed on the data, there was no improvement in any of the models. We sought to represent the data in a way that would facilitate an increase in performance for any learning algorithm, as well as a learning algorithm that would be best suited for the specific data sets. The requirements sought from such a method were low computational costs to facilitate speed and low memory requirements.

Clustering of data.
A commonly used method for finding separability in data is by projecting the data into a higher dimension. This is based on Cover’s Theorem [12], which states that given a set of data which is not linearly separable, one can transform it into a set that is linearly separable with high probability by projecting it into a higher dimensional space via a non-linear transformation provided the space is not densely populated. Kernel methods do this implicitly using specific functions. However, they depend on a kernel function that is generally not adaptable [5]. Still, it is also possible to explicitly project data into higher dimensions.
To project the data into higher dimensions, we employ the Self Organizing Map. As previously mentioned, SOMs are a type of unsupervised neural network that clusters data by projecting the input on-to a 2-dimensional grid based on similarity and topology [57]. The SOM algorithm was performed on each of our data sets using all five features. The SOM topology was a four-by-four hexagonal grid. The resulting output for each input vector was a 16-dimension vector containing a single one and 15 zeros. The one corresponds to the cluster the input was assigned to. This could be used to project the data into a higher dimension. By adding the vectors produced by the SOM to the already existing vectors, we would go from five dimensions to 21 dimensions. However, there are several issues with such a strategy. When projecting data into a higher dimension, it is possible to project too far, resulting in a decline in performance. Data that is separable in dimension D may not be separable in dimensions greater than D. Performance can also decline due to the curse of dimensionality and the cost of training the algorithms would be increased with the increase in dimensions. To overcome these issues, we treat the 16-dimension vector as a single binary number. We then translate the binary representations into the base 10 decimal representation. This yields a single unique value that represents a 16-dimension projection. The data was then normalized and feature scaled in the same manner as the original data.
To identify an optimal model for the data, we evaluated the existing results. The best performing learning algorithm, over all, was the neural network and the second best was the ensemble tree methods. Therefore, the algorithm we chose to use on the data to improve the performance was a type of hybrid ensemble neural network that we call a nested neural network. Nested neural networks consist of an ensemble of neural networks each of which is dedicated to learning a specific feature. The output of these networks is fed into a larger neural network that makes the final decision based on the results of the smaller networks. This is inspired by biological studies showing that the human brain functions not as a single massive network, but as a group of small networks each dedicated to a specific task. There was a total of five smaller networks, each dedicated to its own feature. These neural networks were recurrent, meaning they contained feedback loops from the previous response into the current input. The output of these networks formed a binary vector. In the same manner as the SOM, the binary vector was converted to the base 10 decimal representation.

Proposed ensemble nested network.
The nested network is depicted in Fig. 5. The first layer is the original normalized data. This data is then fed into the second layer, which contains the smaller specialized neural networks and the SOM. The SOM received all the features of each vector, while the individual networks received only a single feature. Then, the two binary vectors are converted to the base 10 representation and fed into the final neural network.
The recurrent networks were trained by taking all training data sets and combining them into one large data set. Then, to improve generalization of the network, we generated extra data points by adding or subtracting random Gaussian noise to the existing points. The random Gaussian noise was in the range of plus or minus one tenth of the standard deviation. This created 10 extra points for every existing point in the original data. We also rounded the data points to two decimal places. The final neural network contained a single hidden layer with 20 neurons containing normalized radial base transfer functions. Table 8 depicts the results of the top 5 algorithms and the results of the proposed model for all five data sets. In all cases, the proposed model raised the accuracy to nearly 100%. On average, this model achieved 9.4% higher AUC and 17.2% higher accuracy than the traditional machine learning algorithms tested.
Average performance of top 5 algorithms and proposed model
One important note is the test and training sets contained both clean and infected data, and there were approximately the same number of clean and infected samples (see Table 1). In a real-world scenario, the classifiers would be trained prior to use, and data would be collected and tested at the discretion of the user. Therefore, test sets would contain only clean data or only infected data. We performed tests to simulate such an environment using the proposed model. The algorithm was trained on both clean and infected data and tested on either clean or infected data. The algorithm made a prediction of either 1 or
To further evaluate the effectiveness and real world applications, we performed an analysis using the five top algorithms and the proposed model on two extra data sets. The first data set consisted of the three data sets under normal conditions combined into one and the second data set consisted of the two data sets using the stress test combined into one. The goal of this analysis was to see if the proposed model could not only differentiate between clean and infected data, but also data originating from different operating systems and from different hardware. Among the top five algorithms the average accuracy was 87% for the combined data set under normal conditions and, for the combined stress test data, an average accuracy of 85% was achieved. Our proposed model achieved 97% and 96% respectively.
A primary concern of any learning algorithm is overfitting. Overfitting occurs when an algorithm learns a specific training data set too precisely. This often leads to poor performance when new data is introduced. For the previous test discussed, several methods were employed to overcome overfitting, including cross validation on the classification models, hold out validation on the neural network models, rounding data points to lower decimal places, and generating data with random noise added. Still, validation on completely new data is need to identify how well the algorithm generalizes. To that end we performed one final test to evaluate the proposed model. Several aspects needed to be addressed to evaluate how well the model generalized, including 1) a different rootkit, 2) a different operating system, 3) different hardware, 4) different use case.
The nested network was trained and validated on the data from Windows 10 and Ubuntu desktop, and tested on the data from collection 3. A total of 322 data points were collected. Of those, 216 came from the infected operating system, and 106 came from the clean operating system. Of the 216 infected data points, the algorithm correctly identified 201 as infected, corresponding to a true positive rate of 93% and a false negative rate of 7%. Of the 106 clean data points, the algorithm correctly identified 83 as clean, corresponding to a true negative rate of 78% and a false positive rate of 22%. The overall accuracy of the model was 88%.
These results reflect several promising features of our proposed model. First, the algorithm can correctly classify anomalous data from an unknown rootkit. Because new malware is generated every day, the ability to generalize and identify unknown malware is required. Second, the accuracy of the algorithm is platform independent in both hardware and software, which would be required for any large scale or commercial use. Lastly, the algorithm performed well under a different use case than previously tested. This is necessary to accommodate systems with multiple users. In all, these results show that CPU power is both a reliable feature and able to generalize to many different situations.
Discussion and future work
While we have performed a number of successful test, the question still remains if this method could function in a real-world setting. Given the number of possible applications that could be running on a computer at any given time, modeling and simulating every possibility is not possible. In this research we intentionally analyzed data during boot, normal use, and shut down as a proof of concept. However, in a real-world setting, the data would likely be gathered during boot before other applications were loaded, and therefore not contaminate the data. This was the method used in our previous works [14,15,37], which were successful. We are confident this method would work given the difference in the data we observed at boot (see Fig. 3a and b). An alternative would be to run a specific script during known down times and gather data then. Another possible application could be to monitor supervisory control and data acquisition (SCADA) systems. These systems have minimal third-party applications, do not update to different software often, and only do specific task which are already known. In any case, it would be necessary to gather data on the system and adapt the already trained learning algorithm to fit the specific system better in order to achieve accuracy
Future work will include further testing and tuning of various machine learning algorithms to attempt to achieve higher accuracy. Many current machine learning algorithms, such as deep learning, do not require preprocessing of data. This could alleviate the need for normalization, and therefore increase the effectiveness of real time monitoring. However, these methods often increase the amount of training time required.
One issue not addressed in this research is adaptive adversaries. Adaptive adversaries purposely manipulate data to compromise machine learning algorithms [11]. The algorithms can be attacked during training of the model parameters or during testing when classifying samples [26]. The mitigation of these adversarial inputs remains an open problem. A primary strength of using CPU power data features is that the data collection takes place at a base-level making it difficult to implement adaptive adversary techniques. In addition, this research implements a methodology that captures data via a side channel analysis that operates independently from the system being monitored to mitigate adversarial activities further. Often, the success of an attack depends on available resources [49]. For example, attack success can be dependent on system access and algorithm knowledge. A benefit of the side channel approach implemented in this research is that it can be run independently and offline from the monitored system. Hence, the attacker would not have access to I/O pairs. The success of the attack is also dependent on the attacker’s knowledge of the algorithm. The results of this research demonstrate that the proposed method works on many different learning algorithms. Thus, using multiple models consecutively, or alternating models at random would decrease the likelihood of an attacker’s success. Another approach is to design the models themselves to be robust to attacks, which recent research has proven successful [59]. Future research will investigate effective and efficient solutions that mitigate adaptive adversarial techniques along with the viability of other solutions like artificial intelligence concepts.
Another option would be to use methods that are time dependent, such as recurrent neural networks and autoregressive models to evaluate the entire data set at once. We also intended on evaluating other stress test, such as cross platform test that work on most operating systems.
Another interesting aspect would be to consider adding additional features to the data to attempt to improve performance. These could include other side channels, such as temperature fluctuation, power measurements from other parts of the computer, or sensor arrays measuring many aspects of the computer simultaneously. While other measures may slightly improve performance, it is likely the improvement would be menial. It has been shown that computers are complex nonlinear dynamical systems [44]. It has also been shown [43] that in practice, there is no advantage to adding more features when evaluating complex systems. This is assuming one chooses a single signal that can truly represent the state of the whole system, which we believe is captured by the CPU. A better solution would be to combine this method with signature based methods capable of identifying known rootkits and other malware, which will be evaluated in future work.
In this research we focus on rootkits as part of evaluating an initial hypothesis on whether power side channels can be used to detect malicious software execution. Our original hypothesis was that rootkits, those that use system call hooking or alter normal system resource interfaces, by nature include more code than normal system calls or system access routines. Rootkits typically execute the original functionality of system/kernel level functions and then add additional code which is designed to hide their presence or provide exfiltration or system alteration. Additional code, theoretically, would translate to more machine code instructions, more CPU cycles, and thus more CPU power. General malware has a larger variety of functionality and potential usage and encompasses a much larger corpus of potential samples. Our future and next work in using this technique will be evaluation of general malware families. An external physical device will be implemented that can monitor the CPU to detect the presence of malware in near real time, and this device will be tested on real-world production servers for general evaluation.
Conclusion
This research presents a method of rootkit detection based on CPU power consumption. By measuring the power consumed by a CPU, we can detect the presence of malicious rootkits with high accuracy. The data sets used in the analysis were small enough to allow for fast training while not sacrificing reliability. Further, the data instances produced in the analysis were quite vague, providing little more than a minimum, maximum, average, and interval duration. Despite these restrictions, this method showed promising results. The results of the research demonstrate that this method can detect rootkits on four different operating systems, namely Windows 7, Windows 10, Ubuntu Desktop, and Ubuntu Server. It also shows the effectiveness of the proposed algorithm on different hardware, as well as four different rootkits, namely FUTo, KBeast, NT, and Azazel. This method can detect both user level and kernel level rootkits. The research results also demonstrate the detection capability on several different machine learning algorithms. Of those, the best performing algorithms were neural networks, decision trees, ensemble methods, and nearest neighbor methods. Furthermore, the capability of this method was demonstrated on two different tests, first under normal operating conditions and then during a stress tests. Lastly, a model inspired by biological studies of the human brain, called a nested neural network, could classify all data sets at 99% accuracy.
