Abstract
In recent years the number of new malware threats has increased significantly, causing a damage of billions of dollars globally. To counter this aggressive malware attack, the anti-malware industry needs to be able to correctly classify malware in order to provide defense against them. Consequently, malware classification has been an active area of research, and a multitude of malware classification approaches have been proposed in the literature. This paper evaluates two methods of sequence classification based on Hidden Markov Model, namely the maximum likelihood and similarity-based methods, for classification of malware using a large and comprehensive dataset. System calls generated by known malware during execution are used as observation sequences to train the Hidden Markov Models. Malware samples are evaluated against the trained models to produce similarity vectors, which are used in the maximum likelihood and similarity-based classification schemes to predict the family for an unknown malware sample. Comparison of the two schemes shows that combining the powerful statistical pattern analysis capability of Hidden Markov Models and discriminative classifiers in the similarity-based method results in a significantly better classification performance as compared to the maximum likelihood approach. Furthermore, evaluation of different classifiers in the similarity-based method demonstrates that Random Forest classifier performs better than other classifiers on malware similarity vectors.
Introduction
The malware problem is getting out of control, despite methods and solutions being continuously provided by cyber security related research community and industry. A report1 states the number of new malware variants created in the year 2014 to be approximately 317 million, which amounts to 870,000 new malware appearing every day. Consequently, global losses due to malware were anticipated to be around $400 billion in 20142, which is another indication of the severity of the malware problem. It is evident from this data that the malware developers devise methods to counter and evade the solutions provided by the computer security community for detecting and mitigating malware attacks, and hence there is a continuous arms race between malware developers and the security community. The research presented in this paper is an attempt to be a part of this fight against malware.
The most efficient malware detection method is the signature-based method [6], in which a suspicious program file is searched for code patterns which are specific to known malware. These patterns, also termed as signatures, are identified by malware analysts and are frequently updated to client computers to keep the signature database current and effective against known threats.
A vast number of new malware are obfuscated variants of existing malware [7], and show similar behavior traits despite having different code. Such variants developed from the same base are said to belong to the same malware family. The developers of anti-malware products, thus, need to identify families of malware so that signatures can be efficiently generated on family basis. Also, every new malware sample needs to be analyzed in order to determine if it belongs to a known family or is representative of a new family. These problems belong to the field of data mining, and therefore much of the research efforts have focused on the application of machine learning techniques to solve these problems such as classification [20, 22], clustering [4, 21], Hidden Markov Models (HMMs) [1, 27], etc.
The malware analysis, detection and classification schemes proposed in the literature make use of certain features or attributes, specific to computer programs, which can be used to determine if two given programs are similar to each other. Research on malware analysis has made use of two broad types of such features, namely static and dynamic features. Static features are those which can be extracted directly from a program binary, such as raw binary code or operational codes (opcodes) obtained by disassembling the raw code. Malware detection schemes based on these kinds of features could be evaded by the malware developers by using obfuscation techniques, and therefore dynamic features were suggested to differentiate between two programs on the basis of their behavior. Dynamic features are extracted by executing a program in a controlled environment and observing its actions in terms of system calls, API calls, etc. Research has shown that dynamic features are more robust against malware obfuscation as compared to static features [15].
This paper is an extension of previous work by the same authors [9, 10] in which two HMM-based methods of malware classification, namely the maximum likelihood and similarity-based classification respectively, were proposed using dynamic program features. The aforementioned works reported the initial findings of experiments on a small dataset of malware samples. This paper contributes to the knowledge by: conducting a rigorous validation of maximum likelihood and similarity-based malware classification techniques on a large and more comprehensive malware dataset, performing a comparative analysis of the two methods’ performance on the extended data, evaluating the similarity-based method using different discriminative classifiers.
The methodology for evaluating the malware classification schemes consists of two stages. First, HMMs are trained for each malware family using system calls as observation sequence, and the malware samples are evaluated against these models to generate what are termed as the score vectors. In the second stage, these score vectors are used in the maximum likelihood and similarity-based classification techniques to assign a family to an unknown malware sample.
This paper proceeds as follows. A brief introduction to the Hidden Markov Models is provided next while Section 3 reviews the literature on the application of machine learning techniques previously proposed in the domain of malware detection and classification. The methodology for the proposed research is presented in Section 4. Results of the experiments are discussed in Section 5, and Section 6 concludes the paper.
Hidden Markov Model
Hidden Markov Model is a statistical analysis method widely used in pattern matching applications such as speech recognition [17], behavior modeling [13], protein sequencing [12], and malware analysis [1, 27], etc. A simple Markov Model represents a stochastic system as a non-deterministic state machine, in which the transitions between states are governed by probabilities. When the system is in a given state, the next state can only be probabilistically determined. A Hidden Markov Model is a more specific version of the simple model in which the states are not visible. Instead, the progress of the process is observed through certain symbols which are emitted in each state. Again, the emission of symbols from states is a stochastic function in the sense that probabilities associated with each state reflect which symbol is more likely to be emitted by the process in a given state. Finally, there is a third set of probabilities representing the initial state distribution of the model [18].
Formally, an HMM λ with N states and M symbols is represented by (A, B, π), where A is a N×N matrix that represents the transition probabilities from each state to every other state B is a N×M matrix in which the symbol emission probabilities for each state are stored π is a 1 ×N matrix that determines the initial state probabilities
Hidden Markov Models have three basic problems associated with them: Given a model λ and an observation sequence O, how to calculate the probability that O is generated by λ? Given a model λ and an observation sequence O, how to determine the most optimal sequence of state transitions that produced O? Given an observation sequence O, number of states N and number of observation symbols M, how to estimate the model parameters (A, B, π) which maximize the probability of O?
In order to use HMM for solving a given problem, the problem must first be translated into one or more of the three HMM problems stated above. Well-established algorithms exist for solving the HMM problems, therefore the problem at hand can be conveniently solved using these algorithms.
It should be noted that problem 2 is not of much significance for the malware detection problem, since it does not play any part corresponding to training or testing phases of machine learning. Nevertheless, it can be helpful in getting an insight into the inherent malware behavior represented by the given sequence of observations. For the purpose of malware analysis and detection, HMM problem 3 and problem 1 need to be solved. First, the model parameters are estimated using observation sequences for known malware. This step is analogous to training in machine learning terms. For training an HMM, an observation sequence O is required along with π, A and B matrices with usually randomized probabilities. HMM model generation is performed using Baum-Welch algorithm [18] which iteratively re-estimates the probabilities in the three matrices until an acceptable estimation has been achieved.
In evaluation or testing step, the forward-backward algorithm is used to determine the probability of a given sequence being generated by a given model [18]. The output of the forward-backward algorithm is a likelihood value; an observation sequence closely resembling the sequences used for training the HMM will have a higher likelihood as compared to a sequence which has a marginal or no resemblance with the training sequences.
Related work
A significant body of research has been contributed towards malware analysis and detection since the dawn of this century. Various approaches including machine learning, graph theory, finite state machines, etc., have been explored in order to find efficient methods of analyzing and detecting malware. Since the bulk of such efforts have employed machine learning and the work presented in this paper also makes use of the same, therefore a review of some representative techniques employing machine learning methods is presented here.
In one of the pioneering works on application of supervised machine learning in the area of malware analysis and detection, Schultz et al. [22] used multiple static attributes as features in various classifiers including RIPPER, Naïve Bayes and Multi-Naïve Bayes. Their experiment showed a clear advantage of machine learning-based technique over the signature-based methods, and paved the way for future research in this domain. Rieck et al. used Support Vector Machine (SVM) for malware classification using dynamic features [20]. In their proposed scheme runtime behavior of malware, in terms of Application Programming Interface (API) calls, is recorded by executing it in a sandbox. The features extracted from the behavior profiles for each malware family are used for training the SVMs, which are then used for classifying unknown malware. Islam et al. [11] proposed a combination of static and dynamic features for malware classification. The features included the lengths of functions, the count of printable strings found within the binaries, and the frequency of API calls. The authors claimed that the hybrid approach is better than using the individual features.
Among the previously proposed approaches based on unsupervised learning, Ye et al. put forward the idea of using an “ensemble” of multiple clustering algorithms on static data for malware analysis [28]. Rieck et al. [21] used a combination of supervised and unsupervised machine learning techniques. They utilized system calls, generated by executing malware binaries in CWSandbox [26] and converted to MIST [24] format. In their method clustering is performed on behavioral profiles of unknown samples to identify malware clusters, and then new unknown binaries are classified into the identified clusters using the nearest prototype classifier. Incremental analysis is performed through an interleaved execution of classification and clustering steps.
Arguably the first application of the Hidden Markov Model (HMM) to the cyber-security domain can be attributed to Warrender et al. [25] who compared HMM with three other models for intrusion detection purposes. All models were tested on the same dataset, which consisted of system call logs for certain programs to represent “normal” system behavior. The study showed that the HMM performed better than other methods in terms of accuracy, but with high computational costs. In the malware detection domain, Wong and Stomp [27] trained HMMs over observation symbols composed of opcode (short for operational code) sequences for detecting metamorphic viruses. Their experiment proved that the HMM could be effectively used for metamorphic malware detection.
Borrowing the idea of sequence alignment from the bioinformatics domain, Attaluri et al. used Profile HMM (PHMM) to detect metamorphic viruses [2]. PHMMs are a special kind of HMMs which take into account the relative position of symbols within the observation sequences when scoring an unknown sample. In another PHMM-based approach proposed by Ravi et al. [19], system calls have been used as observation sequences. The authors have performed malware classification using the maximum likelihood approach on similarity scores on a smaller malware dataset. The major contribution is claimed to be clustering of malware samples based on similarity scores.
Austin et al. experimented with HMMs in order to gain an insight into the semantics of hidden states [3]. By varying the number of hidden states in the HMMs, the authors identified the most significant opcodes and the associated probabilities. Furthermore, they proposed a technique to perform malware detection using two sets of HMMs, one for benign programs and the other for known malware. Given an observation sequence for an unknown sample, the HMM which yielded the greatest probability of generating the given observation determined the nature of the sample. Recently, Annachhatre et al. have trained HMM over opcode sequences of programs generated by different compilers and virus generators [1]. The scores given to samples against the various HMMs are used as distance metric in k-means clustering algorithm.
An inherent drawback of hidden Markov models is the long time required for training. This problem has been addressed in the literature, and solutions to this problem range from use of optimization techniques to efficient utilization of hardware resources. One algorithmic approach for speeding up model training is the Particle Swarm Optimization (PSO) which has been applied for parameter optimization in different kinds of models such as Muskingum model [23] and HMM [16]. Another method of speeding up computations is to use Graphical Processing Units (GPUs), which have been shown to be effective in tasks such as matrix multiplication through use of parallelization [14]. Recently, Yu et al. [29] have demonstrated the usefulness of GPUs in reducing training and testing time for HMM in a speech recognition application.
Methodology
The experiments reported in this paper involve training of an HMM over observation sequences consisting of system calls. System calls have been extensively used, in different variations, by the research community for modeling the behavior of executables [25]. For the purposes of the current research, a simple sequence of system calls has been used for training the HMMs.
The dataset
For performing the experiments, the dataset has been borrowed from the Malheur project [21]. The dataset includes system call logs for numerous malware recorded through CWSandbox [26]. The system call logs are provided in various formats but for the sake of this research, MIST format [24] has been used. The MIST format represents system calls in multiple degrees of detail using a simple text layout, and therefore allows easy extraction of system call logs for a specific granularity.
The MIST format categorizes the 120 monitored system calls into 20 categories according to the higher level operation being performed. System calls are represented by category ID, followed by the operation ID and arguments, if any. The current experiment ignores the system call arguments and focuses on the higher level category and the individual function call only. Furthermore, to simplify the representation of the (category, call) pairs as an observation sequence, each system call is represented as a number from 1 to 120. This is done by using the information provided in [24] about the number of system calls in each category. The observation sequence, thus, becomes a sequence of numbers in the range from 1 to 120.
The full Malheur dataset contains system call logs for over 32,000 samples belonging to more than 400 malware families. These samples include 6,003 files believed to be benign, labeled as NOTHINGFOUND. For the presented work only those malware families have been selected which have more than 100 samples. This criterion was set in order to meet the requirement of sufficient number of samples for training the HMMs. Thirty seven such families were found (including the benign samples) which fulfilled this criterion, with the total number of samples being in excess of 29,000. Some of the malware families had too many samples therefore only 400 samples from each of those families were taken in order to make the dataset balanced. After the data reduction process there were 8,828 malware samples left. In addition, 300 benign samples grouped under NOTHINGFOUND class were also included in the dataset. Details of the selected malware families are provided in Table 1. The table shows, for each malware family, the number of samples (observation sequences), the maximum, minimum and average number of observations, and the cumulative length of the observation sequences used in the experiments.
For a comprehensive evaluation of the proposed methodology, 5-fold cross validation was performed in the HMM training phase. For each family, five HMMs were trained such that for each HMM a different set comprising 80% of the behavioral reports for that family was used. During the testing phase, all the malware samples were evaluated against the five HMMs of each family. In this way, the proposed scheme was evaluated for three types of samples: those belonging to the same family and used in training (called known family samples), those belonging to the same family but not used in training the HMMS (termed as unknown family samples), and those belonging to other malware families (referred to as non-family samples). Furthermore, benign samples were also subjected to evaluation against the malware HMMs in order to assess the performance of the HMM-based classification techniques for those samples which are not malware.
Training the HMMs
HMM model generation and testing was done in Matlab using HMM Toolbox3. For training the HMMs all the malware families were used except the class representing benign files (NOTHINGFOUND). For training an HMM for a particular family, 80% of the observation sequences belonging to that family were fed sequentially to the Matlab routine which iteratively re-estimated the model parameters. The maximum number of iterations was set to 100, though in most cases the training stopped earlier because of successive log likelihood values being closer than the predefined threshold. The number of hidden states was set to 10 as a median between too few (2-6 used by [27]) and too many (20, the number of categories of system calls used as observations). The number of observations was fixed to 120, corresponding to the number of unique system calls in the data.
Evaluating the samples
In the testing step, each of the 8,828 samples was evaluated against all the trained HMMs. As discussed earlier, testing a sample sequence against an HMM returns a likelihood value (also termed as score) which determines how closely the sequence matches with the sequences that were used to train the HMM. The Matlab routine used for this experiment returns the log of likelihood values. Thus, testing a single sample against all the models produced a 36-long vector of log likelihood scores (subsequently termed as score vector), and testing all the samples resulted in 8,828 such vectors representing the scores of all samples against all models. Since the likelihood scores fall in the range of 0 to 1, therefore the log values in the vectors were all negative, including -∞ for cases where a sample had no resemblance with the modeled malware family.
Classifying the malware samples
Bicego et al. [5] have identified two ways of using the HMM-generated score vectors to classify sequences, namely maximum likelihood and similarity-based methods, which are used for the purpose of malware classification in this work. Details of the two classification methods are describednext.
Maximum likelihood classification
In this method of classification, a sample sequence is assigned a class label against whose model this sample obtains the highest likelihood score. In the case of malware classification, an unknown program that generates a system call log sequence s, is considered as belonging to the ith malware family if the log likelihood score of s against the model λi∈N is the highest among all N models. This can be represented as:
Similarity-based classification
The similarity-based classification scheme treats the set of score vectors obtained by evaluating sequences against HMMs as a feature space, termed as the similarity space by Bicego et al. [5], which can then be used for training a discriminative classifier. The so trained classifier would then be able to classify an unknown sequence, given its similarity vector (the score vector), into one of the classes the classifier has been trained on. The rationale behind the name “similarity-based” is that the likelihood score obtained by evaluating a given sequence against an HMM represents its similarity with the sequence(s) used for modeling the HMM.
The basic similarity-based approach for sequence classification trains HMMs for individual sequences. Thus, if this method was adopted for the current experiments then a total of 8,828 HMMs would have to be trained. Since in this experiment the HMMs were trained on class basis, therefore the similarity-based approach could not be applied as such. Instead, a modified form of this approach was employed, also suggested by Bicego et al., in which the class-wise likelihood score vectors are used as feature vectors to train the discriminative classifier. Weka [8] was used to train the Random Forest classifier over the labeled class-wise score vectors. Random Forest classifier was chosen for this experiment because in empirical studies on small dataset this classifier had shown better performance than Decision Tree, SVM, etc. 10-fold cross validation was performed for evaluating the effectiveness of the classifier on the score vectors for each of the five sets of data. Since Weka does not recognize the symbol –∞ returned by Matlab routine, therefore all instances of –∞ in the score vectors were replaced by the number –1000000, representing a very low likelihood, before using the score vectors in Weka.
The case of benign programs
Although the primary focus of this paper was classification between malware families, the schemes under evaluation were also judged for their capability to handles benign files, i.e., normal program files known to be clean from any infection. For this purpose, the behavioral reports of 300 benign files grouped under NOTHINGFOUND family were evaluated against the 36 HMMs. The so obtained score vectors for benign samples were then subjected to both the maximum likelihood and similarity-based classification as described next.
As suggested in [9], a benign program’s behavior is not expected to match any of the malware behaviors. Consequently, the score vector of a benign program should ideally consist of only –∞ values. The maximum likelihood classification rule can, then, be modified to accommodate the benign samples as under:
In the case of similarity-based classification, the score vectors for benign samples need to be tested against the classifier trained on the malware samples. It should be noted that the classifier trained on malware samples will only attempt to classify an unknown sample into a known malware family. In order to enable the classifier to identify benign samples, is should be trained on score vector(s) representing benign samples. As discussed above, the ideal score vector for a benign sample should only contain –∞. Therefore, the Random Forest classifier was trained on an additional score vector containing all values of –1000000, labeled NOTHINGFOUND.
Evaluation parameters
Precision and recall are two widely used parameters for evaluating the performance of classification methods. Simply put, precision determines how many of the items labeled as the class C actually belong to C, and recall specifies how many items of a class C are correctly classified as C. The formulas for the two parameters are given as:
To simplify the comparison of performance of different classification methods, precision and recall are usually combined into one parameter, called F-measure, defined as
An F-measure of 1 for a classification method means that the method was able to perfectly identify all the samples of each class. A low F-measure value, on the other hand, is result of either or both of the precision and recall being low, signifying the fact that certain items were misclassified by the classification method.
The malware classification schemes discussed in this paper were evaluated using F-measure.
Results and discussion
Maximum likelihood classification
In contrast with the initial study [10] on use of HMM-based maximum likelihood scheme for malware classification, the current experiment on a larger collection of malware samples did not yield satisfactory results. Although for more than one third of the malware families the obtained F-measure was 0.99 or close, yet the overall average was 0.62.
One high scoring family was ALLAPLE with an F-measure of 0.996. The scatter plot in Fig. 1 shows the log likelihood scores obtained by ALLAPLE samples against ALLAPLE, EJIK and VIRUT family models. The higher the likelihood value for a sample against a model the greater the probability that the sample resembles the model. All values in the top portion of the scatter plot represent the likelihood scores of ALLAPLE samples against ALLAPLE model. For the other two models, the likelihood scores fall below, thereby making the classification decision clearly in favor of ALLAPLE family. This scatter diagram shows the likelihood scores of ALLAPLE samples against only three malware families; against all other models, these samples scored –∞, showing no resemblance with those families. The clear separation between the three data series is an indication that ALLAPLE family’s behavior was consistent against all the models.
Some malware families with low F-measure were BIFROSE (0.10), BUZUS (0.11), PATCHED (0.12), and FRAUDLOAD (0.13). At first glance it looks counter-intuitive that a sample which has been used to train a particular HMM scores less than an unknown sample when tested against the model. Upon a closer inspection of some wrongly classified samples, such as in BIFROSE family, it was found that the error in misclassification was actually not indicative of a problem in the classification process. Instead, sometimes a significant portion of the behavioral profile of a sample can match that of a malware from another family.
The scatter plot of Fig. 2 shows the likelihood scores for BIFROSE samples against BIFROSE and REFROSO family models. From the figure it is clear that the scores obtained by the BIFROSE samples against the two models fall within the same range, and therefore a correct classification decision is not possible on the basis of the maximum likelihood rule. As a result, many samples of BIFROSE family were classified as REFROSO by the maximum likelihood classification method. Interestingly, an online threat analysis report4 verifies that different anti-virus software may identify a malware sample as belonging to BIFROSE or REFROSO families. This explains one reason for the misclassified samples while using the maximum likelihood approach on the score vectors.
The maximum likelihood classification rule failed completely when applied to the benign samples. Out of the 300 benign samples, none was correctly identified as benign; all these samples were labeled as different malware families. This was because every benign program showed some behavioral similarity with one or more of malware families, and no benign sample was able to score –∞ against all the 36 models representing malware families in order to be identified as benign.
Similarity-based classification
The similarity-based classification approach produced much better results as shown in Fig. 3 which plots the family-wise F-measure for the two classification methods. For roughly one-thirds of the malware families the F-measure remained well above 0.9, while the average F-measure across all the families was 0.88. Particular attention may be paid to the families BIFROSE, BUZUS and FRAUDLOAD for which the F-measure for similarity-based classification increased more than five times as compared to the maximum likelihood method. Other families such as RBOT and REFROSO also showed remarkable improvement as their F-measure values almost quadrupled for the similarity-based scheme. The average F-measure values for the two schemes are provided in Table 2.
The confusion matrices for the two classification schemes are shown in Fig. 4a and 4b respectively. A comparison of the two figures shows the supremacy of the similarity-based classification method. Quite a few gray boxes, representing the mis-classified samples, are seen on either sides of the diagonal in Fig. 4a.The column for REFROSO family is particularly noticeable, signifying the number of samples from other families classified as REFROSO. In comparison, the matrix in Fig. 4b is mostly clear, asserting the fact that the Random Forest classifier was able to classify a majority of the samples correctly. The two dark boxes depict the samples of TEXEL and PATCHED families which the classifier was not able to label correctly. A malware analysis report5 shows that different anti-virus scanners may also label a given malware sample as a TEXEL or PATCHED family malware, confirming the finding by the similarity-based classification that the two families share some common behavior.
Two different approaches were taken in order to evaluate the similarity-based classification method for benign programs. In the first approach, a feature vector consisting of –1000000 in all the 36 places was labeled as NOTHINGFOUND, and this feature vector was added to the set of feature vectors representing malware samples. Random Forest classifier was trained on the modified set of feature vectors. The trained classifier was then evaluated on the 300 feature vectors for benign programs. The classifier was not able to identify any benign sample as benign; instead, these samples were all assigned labels from different malware families. This was due to the fact that the training instance for the NOTHINGFOUND class represented the ideal case that a benign program should not have any resemblance with any malware family. On the contrary, the actual set of feature vectors did not have any suchinstances.
To overcome this limitation, a new experiment was performed by including the feature vectors for 300 benign samples in the training dataset for the Random Forest classifier. In this way, the classifier would be required to train on malware as well as benign samples, and therefore would be able to perform a better classification of the benign samples. Using this approach, 10-fold cross validation was performed and the classifier yielded an average F-measure of 0.86. The F-measure of 0.68 for the benign samples was poor as compared to the overall average, because of the fact that benign samples do not have much behavior in common and therefore such diverse behaviors cannot be precisely captured by the classifier into a single model.
Further experiments were conducted in order to investigate which classifier is best suited for the similarity-based classification. For this purpose, four more classifiers including J48 Decision Tree, SVM, Naïve Bayes and k-NN, were assessed for their prospective use on malware score vectors. As Fig. 5 conveys, Random Forest proved to be the most suitable classification algorithm for use with the malware score vectors in the similarity-based classification method.
Conclusions
This paper evaluated and compared two HMM-based approaches for performing malware classification using dynamic program features. Experiments on a dataset containing behavioral reports of 8,828 malware samples demonstrated that the similarity-based malware classification method had a clear edge over the maximum likelihood classification scheme. It is deduced that the classification decision should not be made on the basis of closest match of a malware sample’s behavior with a single malware family; the pattern of the sample’s similarity with all the malware families must be taken into account for deciding the family of an unknown malware sample. Once such similarity patterns are available for known malware samples, a discriminative classifier trained over these patterns, or vectors, can effectively classify an unknown malware sample into known malware families. It was also observed that Random Forest is a better choice of a classifier to be used with the similarity vectors in the similarity-based scheme as compared to other classifiers such as J48, SVM, Naïve Bayes, and k-NN.
When tested for 300 benign programs, the maximum likelihood classification method was not able to classify even a single benign program correctly. The similarity-based scheme again performed better but there is still a lot of room for improvement since its F-measure for benign programs was only 0.68. This aspect of HMM-based malware classification needs further attention, and is, therefore, the focus of future research. In addition, this research can also be extended to investigate the performance of the similarity-based method using other dynamic features as well as hybrid feature set. Reduction of model training and testing time by using algorithmic and hardware optimization techniques is another potential direction for extension of the presented work.
