Abstract
Recent research in computer security has primarily been qualitative, focusing on detection after malware has already infected the target. A key challenge for security vendors is developing better techniques for detecting malware attacks early, before they can do damage. This calls for malware prediction models that can describe and predict malware generation processes. In this study, we address the feasibility of quantitative characterization of malware in security assessment, and we propose a model that aims to explain the mechanism behind malware generation, which contributes to estimating malware discovery. Although several malware modeling systems have been proposed, such models have shortcomings related to historical data and do not treat malware data as a time series. Using time series analysis, we provide predictive neural network models for five datasets from Symantec and Malwr. The models explore the structure of malware data and leverage both linear and non-linear properties to predict the number of future malware samples. Our examination also reveals that it is possible to model the malware discovery process using a neural-network-based non-linear model. In addition, our analysis provides insights into the mechanisms that generate malware data series. This information can be useful for intelligence services and vital to threat assessment.
Introduction
Risk assessment is a first step in developing a computing system in a company. It helps to minimize the resources and budget spent on security. The assessment consists of three factors, namely threats, vulnerabilities and impact, which are combined as a Cartesian product in various security risk assessment methodologies [23, 24]. Threat assessment is the first step in risk management. It considers the spectrum of threats that can harm computing components such as networks, software and users. Malware is a common threat that can steal, alter or cause the loss of data. For instance, Rocra, Code-Red, and Slammer are a few well-known malware threats that caused significant financial losses costing billions of US dollars [30]. An attacker can combine several malware families to carry out an attack, deploying a set of new malware types that complement each other. For example, the attacker lures a user to a website that includes a JavaScript downloader. The downloader fetches other files to install a Trojan or Backdoor on the victim’s computer, which then steals the user’s information. This scenario illustrates a pattern that many new malware samples follow.
However, based on historical data on the frequency of occurrence of given malware, we can not only manage the malware threat but also infer the malware generation process. This method is applied in existing vulnerability discovery models for operating systems [2, 3] and browsers [27, 31]. These days, the combination of malware and vulnerabilities has a large impact on industry, since malware takes advantage of vulnerabilities to spread its malicious behavior. For example, WannaCry exploited the CVE-2017-0144 vulnerability in the Windows operating system to encrypt user data and extort money. Thus, along with vulnerability measurement, malware assessment also needs to foresee future malware variants and their generation mechanism. For example, the number of WannaCry variants increased after it had been discovered.
By understanding the mechanism behind malware generation, we can develop guidelines for allocating resources to the malware discovery process. Furthermore, combining vulnerability and malware assessment can reduce redundancy in the resources and procedures used to handle potential breaches caused by malware attacks. When used correctly, insights into malware can help defenders predict attacks and provide indicators of the actions taken during every stage of an attack. In practice, anti-virus vendors release weekly predictions of malware trends. However, most of these are based on the vendors’ experience in cyber security. We therefore need a reliable and systematic approach to providing threat intelligence.
Studying malware trends from a quantitative perspective requires models that capture the repeated behavior of the series representing malware data over its development. Therefore, a time series model can be an efficient solution in these situations [8]. In this study, we propose a time series model to address the problem of forecasting the number of malware samples appearing in the wild. Since autocorrelation exists in a malware series, time series methods allow us not only to predict malware trends but also to understand the mechanisms behind the generation of the series.
Time series forecasting has been dominated by linear methods for decades. However, such methods are not able to capture nonlinear relationships in data, and consequently do not give accurate predictions for such data [22]. A neural network is well suited to explaining non-linear data since it not only has an inherently nonlinear transfer function structure, but can also capture linear structures [34].
In our proposed solution, we train a feed-forward neural network that takes the number of malware samples in the past as input and predicts the number of malware samples at a given time in the future. We found that the non-linear properties of a neural network can explain the relationship between data at different times. To uncover this relationship, we exploit the internal structure of data points collected over time, which tend to exhibit autocorrelation in time series observations [34]. By successfully discovering this relationship, our approach can help to predict the number of new malware threats and of various malware types such as Downloader, Trojan and Backdoor.
Overall, the contributions of this paper are as follows:
We propose an efficient method for predicting the number of future malware samples using a neural network that can capture both linear and non-linear structures in data. Our approach gives insights into the mechanisms that generate malware data series, such as generalizability across different types of malware evolution. To our knowledge, this is the first study to examine malware development using time series methods. Security services could utilize our findings to forecast future malware, allocate their resources more efficiently than at present, and measure the impact level of malware for threat assessment or early malware detection.
The rest of this manuscript is organized into several major sections. Section 2 reviews the literature regarding related work in the field of malware development. In Section 3, we explain the necessary background in time series modeling, review neural network models, and describe the particular architecture we chose to use for our method. Section 4 analyzes the collected data, proposes malware prediction models, and examines the fit and predictive capabilities of the models. We give the results of our empirical evaluation in Section 5. Discussion and limitations are presented in Section 6. The conclusion is given in Section 7.
Related work
There has been extensive research on building prediction models from malware datasets [29]. However, we found that only a few works in the literature have studied datasets collected by public malware intelligence services such as Malwr, the Global Intelligent Network by Symantec, or Anubis to predict future malware. In [6], Bayer et al. described common malware behaviors based on the activities of binary samples. They extracted Windows API calls and system services and tracked data flows in the network traffic for each submission to the Anubis public sandbox. Based on these artifacts, they provided statistics of the samples in the Anubis repository, such as network and registry activity.
Under the assumption that suspicious submissions to malware intelligence services are likely related to malware, Graziano et al. [13] examined submissions to the Anubis sandbox using static and dynamic features, including file, timestamp, binary, user-based and behavioral features. Their approach thereby contributes to detecting the malware development process.
The work closest to our approach considers the historical characteristics of detected malware attacks to predict the number of malware infections across different malware and countries. In [17], Kang et al. analyzed the percentage of hosts in a given population. They provided a set of domain-based features related to the ability of hosts to detect malware and vulnerabilities, using a host-malware bipartite graph and a bi-fix-point algorithm. In contrast, we consider the historical characteristics of malware detected on public sites as time series data.
By considering malware evolution as a phylogeny, Hayes et al. [14] applied an approach taken from the field of biology to malware. They also considered historical malware in terms of artificial generators. Malware variant relationships are a kind of software evolution; for example, a malware variant is a member of a malware family whose members share some software characteristics. In our study, we do not focus on the relationship between different species of malware, but rather try to understand malware development through time series correlations.
In [15], Jang et al. studied the inference of software evolution by dissecting program binaries, under the assumption that variants of a malware family reuse some functions from the first malware program. The authors combined static and dynamic features with linear n-grams and normalization to build a model that recovers the software lineage. Addressing the evolution of malware on mobile devices, Ki-Hyeon Kim et al. [18] proposed a technique for detecting malicious code on Android devices using an autoregressive moving average model with current data. The idea is to extract 3 categories of features (memory, CPU and network) from the Linux kernel every 10 seconds and to model them as a multivariate time series for malware detection. In contrast, we approach software development from a different angle, using univariate time series analysis to infer malware generation processes.
Time series modeling has also been applied to DDoS detection in network security. Nezhad et al. [25] proposed a prediction method based on the ARIMA model to predict DDoS attacks through simulation studies with the NS2 simulator. They used available service rates such as CPU time, memory utilization and network buffering as features to quantify a server’s availability in order to detect DDoS attacks. When servers are under DDoS attack, anomaly detection is used to analyze violations of the features predicted by the model.
Time series forecasting model
Time series modeling has been widely applied to various areas ranging from forecasting inflation and stock market behavior to computer security incidents [10], software reliability [4], and network security [26]. It involves building a model for a variable that is measured repeatedly over a period of time and then using this model to forecast future values. Time series models are not constructed to explain or measure the causal factors underlying the behavior of the malware count variable, but rather to explore patterns in the past movements of malware occurrence in order to forecast the future.
Artificial neural networks
ANNs (artificial neural networks) are computing models for information processing whose structure and function are inspired by biological neural networks. They are particularly useful for identifying fundamental functional relationships or patterns in data because of three features: parallel processing, distributed memory and adaptability [5]. Compared to other processing systems, they have the advantages of robustness and tolerance to error and noise, specifically in time series cases [16].
ANN is a class of flexible nonlinear models that can effectively discover patterns from data. In fact, there are numerous successful applications of ANN in pattern recognition and forecasting [9]. Most research has relied on feed-forward network modeling, the most popular neural network in time series models. In this paper, we use the feed-forward model as our basis for prediction.
In general, an ANN is made up of nodes or neurons organized in consecutive layers. A typical ANN model consists of an input layer, a hidden layer and an output layer. Each neuron in a layer has an associated weight and is connected to other neurons by communication links. The weights contain the knowledge or information that the ANN possesses about a specific problem.
Multilayer perceptrons
The multilayer perceptron (MLP) considered here is a feed-forward network that contains a single hidden layer. MLPs are widely used for time series modeling and forecasting [9, 33]. The MLP model is characterized by a network of three layers of processing units connected by acyclic links. Each element in the input vector corresponds to an input node in the network layer. Hence the number of input nodes is equal to the dimension of the input vectors. In univariate time series forecasting, the inputs of the network are the past observations and the output is the predicted value for future malware occurrences. The input vector for time series forecasting is almost always composed of a moving window of fixed length along the malware series. The MLP maps these inputs to the output through the hidden layer.
Figure 1 shows a typical three-layer feed-forward network. The input nodes are the previous lagged observations $y_{t-1}, y_{t-2}, \ldots, y_{t-m}$ and the output provides the forecast for the future value $y_t$. Hidden nodes with appropriate nonlinear transfer functions process the information received from the input nodes. Their output is then fed into the output layer. The model can be formalized as:

$$y_t = \alpha_0 + \sum_{j=1}^{n} \alpha_j \, f\left( \sum_{i=1}^{m} \beta_{ij} \, y_{t-i} + \beta_{0j} \right)$$

where m is the number of input nodes, n is the number of hidden nodes, and f is the activation function, which is derived from an approximation function such as the sigmoid function:

$$f(x) = \frac{1}{1 + e^{-x}}$$

$\alpha_j$ (j = 0, 1, …, n) is the vector of weights from the hidden to the output nodes, $\beta_{ij}$ (i = 1, 2, …, m; j = 1, 2, …, n) are the weights from the input to the hidden nodes, and $\beta_{0j}$ are the weights of the bias terms, whose inputs always have a value equal to 1.

MLP network for one-step-ahead forecasting based on m lagged values.
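As an illustration, the one-step-ahead mapping of a single-hidden-layer MLP (a weighted sum of sigmoid hidden units applied to m lagged values, plus an output bias) can be sketched with NumPy. This is a minimal sketch and not the code used in the experiments; the function name `mlp_forecast`, the argument names and the toy weights are hypothetical.

```python
import numpy as np

def sigmoid(x):
    """Logistic activation f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def mlp_forecast(lags, alpha0, alpha, beta0, beta):
    """One-step-ahead forecast of a single-hidden-layer MLP.

    lags   : shape (m,), the lagged values y_{t-1}, ..., y_{t-m}
    alpha0 : scalar output bias
    alpha  : shape (n,), hidden-to-output weights
    beta0  : shape (n,), hidden-layer bias weights
    beta   : shape (m, n), input-to-hidden weights
    """
    hidden = sigmoid(lags @ beta + beta0)  # n hidden activations
    return alpha0 + hidden @ alpha         # scalar forecast

# Toy call with hand-picked (hypothetical) weights: m = 3 lags, n = 2 hidden nodes.
lags = np.array([0.2, 0.5, 0.1])
beta = np.zeros((3, 2))  # zero input weights, so every hidden unit outputs sigmoid(0) = 0.5
forecast = mlp_forecast(lags, alpha0=0.1, alpha=np.array([1.0, 3.0]),
                        beta0=np.zeros(2), beta=beta)
```

With all input weights set to zero, each hidden unit outputs 0.5, so the forecast reduces to the output bias plus half the sum of the hidden-to-output weights, which makes the arithmetic easy to verify by hand.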
To construct an MLP model, we examine two stages: the running stage and the training stage. In the running stage, the input pattern is fed to the trained network and transmitted through consecutive layers of neurons until reaching an output. In the training or learning stage, the weights or parameters of the network are iteratively updated on the basis of a set of input-output patterns known as a training set. This minimizes the deviance or error between the output obtained by the network and the user’s desired output. The learning rules commonly used in this type of network are the back propagation algorithm or gradient descent method, developed and disseminated for updating the weights of the network [28].
Local minima are a recurring obstacle to optimization in neural networks because the cost function is non-convex. We mitigate this issue by trying different initial weights, optimizers, learning rates and momentum values. Tuning these hyper-parameters can improve model performance since it provides a better chance of reaching the global minimum.
We describe our approach in terms of the architecture and the detailed mechanisms used in the following subsections.
Data acquisition
The battle between malware developers and anti-malware vendors takes place on many battlefields such as threat detection services or public sites like Malwr and Virus Total where new malware is developed to bypass anti-virus engines. We collect malware intelligence from two public sources: Malwr and Global Intelligent Networks.
Malwr is a popular online service that real-world users use to analyze suspicious files and URLs. The samples are labeled by 50 anti-virus vendors. In this study, we consider malware labeled by the Symantec anti-virus engine. We also want to predict the numbers of malware of specific types such as Downloader, Trojan and Backdoor, which are indicated by a prefix in the malware’s name. For example, Backdoor.Alets.B is a backdoor. Global Intelligent Networks, sponsored by Symantec, is a public service that identifies, analyzes and provides commentary on emerging trends in the dynamic threat landscape. It contains publicly available information related to malware such as severity threat level, type and time of discovery. From this dataset we collect only malware and aggregate the counts monthly from May 1999 to December 2016 to predict new malware development processes.
Rather than using the Symantec dataset only, we also collect three malware types, Downloader, Backdoor and Trojan, since they represent the majority of malware threats. In addition, it is important to make sure our method works well for different real-world datasets. The collected intelligence is composed of two factors: the number of malware samples analyzed and the time at which the malware was published. To achieve this, we developed a modified version of Davis’s asynchronous web scraper [11] to scrape malware intelligence from Malwr.
The dataset includes a timestamp of when users submitted the malware files, as well as the type of malware detected and the hash of each file. Because Malwr does not check for duplicated samples, we removed duplicates from our dataset. In summary, from April 4, 2013 to December 9, 2016, we gathered a dataset of 669,027 samples. The number of malware files was aggregated over weekly periods and used as the independent variable.
Descriptive statistics are shown in Table 1, which reports the total number of malware samples in each dataset that we collected, the collection period and the average number of files. The averages vary across datasets: the average was highest for all types in the Malwr dataset (830.24) and lowest in the Symantec malware dataset (77.95). Figure 2 graphically shows variations in the total number of malware aggregated for new malware, Backdoor, Trojan, Downloader and all other reported malware.
Descriptive statistics - Malware datasets

Number of malware.
In ANN methodology, data samples are frequently subdivided into three sets [7]: training, validation and test sets to obtain a model that can generalize well on previously un-encountered observations. However, in time series forecasting domains, it is common to use one test set for both validation and testing purposes particularly with small datasets [33]. Therefore, we divide our dataset into two segments: the training set and the test set with a ratio of 70% and 30% respectively.
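The 70%/30% division can be sketched as a chronological split, in which the earliest observations form the training set and the most recent ones the test set, with no shuffling so that the temporal order is preserved. The helper name `chronological_split` and the sample counts below are hypothetical and serve only to illustrate the idea.

```python
def chronological_split(series, train_ratio=0.7):
    """Split a time-ordered sequence into train and test parts without shuffling."""
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be strictly between 0 and 1")
    cut = int(round(len(series) * train_ratio))
    return series[:cut], series[cut:]

# Hypothetical weekly malware counts, used only to show the call.
weekly_counts = [120, 135, 150, 90, 110, 160, 175, 140, 130, 155]
train, test = chronological_split(weekly_counts)
```

Keeping the split chronological means the model is always evaluated on observations that occur after everything it was trained on, which mirrors how a forecast would be used in practice.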
During the training phase, the network’s weights are iteratively updated on the basis of the values of the variables in the training set to minimize the error between observed and predicted values. We train the model for a large number of iterations, evaluating it on the test set, to obtain optimal weights. Nevertheless, an excessive number of parameters or weights can cause over-fitting, where a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data.
We address this point by evaluating our experiments on the test set. While learning, the network modifies its weights based on the training data only, and its errors on the held-out data are recorded. Over-fitting is usually assessed with metrics such as root mean squared error (RMSE) or symmetric mean absolute percentage error (SMAPE); we select the model with the lowest error rate, as explained in Section 5.3.
Random walk analysis
Before examining the details of our approach, we need to check whether the data is a random walk series and therefore unpredictable. If the data have random walk characteristics, the predicted output tends to resemble the original series shifted forward in time by one time step. Such a shifted appearance of the one-step-ahead prediction would indicate that the network is unable to predict the time series, since it fails to capture any structure in the data.
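We use the Lo-MacKinlay variance ratio test [21] to examine our series. The test examines the variance of the increments of a random walk: if a malware series follows a random walk process, the variance of its q-th differences is q times the variance of its first differences. Under the null hypothesis (H0), the malware series $M_i$ observed at time t, denoted $M_{i,t}$, is a random walk series:

$$M_{i,t} = \mu + M_{i,t-1} + \varepsilon_t$$

where $\mu$ is a drift parameter and the error term $\varepsilon_t$ satisfies $\mathrm{Cov}[\varepsilon_t, \varepsilon_{t-k}] = 0$ for all k ≠ 0 and $\mathrm{Cov}[\varepsilon_t, \varepsilon_{t-k}] \neq 0$ for k = 0. A general q-period variance ratio statistic VR(q) can be constructed as:

$$VR(q) = \frac{\sigma^2(q)}{\sigma^2(1)}$$

where $\sigma^2(q)$ is 1/q times the variance of the q-th differences and $\sigma^2(1)$ is the variance of the first differences, so that VR(q) = 1 under H0. Lo and MacKinlay show that a heteroskedasticity-consistent estimator of the asymptotic variance of VR(q) (denoted as θ(q)) is:

$$\theta(q) = \sum_{j=1}^{q-1} \left[ \frac{2(q-j)}{q} \right]^2 \hat{\delta}(j), \qquad \hat{\delta}(j) = \frac{\sum_{t=j+1}^{T} (M_{i,t} - M_{i,t-1} - \hat{\mu})^2 \, (M_{i,t-j} - M_{i,t-j-1} - \hat{\mu})^2}{\left[ \sum_{t=1}^{T} (M_{i,t} - M_{i,t-1} - \hat{\mu})^2 \right]^2}$$

where T is the number of first differences and $\hat{\mu}$ is their sample mean. Thus, a standardized test statistic Ψ(q), which is asymptotically standard normal under H0, can be used to test the null hypothesis:

$$\Psi(q) = \frac{VR(q) - 1}{\sqrt{\theta(q)}}$$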
We present our results in Table 2. We choose the commonly used aggregation values q = 2, 4, 8, and 16. If the decision is to “reject H0”, then there exists a relationship between the data in period q at the 5% significance level (95% confidence level). The p-value shows that we reject the null hypothesis at this significance level, which means the series is not a random walk.
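A heteroskedasticity-robust variance ratio test of this kind can be sketched in a few lines of NumPy. The sketch below follows the standard Lo-MacKinlay construction with overlapping q-period differences and their small-sample bias correction; it is an illustrative assumption about how such a test may be computed, not the code behind Table 2, and the function name `variance_ratio_test` is hypothetical.

```python
import math
import numpy as np

def variance_ratio_test(series, q):
    """Return (VR(q), z statistic, two-sided p-value) for a level series."""
    x = np.asarray(series, dtype=float)
    d1 = np.diff(x)                       # first differences
    T = d1.size
    mu = d1.mean()                        # estimated drift
    var1 = np.sum((d1 - mu) ** 2) / (T - 1)

    dq = x[q:] - x[:-q]                   # overlapping q-period differences
    m = q * (T - q + 1) * (1.0 - q / T)   # bias-corrected normaliser
    varq = np.sum((dq - q * mu) ** 2) / m
    vr = varq / var1

    # Heteroskedasticity-consistent estimate of the variance of VR(q).
    a = (d1 - mu) ** 2
    theta = 0.0
    for j in range(1, q):
        delta_j = np.sum(a[j:] * a[:-j]) / np.sum(a) ** 2
        theta += (2.0 * (q - j) / q) ** 2 * delta_j

    z = (vr - 1.0) / math.sqrt(theta)
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided p-value under N(0, 1)
    return vr, z, p

# Synthetic random walk, used only to exercise the function.
rng = np.random.default_rng(0)
random_walk = np.cumsum(rng.normal(size=2000))
vr, z, p = variance_ratio_test(random_walk, q=2)
```

For a genuine random walk the ratio should be close to 1, whereas a series whose increments are negatively autocorrelated yields a ratio well below 1 and a small p-value, leading to rejection of H0.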
p-value variance ratio test for all datasets
Preprocessing data refers to analyzing and transforming input and output variables in order to highlight the important relationships between input variables. In addition, these transformations can help the time series model to learn relevant patterns in the dataset.
Natural log transformation and first differencing are two preprocessing techniques commonly used in both traditional and neural network forecasting [16]. Logarithmic transformation has two primary benefits. First, it reduces the asymmetry of the distribution of the original data. Second, it turns multiplicative relations into additive ones, which reduces the amount of computation needed. Thus, it simplifies and improves data modeling. Differencing, that is, using the changes in a variable rather than its level, removes linear trend and seasonality from a given set of data. Neural networks also benefit from rescaling attributes so that they all share the same scale. We rescale our data to the bounds of the activation functions to accelerate weight learning and avoid saturation or overflow of the hidden and output neurons, using a min-max scaler to make sure all data are in the same range. Fig. 3 shows the preprocessed data.

Preprocessed dataset by natural log and first difference.
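The preprocessing chain (natural log, first difference, min-max scaling) can be sketched as follows. This is a minimal sketch under two assumptions that are not stated in the text: all counts are strictly positive, and the target range is [0, 1]. The function name `preprocess` and the sample counts are hypothetical.

```python
import numpy as np

def preprocess(counts, feature_range=(0.0, 1.0)):
    """Natural log -> first difference -> min-max scaling."""
    counts = np.asarray(counts, dtype=float)
    logged = np.log(counts)    # assumes strictly positive counts
    diffed = np.diff(logged)   # change in log counts between consecutive periods
    lo, hi = feature_range
    d_min, d_max = diffed.min(), diffed.max()
    scaled = lo + (diffed - d_min) * (hi - lo) / (d_max - d_min)
    # Return the scaling bounds so that forecasts can be mapped back later.
    return scaled, (d_min, d_max)

# Hypothetical counts, used only to show the call.
scaled, bounds = preprocess([100, 120, 90, 150, 130])
```

In a forecasting setting the scaling bounds would normally be computed on the training portion only and then reused for the test portion, so that no information from the test period leaks into training.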
Neural network models vary in the number of input values, the number of neurons in the hidden layers and the activation function. The hyper-parameters of the network were found by performing a grid search to obtain the smallest error on the test set. The search iterates over the number of lagged values (the input dimension), the number of hidden layers and their neurons, the activation function, the learning algorithm and the cost function used during the learning phase.
The greater the number of hidden layers, the longer the computation time. A more complex fitted model is also more prone to over-fitting, which can lead to poor performance on out-of-sample forecasting. Thus, the smaller model is selected. In addition, a network with more than 4 hidden layers does not provide satisfactory performance [33]. For these reasons, the 3-layer feed-forward network is the most widely used model for time series modeling and forecasting.
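The grid search procedure itself can be sketched independently of the network. To keep the example short and self-contained, the sketch below scores each hyper-parameter combination with a stand-in model, a ridge-regularised linear autoregression, rather than an MLP; the stand-in, the function names and the grid values are all assumptions made for illustration.

```python
import itertools
import numpy as np

def lagged(series, lag):
    """Build (X, y) where each row of X holds `lag` consecutive past values."""
    series = np.asarray(series, dtype=float)
    X = np.array([series[i:i + lag] for i in range(series.size - lag)])
    return X, series[lag:]

def score_linear_ar(train, test, lag, ridge):
    """Test-set RMSE of a ridge-regularised linear autoregression (stand-in model)."""
    X_tr, y_tr = lagged(train, lag)
    X_te, y_te = lagged(test, lag)
    A = np.column_stack([np.ones(len(X_tr)), X_tr])
    coef = np.linalg.solve(A.T @ A + ridge * np.eye(A.shape[1]), A.T @ y_tr)
    pred = np.column_stack([np.ones(len(X_te)), X_te]) @ coef
    return float(np.sqrt(np.mean((pred - y_te) ** 2)))

def grid_search(train, test, grid, score_fn):
    """Evaluate every combination in `grid` and return (best_params, best_score)."""
    names = list(grid)
    scored = []
    for combo in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        scored.append((score_fn(train, test, **params), params))
    best_score, best_params = min(scored, key=lambda item: item[0])
    return best_params, best_score

# Synthetic series used only to exercise the search.
rng = np.random.default_rng(0)
series = np.sin(np.arange(120) / 4.0) + 0.1 * rng.normal(size=120)
train, test = series[:84], series[84:]
best_params, best_score = grid_search(
    train, test, {"lag": [2, 3, 4], "ridge": [1e-3, 1e-1]}, score_linear_ar)
```

Replacing `score_linear_ar` with a function that trains an MLP and returns its test error would turn the same loop into a search over lags, hidden neurons, activation functions and optimizers.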
Figure 4 shows the structure of the data matrix corresponding to the MLP network. With a lagged value of 3, the output value at t4 can be predicted by three previous values (t1, t2, t3), where t5 depends on (t2, t3, t4).

Structure of a time series matrix with lagged = 3 for training of the MLP network.
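The construction of this data matrix can be sketched as a sliding window over the series: each row holds `lag` consecutive values and the target is the value that immediately follows them. The function name `make_lag_matrix` is hypothetical.

```python
import numpy as np

def make_lag_matrix(series, lag):
    """Turn a 1-D series into (X, y) pairs for one-step-ahead forecasting."""
    series = np.asarray(series, dtype=float)
    if lag < 1 or lag >= series.size:
        raise ValueError("lag must be at least 1 and smaller than the series length")
    # Row i contains series[i], ..., series[i + lag - 1]; its target is series[i + lag].
    X = np.array([series[i:i + lag] for i in range(series.size - lag)])
    y = series[lag:]
    return X, y

# Six observations t1..t6 with lag 3: (t1, t2, t3) -> t4, (t2, t3, t4) -> t5, and so on.
X, y = make_lag_matrix([1, 2, 3, 4, 5, 6], lag=3)
```

A series of length N therefore yields N minus lag training patterns, which is one reason small lags are attractive for short malware series.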
Neural network implementation
We implemented the time series model with the TensorFlow [1] framework, a software library for numerical computation using data flow graphs. The model is a feed-forward neural network with three layers: one input layer, one hidden layer and one output layer. The optimal number of neurons is selected by a grid search.
To avoid getting stuck in local minima, we adopt the common practice of multiple starts in neural network training. We train the models with a root mean squared error cost function. To increase the chance of reaching the global minimum, an adaptive learning parameter is also used. The initial weights are drawn uniformly at random. Other parameters for Adamax, softsign and Adadelta are set as recommended by [19, 32] respectively. We train each model for 500 epochs to obtain the smallest error rate on the test set.
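The multiple-start idea can be sketched without any deep learning framework. The example below deliberately simplifies the setup described above: it uses plain full-batch gradient descent on mean squared error with sigmoid hidden units instead of TensorFlow with Adamax and softsign, and all function names, the initialisation range and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def init_weights(m, n, rng):
    """Uniformly random initial weights for an m-input, n-hidden, 1-output MLP."""
    return {
        "beta": rng.uniform(-0.5, 0.5, size=(m, n)),
        "beta0": rng.uniform(-0.5, 0.5, size=n),
        "alpha": rng.uniform(-0.5, 0.5, size=n),
        "alpha0": rng.uniform(-0.5, 0.5),
    }

def forward(w, X):
    """Return hidden activations and predictions for every row of X."""
    hidden = 1.0 / (1.0 + np.exp(-(X @ w["beta"] + w["beta0"])))
    return hidden, hidden @ w["alpha"] + w["alpha0"]

def rmse(w, X, y):
    return float(np.sqrt(np.mean((forward(w, X)[1] - y) ** 2)))

def train_once(X, y, n_hidden, rng, epochs=500, lr=0.1):
    """Full-batch gradient descent on half the mean squared error, from one random start."""
    w = init_weights(X.shape[1], n_hidden, rng)
    for _ in range(epochs):
        hidden, pred = forward(w, X)
        err = (pred - y) / y.size  # gradient of the loss w.r.t. each prediction
        grad_hidden = np.outer(err, w["alpha"]) * hidden * (1.0 - hidden)
        w["alpha"] -= lr * hidden.T @ err
        w["alpha0"] -= lr * err.sum()
        w["beta"] -= lr * X.T @ grad_hidden
        w["beta0"] -= lr * grad_hidden.sum(axis=0)
    return w

def train_multi_start(X, y, X_eval, y_eval, n_hidden, n_starts=5, seed=0):
    """Train from several random starts and keep the run with the lowest evaluation RMSE."""
    rng = np.random.default_rng(seed)
    runs = [train_once(X, y, n_hidden, rng) for _ in range(n_starts)]
    return min(runs, key=lambda w: rmse(w, X_eval, y_eval))

# Synthetic, already scaled series with 3 lags; used only to exercise the code.
series = 0.5 + 0.4 * np.sin(np.arange(200) / 5.0)
X = np.array([series[i:i + 3] for i in range(series.size - 3)])
y = series[3:]
cut = int(round(0.7 * len(y)))
best = train_multi_start(X[:cut], y[:cut], X[cut:], y[cut:], n_hidden=4)
```

Each restart begins from a different point on the non-convex error surface, so keeping the best of several runs reduces the dependence of the final model on an unlucky initialisation.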
Table 3 describes the architectures used for the experiments in Section 5.3. The first column lists the datasets. The second column contains the number of neurons in the input, hidden and output layers. For example, the MLP network for the Symantec dataset has 3 input neurons, 4 neurons in the hidden layer and 1 output neuron.
Best-fitted MLP structure
Prediction accuracy is an essential criterion for evaluating forecasting performance. We use symmetric mean absolute percentage error (SMAPE) and root mean squared error (RMSE) to estimate model performance and reliability.
Root mean squared error (RMSE) is the square root of the mean of the squared differences between predicted values and observed values. RMSE is very commonly used and makes an excellent general-purpose error metric for numerical predictions.
Symmetric mean absolute percentage error (SMAPE) is based on the mean absolute percentage error (MAPE), which is one of the most commonly used metrics for measuring the goodness of fit of a given model. However, MAPE is sensitive to data containing zeros or near-zero values, which may skew the overall error rate [12]. SMAPE has a lower bound of 0% and an upper bound of 200%, thus reducing the influence of zeros or near-zero values. It is given by the following formula:

$$\mathrm{SMAPE} = \frac{100\%}{N} \sum_{i=1}^{N} \frac{|F_i - A_i|}{(|A_i| + |F_i|)/2}$$

where $A_i$ is the actual value (actual number of malware), $F_i$ is the predicted value (predicted number of malware), and N is the total number of prediction intervals.
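Both metrics are straightforward to compute. The sketch below uses the SMAPE variant bounded between 0% and 200%; treating intervals where both the actual and predicted values are zero as contributing zero error is an assumption made here, and the function names are hypothetical.

```python
import numpy as np

def rmse(actual, forecast):
    """Root mean squared error between actual and predicted values."""
    a = np.asarray(actual, dtype=float)
    f = np.asarray(forecast, dtype=float)
    return float(np.sqrt(np.mean((f - a) ** 2)))

def smape(actual, forecast):
    """Symmetric mean absolute percentage error, in percent (0 to 200)."""
    a = np.asarray(actual, dtype=float)
    f = np.asarray(forecast, dtype=float)
    denom = (np.abs(a) + np.abs(f)) / 2.0
    # Intervals where both values are zero contribute zero error (an assumption).
    ratio = np.divide(np.abs(f - a), denom, out=np.zeros_like(denom), where=denom != 0)
    return float(100.0 * np.mean(ratio))

# Hypothetical actual and predicted malware counts.
example_smape = smape([100, 200], [110, 180])
example_rmse = rmse([100, 200], [110, 180])
```

Because the denominator averages the actual and predicted magnitudes, a forecast of 10 against an actual value of 0 produces the maximum error of 200% instead of the division by zero that MAPE would encounter.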
Predictive capability of selected ANN models
The models listed in Table 3 are subsequently used to generate the forecast values. RMSE shows that our model is a good fit for all datasets. Table 4 shows the performance measurements for the 5 datasets in terms of SMAPE and RMSE. We can see that the error rate for forecasting the number of malware is lowest for the Trojan dataset (12.36%). This implies that the number of malware at time t depends on the number of malware at time periods t - 1 and t - 2. The error rates for the other datasets (Backdoor, all types, Downloader and Symantec) are reported in Table 4.
Forecasting performance of MLP networks with a preprocessed test set
(*) All types of malware in the Malwr dataset.
The period shown in Table 2 corresponds to the number of historical data points used by the best architecture. For all models, only 2 to 3 previous data points are sufficient to construct a forecasting model. For example, in the Trojan dataset, period 2 is significant since 0.0317 < 0.05, whereas period 4 is insignificant since 0.0726 > 0.05. In addition, the number of input neurons for the Trojan model is 2 according to Table 3.
Table 3 shows that the selected MLP architectures presented the best forecasting accuracy for the test sets. It also implies that the adam-based optimizer is appropriate for malware prediction.
Figure 5 shows the graphs of the original malware data along with the fitted values obtained from the prediction models on the test set. The plots show that the predicted values closely match the original values for all datasets.

Original values vs fitted values on preprocessed dataset.
Through our experiment, we found that time series models provide a good fit for our datasets and can be utilized to predict the number of malware. In particular, the prediction model for the malware samples in the Symantec dataset achieves a prediction error of 23.63% SMAPE on the test set. In addition, our models demonstrate good performance on 3 types of malware: Downloader, Backdoor and Trojan. Analysis of malware types can provide valuable intelligence for incident response.
Nowadays, APT attacks usually involve unseen malware or 0-day exploits to compromise the target. Predicting the number of malware can assist security responders in detecting APT attacks by observing the evolution of malware.
We assume that malware development is affected by respective historical factors. To strengthen our assumption, we conducted an analysis on various malware datasets. As a result, our models can successfully reveal historical factors for forecasting the number of malware in the future.
To our knowledge, this study is the first of its kind to use time series models to predict the number of malware present. Our results revealed that time series models efficiently complement malware analysis and detection. Although we believe that this study makes a number of contributions, it has some limitations. One of these is the sample data. Our study analyzed publicly reported malware without considering unreported or undisclosed data. For example, a malware developer may use a sandbox to check samples in private mode. Another limitation comes from the instability of the malware repository, which can distort the malware timeline in our dataset.
Conclusion and future work
In this work, we examined the feasibility of quantitatively characterizing some aspects of security. To achieve this goal, we first investigated whether it is possible to predict the number of malware by random walk analysis. Then, we built time series models to predict the number of future malware based on neural network.
To evaluate our model, we used Symantec Global Intelligent Networks and the publicly available sandbox Malwr. To this end, we achieved general models that can work well for various datasets. Specifically, our results revealed that time series models are beneficial and provide a further step towards more useful and accurate predictions of new malware. Additionally, we determined the appropriate NN architecture for this problem.
We conclude that historical data is significant for building prediction models. Specifically, in the lag range [2, 4], our model can achieve high performance with a SMAPE of 12.36%. These results indicate that the model can be used for tracing malware development.
Future research directions include enhancing the accuracy of the proposed models by including additional factors and considering time series models to predict different dependent variables such as malware severity or a number of new variants in the malware family. We are also looking for ways to extend our analysis by using linear or hybrid non-linear - linear time series models to improve performance. Time series models can be applied to other mainstreams of malware intelligence such as threat management, security assessment or malware classification.
Acknowledgments
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2017R1A2B4001801).
