DDIML: Explainable detection model for drive-by-download attacks

Abstract

A drive-by download is an attack in which hackers plant a web Trojan that exploits browser vulnerabilities to execute malicious software. Because people access web pages with various browsers every day, drive-by downloads have become one of the most common threats in recent years. Most previous studies combine the abstract syntax tree (AST) with deep learning methods to detect such attacks; these achieve high accuracy but are time-consuming and difficult to explain. Other methods use dynamic analysis, which requires a dedicated environment and is slow because of its complex operation. To solve these problems, this paper proposes DDIML, an explainable machine learning model based on novel features obtained by static analysis. The features are extracted from five aspects: code obfuscation, URL redirection, special behaviors, encoding characters, and CSS attributes. Random forest, a widely used machine learning algorithm, is applied to build the detection classifier. In addition, we use both local and global explanations to improve the model and to show that it can be trusted. The experimental results show that the proposed model detects drive-by downloads efficiently, with a precision of 0.983 and a recall of 0.980. The average detection time per sample is only 16.07 ms.

Keywords

Static analysis; Drive-by downloads; Features; Random forest; Explanation

1 Introduction

With the popularity of computers and the maturity of computer technologies, web applications have become an indispensable part of social life and work. They bring convenience to people’s lives, work, and study, but they also carry risks. Network attacks aimed at web applications are still increasing. The most common one is the drive-by download attack, which is difficult to detect and can cause severe damage because of its high concealment and flexibility.

Data published by the European Union Agency for Network and Information Security (ENISA) in 2018 show that web-based attacks ranked second [1]. According to the Tencent Anti-Virus Lab security report for the third quarter of 2018, 1.2 billion viruses were intercepted on PCs, with an average of 133 million Trojans every month. Trojans are the largest class of viruses, accounting for 57.57% of the total number of viruses [2]. According to the 117th CNCERT Internet Security Threat Report in September 2020, more than 1.18 million hosts corresponding to IP addresses in China are controlled by Trojans or bots [3].

Drive-by downloads refer specifically to the installation of malicious programs on users’ devices without their knowledge or consent. The attack infects vulnerable computers automatically and silently. Hackers compromise insecure websites and plant malicious scripts on some of their pages. When a user accesses an infected website, the malicious script may install malicious programs on the user’s device or redirect the victim to other sites that the attackers control. In many cases, attackers obfuscate the malicious scripts, which makes them harder for security researchers to detect. Many recent works apply ASTs with deep learning methods to detect the attack. These methods can achieve high accuracy, but they are time-consuming and error-prone in practical use. Dynamic analysis methods are also time-consuming because of their complex operation. To improve efficiency while preserving detection accuracy, this paper combines feature engineering with machine learning. The paper makes the following contributions:

This paper manually extracts features from five aspects: code obfuscation, URL redirection, special behaviors, encoding characters, and CSS attributes. Among static analysis methods, the feature types we extracted are relatively comprehensive.

The model DDIML is interpretable. We give local explanations and global explanations to the model. Local explanations focus on the impact of each feature in the sample on its prediction result. Moreover, global explanations express the feature importance and the relationship between feature values and prediction results. Local and global explanations help to understand the prediction better and improve the transparency of the model.

Compared with other methods, DDIML has higher accuracy and is less time-consuming. The accuracy of our model reaches 0.982, and the average detection time of each sample is only 16.07ms.

The rest of the work is organized as follows. Section 2 briefly introduces the related work about malicious JavaScript detection. Section 3 discusses features, the Random forest algorithm, and the explanation method in detail. Section 4 provides experiment design, analyzes the experiment results, and visualizes the interpretation of the model. Section 5 provides conclusions and future work.

2 Related work

The drive-by download is the most common attack method and poses a severe threat to people’s information privacy and property security. Network security researchers have paid close attention to it. With the development of machine learning and deep learning, researchers have made many contributions to detecting malicious JavaScript code. In general, we divide the research methods into static, dynamic, and semi-dynamic analysis according to whether the JavaScript code is executed.

2.1 Static analysis

Static analysis focuses on extracting static features from JavaScript code. Likarish et al. extract 65 features, but 50 of them come from JavaScript keywords and symbols, so the extracted feature types are not comprehensive enough [4]. Curtsinger et al. propose Zozzle, which extracts hierarchical features from the JavaScript abstract syntax tree and hooks into a browser’s JavaScript engine to handle deobfuscation [5]. Canali et al. [6] design a filter named Prophiler, which extracts HTML features, JavaScript features, and URL and host-based features to construct three different feature sets. They train three models with the different feature sets to get the best effect. The method also has lower overhead than dynamic analysis methods. Nayeem et al. present an interceptor between browser and server, which extracts features and uses the wrapper method for feature selection. The interceptor achieves high accuracy [7].

Static analysis methods can detect most malicious scripts quickly, but they lack semantic information. To solve this problem, some researchers use the AST to extract semantic features [8–11]. Others leverage the AST to build graphs and extract graph-based features [12, 13]. These approaches provide better accuracy and performance than conventional approaches. Furthermore, to detect obfuscated JavaScript code, Morishige et al. reconstruct divided URLs to improve detection effectiveness [14]. Stokes et al. propose the ScriptNet system, which contains Pre-Informant Learning to process JavaScript files as byte sequences [15]. Guo et al. propose a GAN-based method, which can achieve high accuracy with few labeled samples [16].

2.2 Dynamic analysis

Dynamic analysis mainly relies on client honeypot technology, which simulates the communication between the browser and the target site in a virtual environment. It distinguishes normal JavaScript code from malicious code by monitoring system behaviors.

Wang et al. design the Strider HoneyMonkey, which generates an XML report containing executable files, process creation, vulnerabilities, etc. HoneyMonkey successfully detects the javaprxy.dll vulnerability [17]. Mitsuaki et al. [18] mainly extract features from exploiting phases, multiple crawler processing, tracking of malware distribution networks, and malware infection prevention. They implement a new client honeypot that detects drive-by downloads and investigates various malicious websites. Jayasinghe et al. extract opcode features with dynamic analysis to detect drive-by download attacks [19]. The method is efficient and has low resource consumption. Xue et al. combine data dependency analysis, defense rules, and replay mechanisms to classify JavaScript code as normal or malicious [20]. Their method is scalable and efficient but consumes more time and resources.

2.3 Semi-dynamic analysis

Semi-dynamic analysis combines static analysis and dynamic analysis. Cova et al. propose an approach named JSAND, which uses HtmlUnit emulation to execute code and extracts ten features from redirection, deobfuscation, environment preparation, and exploitation [21]. It can detect previously unseen attacks and reduces the false-positive rate, but it is slow. Rieck et al. present a system named Cujo, which uses ADSandbox and SpiderMonkey for dynamic analysis. The system detects 95% of drive-by downloads with few false positives and an average run time of 500 ms per sample [22]. The authors of [23] design a tool named JSDC, which extracts features from the text, program structures, and risky function calls to detect malicious JavaScript code. Compared with other tools, JSDC gives low false-positive and false-negative rates. He et al. [24] implement a browser plugin called MJDetector, which conducts syntax analysis and dynamic instrumentation to extract features. The plugin can detect obfuscated malicious JavaScript and has high accuracy.

Although dynamic analysis has high accuracy, it has a heavy overhead and is time-consuming. This paper extracts 49 features statically, including 26 features proposed in other papers and 23 new features. As our experiments demonstrate, the new features contribute significantly to detection accuracy. Detection is also faster than with dynamic approaches.

3 Proposed method

In this paper, our goal is to classify JavaScript code as either malicious or benign, that is, as likely to carry a drive-by download attack or not. To implement the classification, we analyze features from the literature [7, 24–26] and extract 26 valuable features. We also extract 23 novel features by analyzing our samples. The overview of our model is depicted in Fig. 1. In the following, we first describe the features and then introduce the Random forest algorithm. Finally, we describe the specific definition of SHAP and why we choose SHAP to explain our model.

Fig. 1

Overview of DDIML architecture.

3.1 Features extraction

For feature extraction, we extract different features from malicious and benign JavaScript files. We have written a feature extractor based on regular-expression matching in Python, which takes a JavaScript file as input and produces the feature matrix as output. Malicious JavaScript code is generally transformed, encoded, and encrypted to conceal the attackers’ purpose and intention. We manually analyze samples and extract features from the following five aspects.

Code obfuscation. Attackers often use obfuscation techniques to make code difficult to read, in order to hide the malicious code and avoid signature detection. Escape encryption is a commonly used obfuscation technique in which the encrypted data is separated by ’%’. The encrypted data is decrypted with the unescape function and executed with the eval function. Although there are many ways to encrypt strings, the decrypted result must in most cases be executed by the eval function, so the eval function appears frequently in malicious code. In many cases, obfuscation makes strings in malicious code longer than those in normal code. In addition, the greater the information entropy of a string, the worse its readability.

URL redirection. Attackers widely use redirection technology in drive-by download attacks. Scripts change DOM elements during execution, and attackers can load malicious URLs to the page through the src or href attribute. In addition, attackers can use the object methods of document.URL or location.href to achieve redirection. The location object contains information about the current URL. Attackers can use object events to open a new Trojan page.

Special behaviors. Some functions often appear in malicious JavaScript. For example, script tags or iframe tags can be created dynamically with createElement(’script’) or createElement(’iframe’) and then inserted into the DOM by calling appendChild(), insertBefore(), or other functions. The content of dynamically appended tags may be malicious. Attackers also commonly split malicious code into small snippets and then use document.write() or document.writeln() to write the snippets into a page. Some JS code loads suspicious external files.

Encoding characters. Base conversion, including octal and hexadecimal encoding, is also commonly used for obfuscation. Hexadecimal encoding leaves many ’\x’ sequences in JavaScript code, and Unicode encoding introduces ’\u’ or ’%u’. In short, encoding-based obfuscation produces a large number of special characters and digits.

CSS attributes. Some CSS attributes can hide malicious codes. For example, the iframe injection attack has high concealment, which sets the width and height of the frame to 0 to make the iframe invisible. Because of the iframe injection attack, it is difficult for users to discover the current web page’s abnormalities. In addition, attackers can also set display, opacity, allowTransparency and other attributes to hide malicious codes.
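As an illustration of how counts, string length, and entropy of this kind could be computed with regular expressions, the following is a minimal sketch. The patterns, the function names `shannon_entropy` and `extract_features`, and the sample string are assumptions made for this example; they are not the regular expressions used in the DDIML extractor, which are not listed in the paper.

```python
import math
import re
from collections import Counter

# Illustrative patterns only (assumed for this sketch).
CALL_PATTERNS = {
    "Num_eval": re.compile(r"\beval\s*\("),
    "Num_unescape": re.compile(r"\bunescape\s*\("),
    "Num_write": re.compile(r"\bdocument\.write(?:ln)?\s*\("),
    "Num_hex": re.compile(r"\\x[0-9a-fA-F]{2}"),
}
# Single- or double-quoted string literals, allowing escaped characters.
STRING_LITERAL = re.compile(r"\"((?:[^\"\\]|\\.)*)\"|'((?:[^'\\]|\\.)*)'")


def shannon_entropy(text: str) -> float:
    """Shannon entropy in bits per character (0.0 for an empty string)."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in Counter(text).values())


def extract_features(js_code: str) -> dict:
    """Count pattern occurrences and measure the longest string literal."""
    features = {name: len(p.findall(js_code)) for name, p in CALL_PATTERNS.items()}
    literals = [a or b for a, b in STRING_LITERAL.findall(js_code)]
    longest = max(literals, key=len, default="")
    features["Len_maxStr"] = len(longest)
    features["EntropyValue_maxstr"] = shannon_entropy(longest)
    return features


# Hypothetical obfuscated snippet, written only for this example.
sample = 'var s = unescape("%61%6c%65%72%74"); eval(s); document.write("\\x41\\x42");'
print(extract_features(sample))
```

A regex-based extractor of this kind never executes the script, which is what keeps static analysis cheap; the trade-off is that patterns can be evaded by code that builds function names at run time.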

We describe the specific features in the tables, where the reused features are illustrated in Table 1, and the new features we extracted are shown in Table 2.

Table 1
Reused features extraction in our model

Feature type	No	Feature name	Feature description
Code Obfuscation	f1	Num_charAt	The number of charAt()
	f2	Num_charCodeAt	The number of charCodeAt()
	f3	Num_split	The number of split()
	f4	Num_escape	The number of escape()
	f5	Num_unescape	The number of unescape()
	f6	Num_eval	The number of eval()
	f7	Num_String	The number of String object
	f8	Num_fromCharCode	The number of fromCharCode()
	f9	Num_parseInt	The number of parseInt()
	f10	Num_indexOf	The number of indexOf()
	f11	Len_maxStr	The largest length of the string in JS codes
	f12	EntropyValue_maxstr	The information entropy of the longest string
	f13	Num_replace	The number of replace()
	f14	Num_join	The number of join()
Special Behaviors	f15	Num_createElement	The number of createElement()
	f16	Num_activeXObject	The number of ActiveXObject()
	f17	Num_write	The number of document.write() and document.writeln()
	f18	Num_addEvent	The number of addEventListener() and attachEvent()
	f19	Num_child	The number of appendChild() and insertBefore()
	f20	Num_setTimeout	The number of setTimeout() and setInterval()
	f21	Num_iframe	The number of iframe tags
	f22	Num_script	The number of script tags
	f23	Num_keywords	The number of keywords, such as var, return, void, null, function, etc.
Encoding Characters	f24	Num_unicode	The number of Unicode encoding
	f25	Num_octal	The number of Octal encoding
	f26	Num_hex	The number of Hexadecimal encoding

Table 2

New features extraction in our model

Feature type	No	Feature name	Feature description
Code Obfuscation	f27	Num_substr	The number of substring(), substr() and subarray()
	f28	Num_concatFlag	The number of plus ’+’ and concat()
	f29	Ratio_minStr	The ratio of the number of strings whose length is equal to one or two to the number of variables
	f30	Num_bigInteger	The number of integers that exceed sys.maxsize
	f31	AvgWholeEntropyValue	The average value of entropy in the whole JS file
	f32	Num_maxstrDigitletter	The alternation times of digits and letters in the largest strings
	f33	Ratio_vowelsAlphabet	The ratio of vowels to characters in the largest string
	f34	Ratio_funcVar	The ratio of the number of functions to the number of variables
	f35	Ratio_vowelsfFuncName	The ratio of vowels to alphabets in the largest function names
URL Redirection	f36	Num_srctags	The number of src attribute and href attribute
	f37	Num_docRedirect	The count of document.URL, document.referrer, document.domain, etc.
	f38	Num_location	The number of location.href, location.replace, location protocol, etc.
	f39	Num_cookie	The number of document.cookie, use it to create, read and delete cookies.
	f40	Num_document	The number of document, the attributes and methods of document objects may lead in malicious content.
Special Behaviors	f41	Num_externalFile	The number of external files in JS codes with the extension ’.php’ or ’.exe’
	f42	Ratio_keywords	The ratio of keywords to tokens in JS codes.
	f43	Exist_image	Whether loading images or fonts or not
Encoding characters	f44	Ratio_specificChar	The ratio of the count of specific characters with the number of tokens in JS codes
	f45	Ratio_digits	The ratio of the number of digits to the number of tokens in JS codes
CSS Attributes	f46	Exist_width	Whether the attribute value of width or height is set to 0
	f47	Exist_border	Whether the attribute value of border or frameBorder is set to 0
	f48	Exist_display	Whether the attribute value of display is set to none
	f49	Exist_opacity	Whether the attribute value of opacity is less than 1, the attribute value of allowTransparency is set to true, etc.
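Some of the new features in Table 2 can be illustrated with short helper functions. The sketch below shows one possible reading of Num_maxstrDigitletter, Ratio_vowelsAlphabet, Exist_width, and Exist_display. The function names, the regular expressions, and the exact counting rules are assumptions for illustration and may differ from the authors' implementation.

```python
import re

# Assumed patterns: width/height set to 0, and display set to none,
# written either as HTML attributes or as CSS/JS style assignments.
ZERO_SIZE = re.compile(r"\b(?:width|height)\s*[=:]\s*[\"']?0(?:px)?(?![\w.])", re.I)
DISPLAY_NONE = re.compile(r"\bdisplay\s*[=:]\s*[\"']?none\b", re.I)


def digit_letter_alternations(text: str) -> int:
    """Count adjacent character pairs that switch between a digit and a letter."""
    return sum(
        1
        for a, b in zip(text, text[1:])
        if (a.isdigit() and b.isalpha()) or (a.isalpha() and b.isdigit())
    )


def vowel_ratio(text: str) -> float:
    """Ratio of vowels to all characters in the string (0.0 if empty)."""
    if not text:
        return 0.0
    return sum(c in "aeiouAEIOU" for c in text) / len(text)


def exist_width(code: str) -> int:
    """1 if width or height is set to 0 somewhere in the code, else 0."""
    return int(ZERO_SIZE.search(code) is not None)


def exist_display(code: str) -> int:
    """1 if display is set to none somewhere in the code, else 0."""
    return int(DISPLAY_NONE.search(code) is not None)


print(digit_letter_alternations("a1b2"))
print(vowel_ratio("banana"))
print(exist_width('<iframe src="x" width="0" height="0">'))
print(exist_display("el.style.display = 'none';"))
```

The intuition behind the two string features is that machine-generated, obfuscated strings tend to mix digits and letters and contain few vowels, whereas human-written identifiers and text look more like natural words.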

3.2 Random forest algorithm

Random forest is a supervised learning algorithm composed of many decision trees that are built independently of one another. Building each decision tree relies on random sampling and complete splitting. The specific process for constructing each decision tree is as follows:

N represents the number of training samples. M represents the number of features.

Row sampling: draw N samples with replacement from the N training samples to form the training set of the tree. The new training set may contain repeated samples.

Column sampling: randomly select m (m ≪ M) of the M features, and then choose the optimal feature among these m to split a node into left and right subtrees. This enhances the generalization ability of the model.

Then, the method uses complete splitting to grow the decision tree. Because each tree is built on random samples and random features, pruning is usually unnecessary and the risk of over-fitting is reduced.

When a new sample is input, each decision tree votes for a category, and the category with the largest number of votes is the final result.

The Random forest can handle high-dimensional data without feature selection compared with other machine learning algorithms. Moreover, the training speed is fast. Also, it can easily measure the importance of features.
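The row sampling, column sampling, and voting steps described above can be sketched in a few lines of standard-library Python. This is only an illustration of the three random-forest ingredients, not a full tree learner. The function names are hypothetical, and the values M = 49 and m = 8 are chosen to mirror the feature count and the max_features setting reported in this paper.

```python
import random
from collections import Counter


def bootstrap_rows(n_samples: int, rng: random.Random) -> list[int]:
    """Row sampling: draw N indices from N samples with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def candidate_features(n_features: int, m: int, rng: random.Random) -> list[int]:
    """Column sampling: pick m distinct feature indices out of M (m < M)."""
    return rng.sample(range(n_features), m)


def majority_vote(tree_predictions: list[str]) -> str:
    """The final label is the class predicted by the most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)
rows = bootstrap_rows(10, rng)          # may contain repeated indices
cols = candidate_features(49, 8, rng)   # 8 candidate features for one split
print(rows)
print(cols)
print(majority_vote(["malicious", "benign", "malicious"]))
```

Because each tree sees a different bootstrap sample and a different subset of candidate features at each split, the trees make partly independent errors, and voting averages those errors out.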

3.3 Explanation method

The purpose of DDIML is to detect whether JavaScript code is normal or malicious. Understanding why the model predicts some JavaScript code to be benign and some to be malicious lets us assess whether the learned model can be trusted. To this end, this paper mainly uses SHAP [27] to interpret the predictions of the DDIML model.

Using the feature importance API of Random forest, we can obtain an importance ranking of the features in the model. The top 20 features are shown in Fig. 2. Our new features account for over half of the top 20, which shows their usefulness and importance to the model prediction. However, feature importance only tells which features matter; it does not tell how features affect the prediction results or how much influence they have. We could use PDP (Partial Dependence Plots) or LIME (Local Interpretable Model-agnostic Explanations) [28] to explain the effect of the features on the model prediction. PDP, however, assumes that the features are independent; if the features are correlated, its results can be misleading. LIME likewise ignores feature correlation. We use the Pearson correlation coefficient to calculate the correlation among features, as shown in Fig. 3. The figure shows that no feature is entirely independent of the others, so we choose SHAP to explain the model.

Fig. 2

Feature importance.

Fig. 3

Feature interaction.
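For reference, the Pearson correlation coefficient between two feature columns can be computed as in the following sketch. The function name `pearson` and the two example columns are hypothetical values used only to show the calculation; they are not taken from the dataset in this paper.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # convention chosen here for constant columns
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values of two features over five samples.
num_eval = [0, 1, 3, 4, 7]
num_unescape = [0, 1, 2, 4, 6]
print(pearson(num_eval, num_unescape))
```

A coefficient close to 1 or -1 between two features indicates that they carry overlapping information, which is the situation in which independence-based explanation methods become unreliable.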

SHAP (SHapley Additive exPlanations) can interpret the output of any machine learning model. Inspired by cooperative game theory, SHAP constructs an additive explanatory model in which all features are contributors. The SHAP value is the contribution of each feature in a sample, and it can be estimated correctly even if the features are correlated. Assume that the ith sample is x_i, the jth feature of the ith sample is x_ij, and the model predicts y_i for this sample. y_base represents the mean value of the predicted results of all samples. With M features, the SHAP values obey Equation 1. $y_{i} = y_{base} + f(x_{i1}) + f(x_{i2}) + \dots + f(x_{iM})$ (1)

where f(x_ij) is the SHAP value of x_ij. The most significant advantage of the SHAP value is that it reflects the influence of each feature of the sample on the prediction result, and it shows whether that influence is positive or negative.
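The additive property in Equation 1 can be checked on a small example by computing exact Shapley values by enumeration. The sketch below makes several simplifying assumptions: `toy_model` is a hypothetical three-feature scoring function, absent features are replaced by a fixed baseline vector rather than averaged over background data, and the base value is the model output at that baseline. It illustrates the game-theoretic definition rather than the estimation algorithm of the SHAP library.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x; absent features use the baseline."""
    n = len(x)

    def value(subset):
        # Features in `subset` take their real value; the rest take the baseline.
        return predict([x[k] if k in subset else baseline[k] for k in range(n)])

    phi = []
    for j in range(n):
        others = [k for k in range(n) if k != j]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                total += weight * (value(members | {j}) - value(members))
        phi.append(total)
    return phi


def toy_model(features):
    # Hypothetical scoring function with an interaction between features 1 and 2.
    return 0.1 + 0.3 * features[0] + 0.2 * features[1] * features[2]


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
y_base = toy_model(baseline)
print(phi)
print(y_base + sum(phi), toy_model(x))
```

In this toy case the interaction term is split equally between the two interacting features, and the base value plus the sum of the contributions reproduces the model output, which is the behavior Equation 1 describes. Exact enumeration is exponential in the number of features, so it is practical only for very small examples.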

4 Experiments

4.1 Experimental setup

This paper studies the detection of drive-by download attacks based on static analysis. We use regular-expression matching to extract features and optimize the regular expressions to improve the features and reduce extraction time. Moreover, we use scikit-learn to implement the machine learning algorithms. The specific experimental configuration is described in Table 3.

Table 3
Experimental environment configuration

Configuration	Information
Operating system	Windows 10
System configuration	CPU: Intel i5-8600 @ 3.60GHz; RAM: 8GB
Python library	Scikit-learn: 0.20.2; Matplotlib: 3.0.3; Chardet: 3.0.4; Anaconda: 4.8.2

Datasets. In this paper, 4107 samples are used, including 2085 normal samples and 2022 malicious samples. We crawled 20,000 normal web pages from the top 100 websites in Alexa [29] and extracted the JavaScript code from them, saving it as .js files. To balance the normal dataset with the malicious dataset, we randomly select 2085 JavaScript files to constitute the normal dataset. The malicious samples are from GitHub (all of MJDetector [30] and part of javascript-malware-collection [31]).

Experiment Design. In experiment I, the ratio of training set to test set is 8:2, and we compare the traditional machine learning algorithms XGBoost, CART, LogisticRegression, SVM, and KNN with Random forest as a benchmark experiment. We use grid search to determine the optimal parameters. The main parameters of our Random forest classifier are n_estimators=177, max_depth=21, and max_features=8. In experiment II, each classifier is evaluated with 10-fold cross-validation. K-fold cross-validation randomly divides the dataset into K subsets; in each round, one subset is used as the test set and the remaining K-1 subsets are used as the training set, so that every subset serves as the test set once. Cross-validation enables the model to make better use of the samples. The results of experiments I and II are displayed in Table 4. In experiment III, we compare the method of this paper with Fass (2018) [8], Nayeem (2020) [26], Phung (2021) [32], and JSContana (2021) [33]; the results are shown in Table 5. In experiment IV, we crawl 69,994 JS files from the top 100 websites in GitHub. These samples are used to evaluate the detection of our model in a real-world environment, and the result is shown in Table 6. We also design experiment V, which explains DDIML with the SHAP method to show that it is credible.
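The K-fold partitioning used in experiment II can be sketched with the standard library as follows. The function name `k_fold_indices` and the fixed seed are assumptions made for this example. In practice, a library routine such as scikit-learn's `KFold` would normally be used; this sketch only shows the index bookkeeping.

```python
import random


def k_fold_indices(n_samples: int, k: int, seed: int = 0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 4107 samples and K = 10, matching the dataset size and fold count in this paper.
splits = list(k_fold_indices(4107, 10))
for train, test in splits[:2]:
    print(len(train), len(test))
```

Every sample appears in exactly one test fold, so each sample is used once for testing and K-1 times for training, and the reported metric is the average over the K rounds.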

Table 4

Results in experiment I and II

Partitioned dataset methods	Machine learning methods	Accuracy	Precision	Recall	F1-score	AUC
trainingSet80% : testingSet20%	XGBoost	0.971	0.970	0.969	0.971	0.992
	CART	0.961	0.945	0.974	0.961	0.962
	LogisticRegression	0.922	0.927	0.908	0.922	0.963
	SVM	0.904	0.838	0.989	0.904	0.983
	KNN	0.950	0.931	0.967	0.950	0.983
	RandomForest	0.982	0.983	0.980	0.982	0.998
ten-fold cross-validation	XGBoost	0.970	0.974	0.966	0.969	0.993
	CART	0.942	0.942	0.934	0.936	0.940
	LogisticRegression	0.908	0.943	0.868	0.900	0.947
	SVM	0.891	0.834	0.981	0.900	0.978
	KNN	0.935	0.930	0.939	0.934	0.968
	RandomForest	0.975	0.985	0.965	0.974	0.996

Table 5

Results in experiment III

Methods	Accuracy	Precision	Recall	F1-score	AUC	Detection time for each sample(ms)
Fass(2018)[8]	0.977	0.984	0.969	0.976	0.994	86.622
Nayeem(2020)[26]	0.961	0.969	0.949	0.961	0.979	477.538
Phung(2021)[32]	0.964	0.959	0.967	0.963	0.971	127.300
JSContana(2021)[33]	0.982	0.981	0.981	0.981	0.995	284.946
Our model	0.982	0.983	0.980	0.982	0.998	16.070

Table 6

Results in experiment IV

Total JS files	Detected as normal	Detected as malicious	False positive rate
69,994	69,642	352	0.00505

4.2 Experiment evaluation

The paper regards normal JavaScript codes as negative samples and malicious as positive samples. We adopt five evaluation metrics: accuracy, precision, recall, F1-score, and AUC to evaluate the prediction results. Accuracy represents the proportion of samples that are predicted correctly in all samples, as shown in Formula 2. $Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$ (2)

Where TP indicates the number of correct predictions of positive samples. TN represents the number of correct predictions for negative samples. FP represents the number of negative samples predicted as positive samples. FN is the number of positive samples predicted as negative samples.

Precision is the proportion of correctly predicted positive samples among all samples predicted as positive and is defined in Formula 3. $Precision = TP / (TP + FP)$ (3)

We can see from Formula 4 that Recall is the proportion of the correct positive samples predicted by the classifier to all positive samples. $Recall = TP / (TP + FN)$ (4)

F1-score is the harmonic mean of Precision and Recall, as shown in Formula 5. $F1\text{-}score = \frac{2 \times Precision \times Recall}{Precision + Recall}$ (5)

AUC indicates the area under the ROC curve. The ROC curve represents the relationship between FPR(x-axis) and TPR(y-axis), where FPR is the proportion of negative samples that the classifier predicts incorrectly to all negative samples, as shown in Formula 6, and TPR is equal to Recall. $FPR = FP / (FP + TN)$ (6)
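Formulas 2 to 6 can be computed directly from the four confusion-matrix counts, as in the sketch below. The function name `classification_metrics` and the counts in the example are hypothetical and are not taken from the experiments in this paper. The sketch assumes that every denominator is nonzero.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, precision, recall, F1-score, and FPR from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "fpr": fp / (fp + tn),
    }


# Hypothetical counts: 100 malicious (positive) and 100 normal (negative) samples.
metrics = classification_metrics(tp=90, tn=95, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Precision and recall share the numerator TP but differ in the denominator: precision divides by everything the classifier flagged, and recall divides by everything that is actually malicious.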

4.3 Experiment results

4.3.1 Detection results analysis

Experiment I and II. Table 4 compares the XGBoost, CART, LogisticRegression, SVM, and KNN classifiers with Random Forest. The Random Forest classifier achieves the best accuracy, precision, F1-score, and AUC in both experiments, although SVM attains a higher recall. In Experiment I, the precision of Random Forest is 0.983, the recall is 0.980, and the AUC is 0.998. In Experiment II, we use ten-fold cross-validation to divide the dataset. Compared with Experiment I, the Random Forest classifier indicators are reduced by up to 1.2%, but the evaluation metrics of other classifiers are significantly reduced. Among them, the recall of CART and Logistic regression is reduced by 4%, the accuracy of SVM is reduced by 1.3%, and the recall of KNN is reduced by 2.8%. These results show that the Random Forest classifier has good robustness and the best overall classification performance.

Experiment III. In Experiment III, we run the methods proposed by Fass (2018), Nayeem (2020), Phung (2021), and JSContana (2021) on the dataset in this paper and compare them with our proposed method. Fass (2018) uses the JavaScript abstract syntax tree (AST) to extract features based on semantic information and then uses a Random Forest classifier for prediction. Nayeem (2020) extracts 30 features from three aspects (runtime, structural, and URL lexical) and then applies a 4-layer fully connected network for classification. Phung (2021) uses Doc2Vec to extract feature vectors from JavaScript code and uses two bidirectional RNN layers for prediction. JSContana (2021) converts JavaScript code to an AST, obtains the sequence of grammatical units, and then obtains the contextual representation through dynamic word embedding; TextCNN is then used to extract critical features. The results in Table 5 show that our method has good detection accuracy, precision, recall, and performance. In particular, our model takes only 16.07 ms on average to detect each sample, which is much lower than the other methods. This indicates that our model has a relatively good detection effect.

Experiment IV. As Table 6 shows, 352 of the 69,994 JS files are detected as malicious. We analyze the detected files and discover some suspicious behaviors in them, but we cannot confirm whether they are actually malicious. For example, some files dynamically create script tags and load unknown URLs through the src attribute, and the loaded URLs may be malicious. Some dynamically create iframe tags, audio tags, or other tags in the DOM and set CSS attribute values to make the elements invisible, which makes abnormalities difficult for users to discover. Others generate long obfuscated strings by Base64 encoding, and so on. The FPR is about 0.00505, which supports the feasibility of using our model for detection in the real world.

4.3.2 Model interpretability analysis

In Experiment V, we apply the SHAP method to produce local and global explanations of the model.

Local Explanation. A local explanation describes how a single prediction is generated. This section discusses the local explanation of the model by analyzing single samples. We randomly select two samples from our dataset. Figure 4 shows the contribution of each feature in the two samples to the prediction results. The feature contributions push the sample’s prediction from the base value (set to 0.5) to the final value (shown in bold). Features that push the result up are shown in red; in other words, a red feature pushes the model toward predicting that the sample is malicious. Conversely, features shown in blue push the model toward predicting that the sample is benign. The length of the block for each feature indicates the size of its contribution to the prediction. In Fig. 4(a), the bold value is lower than the base value: our model estimates the probability that the sample contains a drive-by download attack to be 0.02. In other words, the model is 98% certain that the sample is benign. Fig. 4(a) shows that the sample has no String object, no connectors, a longest string no longer than 100 characters, and so on. These features contribute strongly to the prediction that the sample is benign.

Fig. 4

The interpretation of two randomly selected samples in the model.

Similarly, in Fig. 4(b), the bold value is higher than the base value, which shows that our model is 85% certain that the sample is malicious. As depicted in Fig. 4(b), the sample contains many Unicode characters, the length of its longest string reaches 175, digits and letters alternate 51 times in the longest string, the average entropy value of the whole file is 1.118, and so on. These features raise the predicted probability that the sample contains a drive-by download attack. Under the combined influence of all features, the model predicts that the sample is malicious with probability 0.85. The features shown in Fig. 4 are also among the top 20 in the feature ranking.
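The additive reading of Fig. 4 (base value plus per-feature contributions equals the final prediction) can be illustrated with a minimal sketch that computes exact Shapley values for a toy scoring function. The function toy_score, the all-zero baseline, and the chosen sample values are assumptions made for illustration only; they are not the trained random forest, and a practical analysis would rely on the SHAP library rather than this brute-force enumeration.

```python
from itertools import combinations
from math import factorial

# Feature names come from the paper; the scoring rule below is a made-up
# stand-in for the trained classifier.
FEATURES = ["num_concatFlag", "len_maxStr", "num_maxstrDigitletter"]


def toy_score(x):
    """Hypothetical probability-like score that a sample is malicious."""
    score = 0.5
    score += 0.02 * min(x["num_concatFlag"], 10)
    score += 0.001 * min(x["len_maxStr"], 200)
    # Interaction: a long string that also alternates digits and letters.
    if x["len_maxStr"] > 100 and x["num_maxstrDigitletter"] > 20:
        score += 0.1
    return score


def shapley_values(model, sample, baseline):
    """Exact Shapley values; absent features take their baseline value."""
    n = len(FEATURES)

    def value(coalition):
        mixed = {f: (sample[f] if f in coalition else baseline[f]) for f in FEATURES}
        return model(mixed)

    phi = {}
    for feature in FEATURES:
        others = [f for f in FEATURES if f != feature]
        total = 0.0
        for size in range(len(others) + 1):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                gain = value(set(subset) | {feature}) - value(set(subset))
                total += weight * gain
        phi[feature] = total
    return phi


# Assumed baseline (all zeros) and an assumed sample for illustration.
baseline = {"num_concatFlag": 0, "len_maxStr": 0, "num_maxstrDigitletter": 0}
sample = {"num_concatFlag": 5, "len_maxStr": 175, "num_maxstrDigitletter": 51}

base_value = toy_score(baseline)
phi = shapley_values(toy_score, sample, baseline)
prediction = base_value + sum(phi.values())

for name, contribution in phi.items():
    direction = "toward malicious" if contribution > 0 else "toward benign"
    print(f"{name}: {contribution:+.3f} ({direction})")
print(f"base value {base_value:.3f} -> prediction {prediction:.3f}")
```

In this sketch the contributions sum exactly to the gap between the base value and the toy model's output, which is the property that lets the red and blue blocks in a force plot be read as pushes away from the base value.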

Global Explanation. This part analyzes the relationship among feature values, SHAP values, and prediction results across the whole model. The SHAP summary plot combines feature importance and feature influence. Fig. 5 plots every sample point, where the color represents the feature value and the abscissa represents the SHAP value. For features such as num_concatFlag, entropyValue_maxstr, num_maxstrDigitletter, and ratio_digits, larger feature values (redder points) correspond to larger SHAP values. Therefore, as their values increase, these features push the model toward predicting that a sample is malicious. For other features, such as ratio_vowelsfFuncName and ratio_vowelsAlphabet, the SHAP values decrease as the feature values increase, which pushes the model toward predicting that a sample is benign. In addition, SHAP provides a way to calculate feature importance: it takes the mean absolute SHAP value of each feature as that feature's importance. The features shown in the figure are the top 20 by this importance. Compared with Fig. 2, 18 of the top 20 features are the same, apart from slight changes in order.

Fig. 5

Summary plot of the model.
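The importance measure described above (the mean absolute SHAP value of each feature) and the comparison of two top-k rankings can be sketched as follows. The SHAP values and the second ranking are invented for illustration, and the helper names mean_abs_importance and top_k are hypothetical; only the feature names are taken from the paper.

```python
def mean_abs_importance(shap_rows, feature_names):
    """Average the absolute SHAP value of each feature over all samples."""
    n = len(shap_rows)
    return {
        name: sum(abs(row[j]) for row in shap_rows) / n
        for j, name in enumerate(feature_names)
    }


def top_k(importance, k):
    """Return the k feature names with the largest importance."""
    ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:k]]


names = ["num_concatFlag", "entropyValue_maxstr", "ratio_digits", "ratio_vowelsAlphabet"]

# Hypothetical SHAP values: one row per sample, one column per feature.
shap_rows = [
    [0.20, 0.10, 0.02, -0.05],
    [-0.10, 0.15, 0.01, -0.20],
    [0.30, -0.05, 0.03, 0.10],
]

importance = mean_abs_importance(shap_rows, names)
shap_top = top_k(importance, 3)

# Hypothetical top-3 list from a different importance measure.
other_top = ["num_concatFlag", "ratio_vowelsAlphabet", "ratio_digits"]
overlap = len(set(shap_top) & set(other_top))

print(shap_top)
print(f"{overlap} of {len(shap_top)} features appear in both rankings")
```

Because the absolute value is taken before averaging, a feature that pushes some samples toward malicious and others toward benign still ranks as important, which is why the ranking alone does not show the direction of influence.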

Fig. 5 shows the overall relationship between feature values and their impact on the prediction, but not precisely. To analyze the impact of a single feature value on the prediction, we draw SHAP dependence plots. We choose num_concatFlag, num_maxstrDigitletter, entropyValue_maxstr, and ratio_digits as examples. In Fig. 6, the x-axis represents the value of a feature and the y-axis represents its SHAP value. The overall tendency in all plots is that the SHAP value increases as the feature value increases. When attackers split malicious code and then splice it back together, many connectors may appear in a sample. Frequent alternation of digits and letters in the longest string may make the code hard to read. As the entropy value of the longest string increases, the degree of code obfuscation increases. When a sample contains many digits, code obfuscation may be present. These features push the model toward predicting that a sample is malicious. Moreover, Fig. 6 shows that some sample points are scattered, because these features interact with other features.

Fig. 6

Dependence plots about the effect of a single feature on prediction.
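To make the string-level features in Fig. 6 concrete, the sketch below computes a Shannon entropy, a digit-letter alternation count, and a digit ratio for a single string. The exact definitions used for entropyValue_maxstr, num_maxstrDigitletter, and ratio_digits are not restated in this section, so the three functions are assumed approximations with hypothetical names, and the two example strings are invented.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Shannon entropy in bits per character (assumed definition)."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def digit_letter_alternations(text):
    """Count adjacent pairs where a digit follows a letter or vice versa."""
    return sum(
        1
        for a, b in zip(text, text[1:])
        if (a.isdigit() and b.isalpha()) or (a.isalpha() and b.isdigit())
    )


def digit_ratio(text):
    """Fraction of characters that are digits."""
    return sum(ch.isdigit() for ch in text) / len(text) if text else 0.0


plain = "document.getElementById"
obfuscated = "x4f2a9c1e7b3d8"  # invented example of an obfuscated-looking string

for label, s in [("plain", plain), ("obfuscated", obfuscated)]:
    print(
        f"{label}: entropy={shannon_entropy(s):.3f}, "
        f"alternations={digit_letter_alternations(s)}, "
        f"digit ratio={digit_ratio(s):.2f}"
    )
```

Under these assumed definitions, the readable identifier has no digit-letter alternations, while the obfuscated-looking string alternates at every position, matching the intuition that such strings are hard to read.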

According to Fig. 3, we select the feature most closely related to each feature in Fig. 6: num_concatFlag is most relevant to ratio_minStr, num_maxstrDigitletter to len_maxStr, entropyValue_maxstr to ratio_vowelsAlphabet, and ratio_digits to ratio_minStr. The influence of two highly related features on prediction is shown in Fig. 7. In Fig. 7(a), the x-axis is num_concatFlag and the y-axis on the right is ratio_minStr. Fig. 7(a) shows that as the number of connectors increases, the number of short strings increases (the color turns red). This may be because attackers use connectors to splice short strings into malicious code. In Fig. 7(b), as the number of digit-letter alternations in the longest string increases, the longest string in the sample becomes longer. This may indicate code obfuscation in the longest string of some samples. In Fig. 7(c), when the ratio of vowels to letters in the longest string exceeds 0.5, the entropy value of the longest string is generally less than 5, which pushes the model toward predicting that a sample is benign. Fig. 7(d) shows that as the number of short strings increases, the number of digits in some samples also increases, which indicates that some digits appear in the form of strings.

Fig. 7

Dependence plots about the effect of two related features on prediction.
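Selecting the most closely related partner for each feature can be sketched as picking the other feature with the largest absolute correlation. This section does not state which correlation measure underlies Fig. 3, so the use of the Pearson coefficient here is an assumption, the function names are hypothetical, and the feature columns are invented values.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / (spread_x * spread_y)


def most_related(target, columns):
    """Name of the other feature with the largest absolute correlation."""
    candidates = [name for name in columns if name != target]
    return max(candidates, key=lambda name: abs(pearson(columns[target], columns[name])))


# Invented feature columns, one value per sample.
columns = {
    "num_concatFlag": [0, 2, 5, 9, 14],
    "ratio_minStr": [0.1, 0.2, 0.4, 0.6, 0.9],
    "len_maxStr": [120, 30, 200, 45, 80],
}

partner = most_related("num_concatFlag", columns)
print(partner, round(pearson(columns["num_concatFlag"], columns[partner]), 3))
```

The selected partner is the feature that would color the points in a two-feature dependence plot such as Fig. 7(a).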

In summary, we explain the model by analyzing feature importance, feature correlation, the contribution of each feature in a single sample, and the relationship between feature values and SHAP values. The analysis shows that the new features significantly affect the model's predictions and are feasible for detecting drive-by download attacks.

5 Conclusion

This paper proposes DDIML, an interpretable machine learning model based on novel features. The explanations of the model indicate that the extracted features are reasonable and that the model can be trusted. On our dataset, compared with other machine learning methods and previous works, our model gives relatively good results, with a precision of 0.983, a recall of 0.980, and an average detection time of only 16.07 ms per sample. The model takes much less time than methods based on abstract syntax trees and deep learning. However, the recall is only 0.965 under 10-fold cross-validation, so the model still needs to be improved. In future work, we will focus on distinguishing benign obfuscation from malicious obfuscation. We could also extract multiple feature sets and integrate multiple classifiers to detect drive-by download attacks. In addition, we will try more methods of explaining our model to increase its transparency and credibility.

Acknowledgment

This paper is supported in part by the National Natural Science Foundation of China (U20B2045). We thank the reviewer for the positive comments and careful review, which helped improve the manuscript.

References

Andreas Sfakianakis, Christos Douligeris, Louis Marinos (ENISA), Marco Lourenço (ENISA) and Omid Raghimi, ENISA threat landscape report 2018, 2018. https://www.enisa.europa.eu/publications/enisa-threat-landscape-report-2018

Tencent Anti-Virus Lab, 2018 q3 615 quarter security report, 2018. https://tav.qq.com/index/newsDetail/337.html

Cncert internet security threat report, 2020. https://www.cert.org.cn/publish/main/upload/File/CNCERT202009.pdf

Peter Likarish, Eunjin Jung and Insoon Jo, Obfuscated malicious javascript detection using classification techniques, In 2009 4th International Conference on Malicious and Unwanted Software (MALWARE), pages 47–54. IEEE, 2009.

Charles Curtsinger, Benjamin Livshits, Benjamin Zorn and Christian Seifert, Zozzle: Low-overhead mostly static javascript malware detection, In Proceedings of the USENIX Security Symposium, pages 3–3, 2011.

Davide Canali, Marco Cova, Giovanni Vigna and Christopher Kruegel, Prophiler: a fast filter for the large-scale detection of malicious web pages, In Proceedings of the 20th International Conference on World Wide Web, pages 197–206, 2011.

Nayeem Khan, Johari Abdullah and Adnan Shahid Khan, Defending malicious script attacks using machine learning classifiers, Wireless Communications and Mobile Computing, 2017, 2017.

Aurore Fass, Robert Krawczyk, Michael Backes and Ben Stock, Jast: Fully syntactic detection of malicious (obfuscated) javascript, In International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, pages 303–325. Springer, 2018.

Samuel Ndichu, Sangwook Kim, Seiichi Ozawa, Takeshi Misu and Kazuo Makishima, A machine learning approach to detection of javascript-based attacks using ast features and paragraph vectors, Applied Soft Computing 84 (2019), 105721.

10.

Samuel Ndichu, Sangwook Kim and Seiichi Ozawa, Deobfuscation, unpacking, and decoding of obfuscated malicious javascript for machine learning models detection performance improvement, CAAI Transactions on Intelligence Technology 5(3) (2020), 184–192.

11.

Yong Fang, Cheng Huang, Yu Su and Yaoyao Qiu, Detecting malicious javascript code based on semantic analysis, Computers & Security 93 (2020), 101764.

12.

Aurore Fass, Michael Backes and Ben Stock, Jstap: a static pre-filter for malicious javascript detection, In Proceedings of the 35th Annual Computer Security Applications Conference, pages 257–269, 2019.

13.

Xuyan Song, Chen Chen, Baojiang Cui and Junsong Fu, Malicious javascript detection based on bidirectional lstm model, Applied Sciences 10(10) (2020), 3440.

14.

Shoya Morishige, Shuichiro Haruta, Hiromu Asahina and Iwao Sasase, Obfuscated malicious javascript detection scheme using the feature based on divided url, In 2017 23rd Asia-Pacific Conference on Communications (APCC), pages 1–6. IEEE, 2017.

15.

Jack Stokes, Rakshit Agrawal, Geoff McDonald and Matthew Hausknecht, Scriptnet: Neural static analysis for malicious javascript detection, In MILCOM 2019-2019 IEEE Military Communications Conference (MILCOM), pages 1–8. IEEE, 2019.

16.

Junxia Guo, Qiyun Cao, Rilian Zhao and Zheng Li, Improving detection accuracy for malicious javascript using gan, In International Conference on Web Engineering, pages 163–170. Springer, 2020.

17.

Yi-Min Wang, Doug Beck, Xuxian Jiang, Roussi Roussev, Chad Verbowski, Shuo Chen and Sam King, Automated web patrol with strider honeymonkeys, In Proceedings of the 2006 Network and Distributed System Security Symposium, pages 35–49, 2006.

18.

Mitsuaki Akiyama, Makoto Iwamura, Yuhei Kawakoya, Kazufumi Aoki and Mitsutaka Itoh, Design and implementation of high interaction client honeypot for drive-by-download attacks, IEICE Transactions on Communications 93(5) (2010), 1131–1139.

19.

Gaya Jayasinghe, Shane Culpepper and Peter Bertok, Efficient and effective realtime prediction of drive-by download attacks, Journal of Network and Computer Applications 38 (2014), 135–149.

20.

Yinxing Xue, Junjie Wang, Yang Liu, Hao Xiao, Jun Sun and Mahinthan Chandramohan, Detection and classification of malicious javascript via attack behavior modelling, In Proceedings of the 2015 International Symposium on Software Testing and Analysis, pages 48–59, 2015.

21.

Marco Cova, Christopher Kruegel and Giovanni Vigna, Detection and analysis of drive-by-download attacks and malicious javascript code, In Proceedings of the 19th International Conference on World Wide Web, pages 281–290, 2010.

22.

Dewald, Rieck and Krueger, Cujo: efficient detection and prevention of drive-by-download attacks, In Twenty-sixth Computer Security Applications Conference, 2010.

23.

Junjie Wang, Yinxing Xue, Yang Liu and Tian Huat Tan, Jsdc: A hybrid approach for javascript malware detection and classification, In Proceedings of the 10th ACM Symposium on Information, Computer and Communications Security, pages 109–120, 2015.

24.

Xincheng He, Lei Xu and Chunliu Cha, Malicious javascript code detection based on hybrid analysis, In 2018 25th Asia-Pacific Software Engineering Conference (APSEC), pages 365–374. IEEE, 2018.

25.

Dharmaraj Patil and Patil J.B., Detection of malicious javascript code in web pages, Indian Journal of Science and Technology 10(19) (2017), 1–12.

26.

Nayeem Khan, Mohammad Alzaharani and Hushmat Kar, Hybrid feature classification approach for malicious javascript attack detection using deep learning, International Journal of Computer Science and Information Security (IJCSIS) 18(5) (2020).

27.

Scott Lundberg and Su-In Lee, A unified approach to interpreting model predictions, In Advances in Neural Information Processing Systems, pages 4765–4774, 2017.

28.

Christoph Molnar, Interpretable Machine Learning, Lulu.com, 2020.

29.

Cooper, Alexa, 2020. https://www.alexa.com/

30.

Cha, He and Xu, Mjdetector. https://github.com/njuhxc/MJDetector

31.

HynekPetrak, javascript-malware-collection. https://github.com/HynekPetrak/javascript-malware-collection

32.

Mimura and Phung N.M., Detection of malicious javascript on an imbalanced dataset, Internet of Things 13 (2021), 100357.

33.

Zhang, Huang, Li, et al., Jscontana: Malicious javascript detection using adaptable context analysis and key feature extraction, Computers & Security 104 (2021), 102218.