Abstract
JavaScript is a popular scripting language for building web pages. It allows website creators to run arbitrary program code when users visit their websites. Consequently, malicious JavaScript has become one of the biggest threats in the cyber world, and researchers are searching for a convenient and effective way to detect JavaScript malware. This paper proposes a novel method of detecting JavaScript malware by using a high-level fuzzy Petri net (HLFPN). First, web pages are crawled to obtain JavaScript files. Second, the main features are extracted from the JavaScript files. In total, six main features are collected: longest word size, entropy, specific characters, commenting style, function calls, and abstract syntax tree (AST) features. Finally, an HLFPN model is used to determine whether the code is malicious. The experimental results demonstrate the effectiveness of the proposed approach.
Introduction
JavaScript, HyperText Markup Language (HTML) and Cascading Style Sheets (CSS) are the three core technologies of World Wide Web (WWW) content production. JavaScript is a scripting language that can be learned easily, which is why it is widely used on the Internet [1]. According to Microsoft’s security reports, malicious JavaScript was the most frequently detected type of malicious content in the first half of 2013 [2].
Traditional antivirus software often relies on signature-based techniques to detect malware. The signature-based approach is no longer precise because it can neither resist obfuscation nor be applied to evolving malware variants [3]. Recent malware is usually detected by machine learning or even deep learning, typically implemented with artificial neural networks (ANNs). According to the previous research work in [24], ANNs have some drawbacks, including complicated structures, complex computation, and large information storage requirements.
To address these drawbacks, this paper proposes a novel method of detecting JavaScript malware by using a high-level fuzzy Petri net (HLFPN) model. It needs only simple computation, such as matrix multiplication and comparison operations, and it has a simple structure and low information storage requirements. As with other machine learning approaches, datasets must first be collected and their features extracted. First, we crawl JavaScript files from the Internet and extract the features of the JavaScript code. Then, we use the HLFPN model to determine whether the code is malicious. Our purpose is not only to use a simple reasoning algorithm to detect malicious JavaScript, but also to achieve a high degree of precision and recall.
The remainder of this paper is organized as follows. In Section 2, we review the literature related to the proposed JavaScript malware detection system, including definitions of JavaScript malware, JavaScript malware detection techniques, and HLFPN. In Section 3, we describe the framework of the proposed JavaScript malware detection system using HLFPN. The experimental results and analysis are presented in Section 4. Finally, the conclusion and future work are presented in Section 5.
Literature review
In Section 2, we first describe the definitions of JavaScript malware and JavaScript malware detection techniques. Then, we describe the basic definitions of an HLFPN and the fuzzy reasoning algorithm used to make a decision.
JavaScript malware
Andra Zaharia [4] stated that JavaScript allows website creators to run any code they want when users visit their website. Therefore, when the code contains bugs or is improperly implemented, it can create backdoors that leave the website vulnerable to attackers.
Since online browsing is one of the most popular activities among users, cybercriminals target it. Online attackers often redirect users to compromised websites, and they can hack legitimate websites easily. According to Sophos, 82% of malicious sites are hacked legitimate sites [5].
When a user unintentionally browses an infected website, malicious JavaScript files might be downloaded onto the user’s computer. A website is considered infected in any of the following cases: cybercriminals have injected malicious JavaScript code into the website and its database; attackers display online advertisements or banners on the website through malicious JavaScript code; or cyber attackers have loaded malware or malicious content from remote servers.
Spreading JavaScript malware
According to the Heimdal Security blog [4], there are eight ways to spread malware in current cyber attacks:
Injecting malicious JavaScript code into legitimate websites: The code redirects users to malware sites or to exploit servers that trigger malware infections.
Injecting malicious JavaScript code into an online advertising network: The code appears in online banner ads and redirects users to malicious websites.
Hidden iFrames: They download JavaScript malware from compromised websites, and the malware then tries to infect the PC by executing code.
Compromising JavaScript code to trigger an infected download: This is one of the most common scams on the Internet, such as fake antivirus products, and it may compromise the system irreversibly.
Drive-by download: Infected JavaScript files are used to start malware infections.
Fake software pop-up messages: Cybercrooks can forge them easily, and they look real and convincing.
Malicious JavaScript attachments: These attachments are run through Windows programs and may trigger the malicious infection outside the browser.
Browser plugins: They may be infected, or they may load malware-laden content from an external source.
JavaScript malware detection techniques
In this Sub-section, we discuss some recent approaches to detect malicious JavaScript code.
Signature-based technique
Yinxing Xue et al. [6] have stated that the signature-based techniques are usually adopted by antivirus software, which generate a hash value or fingerprint for a malicious sample. Although these approaches can provide a certain level of security for known malware, they fail to detect obfuscated variants with different hash values or fingerprints.
Pattern-based technique
YoungHan et al. [7] have proposed an approach that detects obfuscated strings in a malicious web page. By analyzing malicious JavaScript patterns, they extracted three features for detecting obfuscated strings, namely, N-gram, Entropy, and Word Size. N-gram measures how often each byte code is used in a string. Entropy measures the distribution of the byte codes used. Word Size indicates whether a very long string is used. However, because new malicious scripts are published continually, maintaining a pattern-based system becomes a tedious task.
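The three pattern-based measurements can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation of [7]: the function names are hypothetical, 1-grams (single bytes) stand in for the N-gram statistics, entropy is computed in bits per byte, and a "word" is assumed to be a whitespace-delimited token.

```python
import math
from collections import Counter


def byte_frequencies(text: str) -> Counter:
    """1-gram statistics: how often each byte value occurs in the string."""
    return Counter(text.encode("utf-8"))


def shannon_entropy(text: str) -> float:
    """Shannon entropy (bits per byte) of the byte distribution."""
    data = text.encode("utf-8")
    if not data:
        return 0.0
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in Counter(data).values())


def longest_word_size(text: str) -> int:
    """Length of the longest whitespace-delimited token (assumed word definition)."""
    return max((len(word) for word in text.split()), default=0)
```

A string made of one repeated character has zero entropy, while long encoded payloads tend to produce a single very long token, which is what the word-size measurement is meant to expose.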
Machine learning
Machine learning [8] is a scientific discipline that encompasses various methods and related algorithms whose purpose is to learn from datasets. Traditional algorithms, such as K-Nearest Neighbor (KNN), Support Vector Machine (SVM), and Naive Bayes (NB), usually rely on feature engineering, a semi-manual task of determining which properties of the datasets are extracted and fed to an algorithm as input. The algorithm then learns from these inputs to train a model. In other words, machine learning aims to find the hidden rules behind the available datasets by using various models. Once a model has been trained, it can be used to make predictions on unseen datasets.
(1) K-Nearest Neighbor
The KNN classifier is one of the simplest classification algorithms [9]. It classifies a sample by measuring the distance between the feature values of different samples. If most of the k most similar samples (i.e. the nearest neighbors) in the feature space belong to a class, the sample is also assigned to this class. However, its classification accuracy is comparatively low.
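The majority-vote rule described above can be sketched as follows. This is only an illustration: the Euclidean distance, the function name `knn_predict`, and the two-feature samples are assumptions made for the example and are not taken from [9].

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Label `query` by majority vote among its k nearest training samples.

    `train` is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical two-feature samples, for illustration only.
samples = [
    ((0.1, 0.2), "benign"),
    ((0.2, 0.1), "benign"),
    ((0.9, 0.8), "malicious"),
    ((0.8, 0.9), "malicious"),
    ((0.7, 0.7), "malicious"),
]
```

Calling `knn_predict(samples, (0.85, 0.85))` looks at the three closest samples, all of which are labeled malicious in this toy set.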
(2) Support Vector Machine
The SVM classifier constructs a hyperplane in the feature space that maximizes the margin between the samples of the benign and malicious classes [10]. However, it requires more complicated computation.
(3) Naive Bayes
The NB classifier uses Bayes’ theorem with strong assumptions of independence between features to predict whether a sample is benign or malicious. It uses a kernel function to estimate the distribution of numerical attributes rather than assuming a normal distribution [11]. However, it is more case-sensitive.
High-level fuzzy Petri net
Dr. Carl Petri proposed Petri net theory in 1962 in his dissertation, “Kommunikation mit Automaten” [Communication with Automata] [12], in which he formulated a principle of communication between asynchronous components of a computer system. The Petri net is a graphical and mathematical modeling tool for systems that are concurrent, asynchronous, distributed, parallel, nondeterministic, and stochastic, and it can be used to model and analyze diverse systems [12]. However, as information systems have advanced, the descriptions of Petri nets have become more and more complex. Therefore, researchers have successively extended Petri net theory, proposing, for example, the colored Petri net [13], timed Petri net [14], fuzzy Petri net [15], and high-level fuzzy Petri net [16] [17] [18] [19] [20] [21]. This paper adopts the HLFPN model to make decisions in malware detection.
(1) Definitions
The basic definitions and fuzzy reasoning approach are presented as follows:
P: a finite set of places.
T: a finite set of transitions, where P ∪ T ≠ ∅.
The flow relation: a finite set of arcs, each representing the fuzzy set (i.e. fuzzy term) of an antecedent or a consequent, where the positive arcs (i.e. THEN parts) are denoted by ⟶.
C: a finite set of linguistic variables, e.g. X, Y, and Z, where X = {x_1, x_2, x_3, ..., x_n}, Y = {y_1, y_2, y_3, ..., y_m}, and Z = {z_1, z_2, z_3, ..., z_q}.
V: a finite set of fuzzy truth values known as the fuzzy relational matrix between the antecedent and the consequent of a fuzzy production rule.
α: an association function, mapping from places to linguistic variables, α(p_i) = c_i, i = 1, ..., L, where C = {c_i} is a set of linguistic variables in the knowledge base (KB), and L is the number of linguistic variables in the KB.
An association function, mapping from the flow relations to the fuzzy truth values between zero and one.
An association function, mapping from transitions to fuzzy relational matrices.
I(t): the set of input places of transition t. O(t): the set of output places of transition t. I(p): the set of input transitions of place p. O(p): the set of output transitions of place p.
In the IF-THEN-ELSE rule, the ELSE part is denoted by a negation arc ∘ ⟶, and the fuzzy set in the antecedent (i.e., IF part) must be complemented and denoted by ¬, i.e. the negated fuzzy set = 1 – the fuzzy set in the antecedent.
Mem(p): P⟶[0,1], which assigns to each place a real value Mem(p) = DOM(α(p)), where DOM represents the degree of membership in the associated proposition, and data tokens are available in P.
In HLFPN, ∀ transition t, V(t) = ∧ (fuzzy sets in I(t)); and ∀ place p, V(p) = ∨ (fuzzy sets in I(p)), where ∧ denotes the min operation and ∨ denotes the max operation. This rule is denoted by ∘.
In HLFPN, ∀ place p_i ∈ P, if ∀ t_j ∈ T, p_i ∉ O(t_j), then p_i is called an input place (IP); if ∀ t_j ∈ T, p_i ∉ I(t_j), then p_i is called an output place (OP); otherwise, p_i is called a hidden place.
(2) Fuzzy Reasoning
In the fuzzy reasoning method presented in [21], fuzzy production rules are used. Mamdani’s fuzzy implication rule type [22] is applied throughout this paper. In general, a fuzzy production rule describes the fuzzy relationship between the antecedent and the consequent. Let R be a set of fuzzy production rules, where R = {R_1, R_2, ..., R_n}. The general form of the ith fuzzy production rule R_i is shown as follows:
R_i: IF d_j (X is A), THEN d_k (Y is B); ELSE, d_w (Z is C) ... (V).
where “X is A”, “Y is B” and “Z is C” are propositions; X is called the input linguistic variable; Y and Z are called the output linguistic variables; A is called the input fuzzy set; B and C are called the output fuzzy sets; the fuzzy truth values of the propositions “X is A”, “Y is B” and “Z is C” are restricted to [0, 1]; “X is A” is the antecedent of fuzzy production rule R_i; and “Y is B” and “Z is C” are the consequents of fuzzy production rule R_i. Let V denote the fuzzy relational matrix between the antecedent and the consequent of a fuzzy production rule.
Let us consider the fuzzy production rule R1 shown as follows:
R1: IF the temperature (X1) is hot (A1) AND the sky (X2) is cloudy (A2), THEN the humidity (Y) is high (B).
Based on the transformation procedure presented in [19], we can transform the above-referenced fuzzy production rule R1 into the following first-order logic form:
Then, the HLFPN model is shown in Fig. 1.

Fig. 1. HLFPN for Example 1.
Assume that the fuzzy sets A1, A2 and B are shown as follows:
By the cylindrical extension operations [23], we can obtain the fuzzy set A of antecedent, shown as follows:
Then, the fuzzy relational matrices V1(t1), V2(t2) and V3(t3) between the antecedent and consequent of fuzzy production rule R1 can be obtained, shown as follows:
The most widely used fuzzy reasoning method is the max–min compositional inference ∘ [24]. Assume that the input fuzzy sets
Then, we can get
Finally, we can obtain
The above description is the fuzzy reasoning process of HLFPN.
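The reasoning steps outlined above can be illustrated with a short Python sketch. It assumes a single antecedent fuzzy set (the cylindrical extension over two antecedents is not shown), builds a Mamdani relational matrix with the min operation, and applies the max–min composition to an observed input. The function names and all membership values are hypothetical; they are not the values of the example in this section.

```python
def mamdani_relation(a, b):
    """Fuzzy relational matrix V with V[i][j] = min(a[i], b[j])."""
    return [[min(ai, bj) for bj in b] for ai in a]


def max_min_compose(a_in, v):
    """Max-min composition: out[j] = max over i of min(a_in[i], v[i][j])."""
    return [
        max(min(ai, row[j]) for ai, row in zip(a_in, v))
        for j in range(len(v[0]))
    ]


# Hypothetical membership vectors (not the values of the paper's example).
A = [0.2, 0.7, 1.0]           # antecedent fuzzy set
B = [0.1, 0.6, 0.9]           # consequent fuzzy set
V = mamdani_relation(A, B)    # fuzzy relational matrix of the rule
A_observed = [0.0, 0.5, 0.8]  # observed input fuzzy set
B_inferred = max_min_compose(A_observed, V)
```

Only min, max, and comparisons are involved, which is the kind of simple computation the HLFPN approach relies on.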
(3) Fuzzy Reasoning Algorithm
In this subsection, we briefly describe the fuzzy reasoning algorithm (FRA) from [5], which decides whether a fuzzy relational matrix exists between the antecedent and the consequent of a fuzzy production rule.
In this section, the procedure of using HLFPN to distinguish malicious from benign JavaScript code is introduced. As described in [25], most malicious JavaScript is obfuscated, and the obfuscation features have become much more complicated. Our goal is to detect malicious JavaScript with a high degree of precision and recall. The procedure has three stages: data collection, feature extraction, and detection of malicious JavaScript using HLFPN, as shown in Fig. 3.

Fig. 2. Procedure of fuzzy reasoning.

Fig. 3. Flowchart of JavaScript malware detection using HLFPN.
We wrote a Python [26] program to download the web pages, since Python is a popular programming language for developing web crawlers. After downloading the web pages, we use Beautiful Soup [27], a Python library, to pull the JavaScript code out of the HTML files.
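The extraction step can be sketched as follows. To keep the example free of third-party dependencies, it swaps Beautiful Soup for the standard-library `html.parser` module, so it is not the program used in the experiments; the class and function names are hypothetical, and downloading the page is left out.

```python
from html.parser import HTMLParser


class ScriptExtractor(HTMLParser):
    """Collect the text of inline <script> elements and the src of external ones."""

    def __init__(self):
        super().__init__()
        self.inline_scripts = []
        self.external_sources = []
        self._in_script = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.external_sources.append(src)
            self._in_script = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_script:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_script:
            code = "".join(self._buffer).strip()
            if code:
                self.inline_scripts.append(code)
            self._in_script = False


def extract_scripts(html: str):
    """Return (inline script bodies, external script URLs) found in `html`."""
    parser = ScriptExtractor()
    parser.feed(html)
    parser.close()
    return parser.inline_scripts, parser.external_sources
```

External scripts are only listed by URL here; a crawler would fetch them in a separate step before feature extraction.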
(1) Benign JavaScript Collection
We use the Alexa top 500 websites [28] as the initial seeds and crawl a portion of these websites. We download only the textual content and ignore the visual and aural content.
(2) Malicious JavaScript Collection
Because malicious scripts are short-lived, it is difficult to collect malicious JavaScript samples. We download the JavaScript malware collection on GitHub [29] and collect instances of malicious scripts from MalwareURL [30].
Feature selection
The importance of the features directly affects the accuracy of a classifier. Through the analysis of JavaScript and the drive-by download attack [31], we extract six features of the JavaScript, including longest word size, entropy, specific characters, commenting style, function calls, and abstract syntax tree (AST) features. From these features, a total of seven values are stored, as shown in Table 1.
Table 1. Feature name and description list
In the decision method, seven input values are used, namely, the longest word size of the JavaScript code (lw), the uncertainty of the JavaScript code (ent), the occurrence frequency of specific characters (sc), the usage frequency of comment styles (cs), the usage frequency of function calls with security risks (fc), the maximum AST depth (de), and the maximum AST width (wi). In addition, the decision output is divided into two classes, namely, “Benign” and “Malicious”.
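As a rough illustration, three of these values (sc, cs, and fc) could be computed with regular expressions as sketched below. The character set, the list of risky functions, and the function names are assumptions made for this example; the exact definitions belong to Table 1. The AST values (de and wi) need a JavaScript parser and are not covered.

```python
import re

# Assumed patterns, for illustration only; the paper's exact lists are not given here.
SPECIAL_CHARS = set("%\\+|[]{}()")
RISKY_CALLS = ("eval", "unescape", "document.write", "setTimeout", "fromCharCode")
COMMENT_RE = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)


def specific_char_ratio(code: str) -> float:
    """sc: fraction of characters that belong to SPECIAL_CHARS."""
    if not code:
        return 0.0
    return sum(ch in SPECIAL_CHARS for ch in code) / len(code)


def comment_count(code: str) -> int:
    """cs: number of // and /* */ comments (naive: ignores string literals)."""
    return len(COMMENT_RE.findall(code))


def risky_call_count(code: str) -> int:
    """fc: number of call sites of the functions listed in RISKY_CALLS."""
    return sum(
        len(re.findall(r"\b" + re.escape(name) + r"\s*\(", code))
        for name in RISKY_CALLS
    )
```

A production extractor would tokenize the script first so that comment markers inside string literals are not miscounted.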
The membership functions of High, Medium, and Low, and of the malicious decision, are based on Equation (2). In the analysis, the membership functions of the input parameters are defined on the interval between 0 and 1. Thus, the values of the input parameters are converted into values between 0 and 1 using Equations (3)–(5).
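The following sketch shows what such a conversion and fuzzification might look like. It does not reproduce Equations (2)–(5): min-max normalization, triangular membership functions, and the breakpoints 0, 0.5, and 1 are assumptions chosen purely for illustration, and the function names are hypothetical.

```python
def min_max_normalize(value, lower, upper):
    """Scale `value` into [0, 1] given an assumed observed range [lower, upper]."""
    if upper == lower:
        return 0.0
    return min(1.0, max(0.0, (value - lower) / (upper - lower)))


def triangular(x, left, peak, right):
    """Triangular membership; left == peak or peak == right gives a shoulder."""
    if x < left or x > right:
        return 0.0
    if x == peak:
        return 1.0
    if x < peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def fuzzify(x):
    """Degrees of membership of a normalized input in Low, Medium, and High."""
    return {
        "L": triangular(x, 0.0, 0.0, 0.5),
        "M": triangular(x, 0.0, 0.5, 1.0),
        "H": triangular(x, 0.5, 1.0, 1.0),
    }
```

With these assumed shapes, a normalized value of 0.25 belongs equally to Low and Medium and not at all to High.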
In the previous subsection, we defined the fuzzy sets and their corresponding membership functions. Each input parameter is fed to the fuzzifier to obtain its degree of membership in each set. According to the magnitude of the input parameter, it is turned into an “IF ... THEN ...” statement to build the rule base. The activity diagram is shown in Fig. 4.

Fig. 4. The activity diagram of fuzzy reasoning.
Assume that the input linguistic variables are lw, ent, sc, cs, fc, de, and wi, with the fuzzy terms high (H), medium (M), and low (L). Assume also that the decision (D) is an output linguistic variable with the fuzzy terms strong (S), indeterminate (I), and weak (W). The fuzzy production rules are defined as follows:
R1: IF lw is H THEN D is S
R2: IF lw is M THEN D is I
R3: IF lw is L THEN D is W
R4: IF ent is H THEN D is W
R5: IF ent is M THEN D is I
R6: IF ent is L THEN D is S
R7: IF sc is H THEN D is S
R8: IF sc is M THEN D is I
R9: IF sc is L THEN D is W
R10: IF cs is H THEN D is S
R11: IF cs is M THEN D is I
R12: IF cs is L THEN D is W
R13: IF fc is H THEN D is S
R14: IF fc is M THEN D is I
R15: IF fc is L THEN D is W
R16: IF de is H THEN D is S
R17: IF de is M THEN D is I
R18: IF de is L THEN D is W
R19: IF wi is H THEN D is S
R20: IF wi is M THEN D is I
R21: IF wi is L THEN D is W
Then, the fuzzy production rules are transformed into the HLFPN model, as shown in Fig. 5; and the parameters are described in Table 2.

Fig. 5. HLFPN model representing 21 fuzzy production rules.
Table 2. Description of parameters
According to the proposed HLFPN model, the reasoning is done by using fuzzy technical indices and fuzzy production rules in the rule base. Based on the reasoning procedure, we use the standard operators for simple calculations. Finally, we use the weighted average defuzzification method to get the output and determine whether the JavaScript is benign or malicious.
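A compact sketch of this decision step is given below. The rule table mirrors R1–R21, and each decision term is aggregated with the max operation. The representative values assigned to W, I, and S, the 0.5 decision threshold, and the function names are assumptions made for illustration; they are not parameters reported in this paper.

```python
FEATURES = ("lw", "ent", "sc", "cs", "fc", "de", "wi")

# R1-R21: input term -> decision term; entropy (R4-R6) is inverted.
DEFAULT_MAP = {"H": "S", "M": "I", "L": "W"}
RULES = {feature: dict(DEFAULT_MAP) for feature in FEATURES}
RULES["ent"] = {"H": "W", "M": "I", "L": "S"}

# Assumed representative values of the decision terms (hypothetical).
CENTERS = {"W": 0.0, "I": 0.5, "S": 1.0}


def infer(memberships):
    """Fire all rules and aggregate each decision term with max."""
    strength = {"W": 0.0, "I": 0.0, "S": 0.0}
    for feature, degrees in memberships.items():
        for term, degree in degrees.items():
            decision = RULES[feature][term]
            strength[decision] = max(strength[decision], degree)
    return strength


def defuzzify(strength):
    """Weighted average of the decision-term centers."""
    total = sum(strength.values())
    if total == 0:
        return 0.0
    return sum(CENTERS[term] * s for term, s in strength.items()) / total


def classify(memberships, threshold=0.5):
    """Map the crisp score to a class using an assumed threshold."""
    score = defuzzify(infer(memberships))
    return "Malicious" if score > threshold else "Benign"
```

For instance, `classify({"lw": {"H": 0.8, "M": 0.2, "L": 0.0}, "ent": {"H": 0.0, "M": 0.3, "L": 0.7}})` yields strengths of 0.8 for S and 0.3 for I, so the weighted average lies well above the assumed threshold.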
In Section 4, we verify the proposed system and demonstrate that the proposed JavaScript malware detection system is feasible. First, we show an example to illustrate the HLFPN model for fuzzy reasoning. Then, we crawl the datasets and apply our proposed approach. The experimental environment and performance analysis are also discussed.
Experimental environment
We collected benign and malicious datasets and fed them into our system for detection. The system was implemented in Python, and all the libraries we required were available as Python libraries [32].
Because this paper focuses on a novel malware detection approach, we built only a simple user interface. The Python interface we used to perform the experiments is shown in Fig. 6.

Fig. 6. Python interface.
In this Sub-section, we present the experimental procedure of our approach as follows:
(1) Collect data and extract features
As stated in Section 3, we must first crawl the web pages to get the JavaScript files, then extract the features of the JavaScript code, and save them as a feature file (i.e. train(B).csv). A Comma-Separated Values (CSV) file stores tabular data in plain text format. Each line of the file is a data record, and the fields of each record are separated by commas, as shown in Fig. 7.

Fig. 7. The records of CSV file.
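Writing and reading such a feature file can be sketched with Python's standard `csv` module. The column names follow the seven input values plus a label column, which is an assumed layout rather than the exact layout of train(B).csv; the record shown is a placeholder, and an in-memory buffer stands in for the file.

```python
import csv
import io

# Assumed column layout: the seven input values plus a class label.
FIELDS = ["lw", "ent", "sc", "cs", "fc", "de", "wi", "label"]


def write_records(records, handle):
    """Write feature records (dicts keyed by FIELDS) as CSV with a header row."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)


def read_records(handle):
    """Read CSV rows back as (feature dict of floats, label) pairs."""
    rows = []
    for row in csv.DictReader(handle):
        label = row.pop("label")
        rows.append(({key: float(value) for key, value in row.items()}, label))
    return rows


# Hypothetical record; the values are placeholders, not data from the paper.
buffer = io.StringIO()
write_records(
    [{"lw": 120, "ent": 4.2, "sc": 0.31, "cs": 0, "fc": 3, "de": 9, "wi": 14, "label": "M"}],
    buffer,
)
buffer.seek(0)
loaded = read_records(buffer)
```

With a real file, the same functions can be used with `open(path, newline="")` in place of the buffer.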
(2) Detect JavaScript malware by HLFPN
In this step, we conducted two experiments. In both experiments, the ratio of training datasets to testing datasets is 6 to 4. In the first experiment, we used a fixed number of datasets for training and testing, namely 32 benign and 32 malicious samples. In the second experiment, the datasets were selected randomly.
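A 6-to-4 split of this kind can be sketched as follows. The function name and the use of a seeded shuffle are assumptions for illustration; how samples were actually assigned to the training and testing sets in the experiments is not specified here.

```python
import random


def split_6_to_4(samples, seed=None):
    """Shuffle and split samples into 60% training and 40% testing."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * 0.6)
    return shuffled[:cut], shuffled[cut:]
```

Passing a fixed `seed` makes the split reproducible, which helps when results from several runs are averaged.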
(3) Measure the precision and recall
Experiment 1 reports its results directly, whereas Experiment 2 reports the average of the results over five runs.
As tabulated in Table 3, we conducted two experiments and evaluated five tests in total. Each test contains a different number of benign and malicious datasets. After our approach was performed, we obtained 308 Malicious and 261 Benign decision outputs.
Table 3. Information of datasets
The numbers of decision outputs are almost the same as the numbers of testing datasets. However, in order to perform a fair comparative evaluation, we compare our approach with other common classifiers in terms of precision and recall, which are terms derived from the confusion matrix [33]. Precision is the ratio of the number of correct detections (true positives) to the total number of correct and incorrect detections (true positives plus false positives). Recall is the ratio of the number of correct detections to the total number of samples in the testing datasets that should have been detected (true positives plus false negatives). The confusion matrix is shown in Fig. 8, and the formulas for precision and recall are given in Equations (6)–(7).

Fig. 8. Confusion matrix.
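The two measures can be computed from the actual and predicted labels as sketched below, using the standard confusion-matrix definitions with “Malicious” treated as the positive class. The function name and the label strings are assumptions for this example.

```python
def precision_recall(actual, predicted, positive="Malicious"):
    """Return (precision, recall) for the given positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

When no sample is predicted or labeled as positive, the corresponding ratio is undefined; the sketch returns 0.0 in that case as a convention.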
Our approach has successfully detected the malicious code and benign code from the testing datasets. Table 4 presents the performance comparison between our approach and other common classifiers. And Table 5 presents the time complexity of our approach and other common classifiers.
Table 4. Comparative analysis of JavaScript malware detection results
Table 5. Time complexity of our approach and other common classifiers
Note: m = no. of the training datasets. n = dimensionality of the features. 3n = no. of transitions. p = no. of places.
Proof: To compute the time complexity of HLFPN, note that for each feature (e.g. lw) the number of transitions leading to the consequent is 3. So, for n features, the maximum number of transitions is 3n; with the seven input values used here, this gives 21 transitions, one per fuzzy production rule. For p places, the maximum computation time is therefore 3np. Hence, the upper bound of the time complexity of HLFPN is O(3np), where p denotes the number of places and 3n denotes the maximum number of transitions.
According to the performance analysis results in Table 4, the average precision and recall of our approach reach 94.87% and 94.8%, respectively. Moreover, our approach achieves precision and recall as high as 100% in T1. However, on large-scale datasets or datasets with a large imbalance, SVM performs slightly better than our approach. This indicates that our approach is suitable for situations where the training datasets are smaller and the numbers of malicious and benign datasets are equal. Nevertheless, from the time complexity in Table 5, we can see that our approach has the lowest time complexity when the number of training datasets is more than 24. Thus, the experimental results still indicate that our approach is feasible.
This paper has proposed a JavaScript malware detection system using an HLFPN model. To do this, we first crawl web pages to obtain JavaScript files and extract malicious features from them. The features are then used as the input to the HLFPN model to determine whether the code is malicious or benign.
The contributions of this study are as follows. We have successfully detected JavaScript malware by using a simple reasoning algorithm and simple mathematics. Our approach is not only quick to set up, but also fast to execute. The experimental results have shown that our approach is more acceptable for detecting malicious JavaScript code than other existing approaches.
However, malicious techniques have become so complicated that our approach cannot cover every one of them. We will continue this work in the future, using an even better reasoning approach to detect and classify malware with high precision and recall. Furthermore, we will turn this approach into something truly practical, such as a webpage plugin or a malware detection component in antivirus software.
Acknowledgments
The authors are very grateful to the anonymous reviewers for their constructive comments, which have improved the quality of this paper. This work was also supported by the Ministry of Science and Technology, Taiwan, under grant MOST 107-2221-E-845-002-MY3.
