Abstract
This paper proposes CATTB, a parallel joint algorithm model that combines a convolutional neural network (CNN), an attention mechanism, and a bidirectional independently recurrent neural network (BiIndRNN). The algorithm extracts the relocation feature and the “texture fingerprint” feature, which expresses the similarity of the binary content of malicious web page URLs (Uniform Resource Locators). It also uses the word vector tool word2vec to train URL word vector features and extracts static lexical features of the URL. First, the CNN extracts deep local features. Second, the attention mechanism adjusts the weights and the BiIndRNN extracts global features. Finally, softmax is used for classification. In this way, more comprehensive features are extracted from different angles and with different methods. The experimental results show that the proposed CATTB algorithm achieves higher accuracy in malicious web page detection than the methods of other researchers and than the other algorithms compared.
Introduction
With the continuous development of Internet technology, the network has penetrated into every corner of people’s lives, and the number of web links grows day by day. Detecting malicious web pages has therefore become both more important and more difficult. Driven by economic interests, malicious web pages are multiplying and diversifying rapidly, and they pose a serious threat to personal privacy and property security. In recent years, cases of personal information and property being stolen over the Internet have occurred frequently, so strengthening research against malicious web pages has become an urgent task for the development of the network. How to analyze and detect malicious web pages correctly and effectively has become an important research topic. Previous research is mainly based on traditional algorithm models, whose scope of development is narrow and whose progress is slow.
The emergence of deep learning, however, has brought new opportunities. Deep learning has not only influenced how traditional algorithm models are trained, but has also been used in natural language processing, speech recognition, image processing, and other fields. It has also been applied to web page detection, but previous studies typically rely on a single model, and series joint algorithms cannot fully learn all the key features. Therefore, this paper proposes the CATTB parallel joint algorithm, which learns key features more fully than series joint algorithms. In deep models, when the input data is sequential and has dependencies, a CNN [1] does not model the correlation between one input and the next, so its outputs are largely independent of one another; as a result, a CNN alone generally does not perform well on such data. The LSTM [2] model takes full account of the relationship between sequence contexts and long-distance information. However, it encodes the input sequence into a fixed-length vector regardless of the sequence length, which affects decoding. The attention mechanism [3] retains the advantages of LSTM while making up for this disadvantage. The Independently Recurrent Neural Network (IndRNN) [4] can handle sequences of more than 5000 steps, so it can retain long-term memory and process long sequences. IndRNN can also make good use of non-saturating activation functions such as ReLU, and it is very robust after training. Its vanishing and exploding gradient problems can be effectively addressed by regulating gradient back-propagation through time. The basic idea of BiIndRNN is to present each sequence forward and backward to two separate hidden states in order to capture past and future information. The two hidden states are then joined to form the final output, which provides more comprehensive information.
A model that combines these three models, so that their strengths and weaknesses complement one another, can therefore effectively extract the intrinsic features of the URL.
Section 2 describes work related to malicious web page detection. Section 3 introduces the algorithm model, describing each type of feature and the model in detail. Section 4 presents the experimental analysis, covering the experimental data, the experimental environment, and the optimal parameter settings, and compares different features, the parallel and series joint algorithms, different pooling layer sizes, and traditional single algorithm models. Section 5 summarizes the work of this paper.
Related work
Researchers have proposed many detection technologies and solutions for the problem of malicious web page detection. Sha et al. [5, 6] proposed a self-learning lightweight web page classification method, Self-learning Light-Weight (SLW). SLW introduced the concept of access relations for the first time and has the characteristics of feedback and self-learning. Starting from an existing set of malicious web pages, SLW automatically discovers users with low credibility and their access relationships, and then uses the access relationships of these low-credibility users to other web pages to discover a set of unknown malicious URLs. Lin et al. [7] proposed a segmentation model that quickly detects malicious URLs by constructing an inverted index with 3-grams as terms. This method uses a Jaccard-based random domain name recognition technique to identify malicious URLs generated from random domain names. However, a simple Jaccard index is not sufficient to detect every malicious URL that contains a random domain name. Liu et al. [8] designed a multi-pattern matching algorithm suitable for large-scale URL filtering. Based on the classical multi-pattern matching algorithm and the characteristics of URL rules, this algorithm introduces two optimization techniques: optimal window selection and pattern string grouping protocol. However, real pattern strings such as URLs do not satisfy the assumption of a “uniform distribution of pattern strings and text characters” on which classic string matching algorithms are designed, and the evaluation was carried out mainly on random datasets. Zhou et al. [9] used the decision tree method to analyze a large number of malicious and normal web pages in depth.
In addition to the characteristics of the web pages themselves, a variety of new features have been selected to detect malicious web pages, including the Google PageRank value [10], the number of search results [11], Alexa traffic information [12], domain name information [13], and the Web of Trust (WOT) reputation value. Wang et al. [14] used a client-side honeypot dynamic analysis method to implement a verification system for web pages with embedded Trojans, using API HOOK [15] and sandbox technology [16], and an HTTP Trojan horse network tracking system, using Browser Helper Object (BHO) technology [17]. MN Feroz et al. [18] constructed a reliable URL classification framework in which bigrams were used as indicators for classifying phishing URLs, and demonstrated the effectiveness of bigrams. However, the models established in that work and in related work are susceptible to evasion by attackers. SC Jeeva et al. [19] analyzed the characteristics of URLs using association rule mining. The rules obtained are interpreted to emphasize the features that are more common in phishing URLs, and the results highlight the useful features available in phishing URLs. M Akiyama et al. [20] developed a honeypot-based monitoring system specifically designed to monitor URL redirection behavior, and also proposed practical countermeasures against malicious URL redirection. Network security operators can take advantage of the information obtained from such honeypot-based monitoring systems. Most traditional malicious web page detection methods are based on blacklists [21], reputation systems [22], host information [23, 24], lexical features [25, 26], honeypot technology [27, 28], and intrusion detection technology [29, 30]. However, these traditional methods are unable to cope with the ever-increasing number and diversity of malicious web pages. Therefore, this paper integrates multiple features and first uses a CNN to extract deep local features.
Second, the attention mechanism adjusts the weights and BiIndRNN extracts the global features. Finally, softmax is used for classification, so that multi-layer features are extracted automatically from the data. The results show that this method is superior to traditional malicious web page detection methods.
The innovations in this article are as follows:
(1) This paper proposes a BiIndRNN algorithm to obtain more comprehensive temporal features.
(2) This paper extracts the relocation feature, which places each URL into boxes according to rules derived from the URLs. This enriches the feature extraction methods and improves the accuracy of malicious web page detection.
(3) This paper extracts the “texture fingerprint” feature of malicious web pages and combines it with image processing technology to detect malicious web pages.
(4) This paper uses the word vector tool word2vec to train URL word vector features, so that natural language processing technology is applied to malicious web page detection.
(5) This paper proposes a parallel joint model algorithm to replace the traditional series (tandem) joint model algorithm. This addresses the problem that the series joint algorithm cannot fully learn all the features, extracts features more comprehensively, and improves the accuracy of malicious web page detection.
Algorithm model
Feature analysis
The extracted features directly affect the accuracy and precision of malicious web page detection, so the features play a crucial role. Based on the extraction of static features, this paper extracts internal features and global features. The extracted features include relocation features, texture fingerprint features, URL word vector features, host information features, and URL information features.
Relocation features
The relocation feature re-arranges each URL according to certain rules so that the URL can be examined from a new perspective, which improves the accuracy of malicious web page detection. First, a sufficient number of boxes is prepared and numbered in order. Each URL is then re-arranged according to the rules and placed into the prepared boxes; the order of the boxes must not be disturbed, so that a new form of the URL is obtained. For example, suppose there are 229 boxes, each numbered and placed in order. This paper sorts the components of the URLs by their total number of occurrences in the dataset, and each component then enters the box given by its serial number. The empty boxes are filled with zeros, and the resulting form is called the relocation feature. It is combined with the texture fingerprint features, URL word vector features, host information features, and URL information features. The effects of the different features on the test results are examined experimentally later in the paper. A URL example is shown in Fig. 1.

Relocation features.
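The boxing procedure can be sketched in Python as follows. This is only one possible reading of the description above, not the authors' implementation: it assumes that the ranked units are individual URL characters, that each box stores how often its character occurs in the URL, and that ties in frequency are broken by character order. The names `build_rank_table` and `relocation_feature` are hypothetical.

```python
from collections import Counter

NUM_BOXES = 229  # number of boxes used in the example above


def build_rank_table(urls, num_boxes=NUM_BOXES):
    """Map each character to a box index, ordered by corpus-wide frequency."""
    counts = Counter(ch for url in urls for ch in url)
    # Most frequent character first; ties broken by character order (assumption).
    ranked = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return {ch: index for index, ch in enumerate(ranked[:num_boxes])}


def relocation_feature(url, rank_table, num_boxes=NUM_BOXES):
    """Place the characters of one URL into their boxes; empty boxes stay 0."""
    boxes = [0] * num_boxes
    for ch in url:
        index = rank_table.get(ch)
        if index is not None:  # characters outside the table are skipped
            boxes[index] += 1
    return boxes


# Toy corpus, for illustration only.
corpus = ["aab.com", "ab.org"]
table = build_rank_table(corpus)
vector = relocation_feature("ab.com", table)
print(len(vector), vector[:8])
```

Every URL is mapped to a vector of the same length, and the box order is fixed by the corpus-wide ranking, which matches the requirement that the order of the boxes is not disturbed.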
Texture fingerprint features
Based on the observation that malicious pages of the same kind have similar textures, this paper proposes a method for feature extraction and detection of malicious web pages based on texture fingerprints. By combining image processing technology with malicious web page detection technology, malicious web pages are mapped to uncompressed grayscale images. The binary content of the domain name of a phishing web page URL is obtained and converted into 8-bit unsigned integers, whose range corresponds to the range of gray values of a grayscale image; this yields a two-dimensional spatial-domain texture fingerprint feature. To improve the effectiveness of the texture fingerprint feature, the CATTB parallel joint algorithm model is used to construct the deep network model, the pre-processed texture features are used to train the network layer by layer, and the accuracy of malicious web page detection is obtained. The overall flow of malicious web page detection based on texture fingerprint features is shown in Fig. 2.

Malicious webpages detection based on texture fingerprint feature.
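The mapping from binary content to gray values can be illustrated with a short Python sketch. It is a minimal example under assumptions: the image width of 16 pixels, the zero padding of the last row, and the function name `bytes_to_grayscale` are illustrative choices that the paper does not specify.

```python
def bytes_to_grayscale(data: bytes, width: int = 16) -> list:
    """Lay out a byte string as rows of 8-bit gray values (0-255)."""
    if width <= 0:
        raise ValueError("width must be positive")
    rows = max(1, -(-len(data) // width))  # ceiling division, at least one row
    padded = list(data) + [0] * (rows * width - len(data))  # zero-pad last row
    return [padded[r * width:(r + 1) * width] for r in range(rows)]


# Hypothetical domain name used only to show the layout.
domain = "example.com"
image = bytes_to_grayscale(domain.encode("utf-8"), width=4)
for row in image:
    print(row)
```

Each byte already lies in the range of an 8-bit unsigned integer, so it can be used directly as a gray value; the resulting two-dimensional array is what a convolutional model would consume as a texture image.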
URL word vector features
The concept of the word vector was first proposed by Hinton et al. [31] and is commonly referred to as “Word Representation”, “Word Embedding”, or “Distributed Representation”. Compared with the traditional one-hot representation, representing words with dense, low-dimensional, real-valued word vectors effectively avoids the curse of dimensionality. Because context information is needed to construct word vectors, word vectors usually carry rich semantic and context information [32, 33]. Word vectors belong to a set of language modeling and feature learning techniques in natural language processing [34–36]. URLs also need to be processed: each URL is abstracted into a matrix or vector, and model training is performed on the processed data. The purpose of this processing is to convert the URL into a computer-friendly data format. In this paper, the word vector feature is obtained by training the word vector model word2vec. Operations on URLs are thereby reduced to operations on vectors in an N-dimensional space, and the cosine similarity between word vectors indicates the degree of correlation between the corresponding parts of the URL. The word vector feature is rich in semantic and contextual information and is a good abstract feature of the URL. Therefore, this paper combines the relocation feature, URL word vector features, texture fingerprint features, host information features, and URL information features to improve the accuracy of malicious web page detection.
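The two steps that surround word2vec training, splitting a URL into tokens and comparing the resulting vectors by cosine similarity, can be sketched as follows. The tokenization rule, the function names, and the three-dimensional toy vectors are assumptions made for illustration; the toy vectors merely stand in for embeddings that a trained word2vec model would provide.

```python
import math
import re


def tokenize_url(url):
    """Split a URL into lowercase alphanumeric tokens (one simple choice)."""
    return [tok for tok in re.split(r"[^A-Za-z0-9]+", url.lower()) if tok]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors standing in for trained word2vec embeddings.
toy_vectors = {
    "login": [0.9, 0.1, 0.0],
    "signin": [0.8, 0.2, 0.1],
    "images": [0.0, 0.1, 0.9],
}

print(tokenize_url("http://example.com/login?id=1"))
print(cosine_similarity(toy_vectors["login"], toy_vectors["signin"]))
print(cosine_similarity(toy_vectors["login"], toy_vectors["images"]))
```

Tokens with similar vectors receive a cosine similarity close to 1, which is the sense in which the similarity indicates the degree of correlation.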
Host information features
Host-based features are obtained from the hostname attribute of the URL. They describe the location of malicious hosts, their identity, and their management style and attributes. This paper extracts 20 host information features, as shown in Table 1.
Host information features
URL information features
Using the raw URL string directly is not feasible from a machine learning perspective, so the URL string must be processed to extract useful features. The URL features extracted in this paper are based on attributes of the URL text, including the position of the first dot, the position of the second dot, the position of the third dot, the number of other characters, and so on. In total, 21 URL information features are extracted, as shown in Table 2.
URL information features
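A few text-based features of this kind can be computed with a short Python sketch. Only the dot positions and the count of other characters come from the description above; the remaining entries, the feature names, and the use of -1 for a missing dot are illustrative assumptions, and the full set of 21 features is the one listed in Table 2.

```python
def url_text_features(url):
    """Compute a handful of simple features from the URL string itself."""
    dots = [i for i, ch in enumerate(url) if ch == "."]

    def dot_position(k):
        # -1 marks a dot that does not exist (convention chosen here).
        return dots[k] if k < len(dots) else -1

    return {
        "first_dot_pos": dot_position(0),
        "second_dot_pos": dot_position(1),
        "third_dot_pos": dot_position(2),
        "length": len(url),
        "num_digits": sum(ch.isdigit() for ch in url),
        # Characters that are neither letters, digits, nor dots.
        "num_other_chars": sum(not ch.isalnum() and ch != "." for ch in url),
    }


print(url_text_features("http://a.b.com/x1"))
```

Returning a dictionary keeps the feature names attached to the values until the features are converted to a numeric vector.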
The extracted data must be converted into a suitable format (e.g., numeric vectors) so that they can be fed into off-the-shelf machine learning methods for model training. The ability of these features to provide relevant information is crucial for the subsequent learning, because the basic assumption of a machine learning (classification) model is that the feature representations of malicious and benign URLs have different distributions. The quality of the feature representation of a URL therefore plays a crucial role in the quality of the resulting malicious URL prediction model, and this also holds for deep learning. For this reason, all of these features are incorporated in this paper.
To improve the accuracy of analysis and detection, the relocation feature, texture fingerprint features, URL word vector features, host information features, and URL information features are fused, as shown in Table 3.
Feature fusion
These five features are extracted from different angles and with different methods, which greatly improves the accuracy of malicious web page detection. Their respective influences are illustrated with experimental data in Section 4.4.1. Through the fusion described above, all the features are joined together to form the fused feature. The fused feature is then fed into the CATTB parallel joint algorithm model, which is trained to obtain the accuracy of malicious web page classification.
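A minimal way to picture the fusion step is to concatenate the five per-URL feature vectors into one flat vector. The sketch below assumes simple concatenation; the function name `fuse_features` and all numeric values are placeholders, not data from the experiments.

```python
def fuse_features(*feature_groups):
    """Concatenate several per-URL feature vectors into one flat vector."""
    fused = []
    for group in feature_groups:
        fused.extend(float(value) for value in group)
    return fused


# Toy values for one URL; real vectors would be much longer.
relocation = [1, 0, 2]
texture = [101, 120]
word_vector = [0.12, -0.40]
host_info = [1, 0]
url_info = [8, 10, -1]

fused = fuse_features(relocation, texture, word_vector, host_info, url_info)
print(len(fused), fused)
```

In practice the feature groups would usually be scaled to comparable ranges before concatenation, since raw byte values and embedding components differ by orders of magnitude.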
First, the attention mechanism and the convolutional neural network are connected in parallel to form the CATT algorithm; the CATT algorithm is then connected in parallel with the bidirectional independently recurrent neural network to form what is referred to as the CATTB algorithm. A convolutional network, also known as a CNN, is a neural network dedicated to processing data with a grid-like structure. It consists of convolutional layers, pooling layers, and fully connected layers. The convolutional and pooling layers together form multiple convolution groups that extract features layer by layer, and classification is finally completed by several fully connected layers. A CNN distinguishes features by convolution, reduces the order of magnitude of the network parameters through weight sharing and pooling, and finally performs tasks such as classification with a traditional neural network. First, the input layer receives $M \in \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where $(x_1, x_2, \ldots, x_n)$ are the input features and $y_n \in \{0, 1\}$ is the label of the URL. The calculation formula for the convolutional layer is:
where $l$ is the layer index, $M_j$ is an input feature, $H$ is a given neuron, and $b$ is a bias vector.
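For reference, a commonly used form of the convolutional-layer computation that is consistent with these symbol definitions is given below. It is a sketch under assumptions, not necessarily the exact equation of this paper: the kernel $w_{ij}^{l}$, the activation function $f$, and the reading of $M_j$ as the set of input features connected to neuron $j$ are introduced here for illustration.

```latex
H_j^{l} = f\Big( \sum_{i \in M_j} H_i^{l-1} * w_{ij}^{l} + b_j^{l} \Big)
```

Here $H_j^{l}$ is the output of neuron (feature map) $j$ in layer $l$, $*$ denotes convolution, and $b_j^{l}$ is the corresponding bias.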
The pooling layer reduces the size of the input matrix and the feature dimension, which speeds up the calculation and effectively helps to prevent over-fitting. The calculation formula for the pooling layer is:
where $l$ is the layer index, $M_j$ is an input feature, $X$ is a given neuron, $m$ is the pooling window size, and $b$ is a bias vector.
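The interplay of a convolutional layer and a pooling layer with window size $m$ can be illustrated with a small one-dimensional Python sketch. It assumes a single kernel, ReLU activation, no padding, and max pooling; the paper does not state which pooling operator is used, so max pooling is an assumption, as are the function names and the toy values.

```python
def conv1d_valid(signal, kernel, bias=0.0):
    """Slide the kernel over the signal (no padding) and apply ReLU.

    As in most deep learning libraries, the kernel is not flipped,
    so this is strictly a cross-correlation.
    """
    k = len(kernel)
    out = []
    for start in range(len(signal) - k + 1):
        window = signal[start:start + k]
        value = sum(w * x for w, x in zip(kernel, window)) + bias
        out.append(max(0.0, value))  # ReLU activation
    return out


def max_pool1d(values, m=2):
    """Keep the maximum of each non-overlapping window of size m."""
    return [max(values[i:i + m]) for i in range(0, len(values) - m + 1, m)]


features = [1.0, 3.0, 2.0, 5.0, 4.0]
conv_out = conv1d_valid(features, kernel=[-1.0, 1.0])
pooled = max_pool1d(conv_out, m=2)
print(conv_out, pooled)
```

With a window size of 2 the pooled output is half as long as the convolution output, which is the reduction in feature dimension described above.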
A CNN has no association between one input and the next, so all of its outputs are independent. The LSTM model, in contrast, takes full account of the relationship between sequence contexts and long-distance information, and shows a powerful ability to handle text sequence data. However, it encodes the input sequence into a fixed-length vector regardless of the sequence length, which affects decoding. The attention mechanism retains the advantages of LSTM and makes up for this disadvantage. First, the input layer receives $M \in \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where $(x_1, x_2, \ldots, x_n)$ are the input features and $y_n \in \{0, 1\}$ is the label of the URL, and the attention weights are then calculated. The formula is:
where $M_i$ is the input feature and $k_i$ is the computed attention weight of the $i$-th feature. The attention weights are used to emphasize the hidden representation $H$, which gives the output vector $G$ of the attention mechanism. The formula is as follows:
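As an illustration, a widely used softmax form of attention can be sketched in Python. This is an assumption about the general shape of the computation, not necessarily the exact equations of this paper: the weights $k_i$ are obtained by a softmax over scores, and the output $G$ is the weighted sum of the inputs $M_i$. The dot-product scoring against a `query` vector and the function names are hypothetical choices.

```python
import math


def softmax(scores):
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(inputs, query):
    """Weight each input vector M_i by k_i and sum them into G."""
    # Score each feature by its dot product with the query (assumed scoring rule).
    scores = [sum(m * q for m, q in zip(vec, query)) for vec in inputs]
    weights = softmax(scores)  # k_i: non-negative and summing to 1
    dim = len(inputs[0])
    output = [sum(k * vec[d] for k, vec in zip(weights, inputs)) for d in range(dim)]
    return weights, output


M = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k, G = attention_pool(M, query=[2.0, 0.0])
print(k, G)
```

Inputs that score higher against the query receive larger weights and therefore contribute more to the output vector.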
At this point, the CNN and the attention mechanism are combined in parallel.
A conventional recurrent neural network can process sequences of fewer than 1000 steps, whereas IndRNN can process long sequences well and preserve long-term memory. IndRNN can also make good use of non-saturating activation functions such as ReLU, and it is very robust after training. Its vanishing and exploding gradient problems can be effectively addressed by regulating gradient back-propagation through time. Its formula is:
The recurrent weight $u$ is a vector and $\odot$ denotes the Hadamard product. Each neuron in a layer is not connected to the other neurons of that layer; connections between neurons are obtained by stacking two or more IndRNN layers. For the $n$-th neuron, the hidden state $h_{n,t}$ is obtained by the following formula:
where $W_n$ and $U_n$ denote the $n$-th row of the input weight and of the recurrent weight, respectively.
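For reference, the IndRNN update as formulated in the original IndRNN work [4] can be written as follows. The symbol $x_t$ for the input at time step $t$ (the feature input in this paper's notation), $\sigma$ for the activation function (e.g. ReLU), and $b$ for the bias are notational assumptions made here.

```latex
h_t = \sigma\left( W x_t + u \odot h_{t-1} + b \right)

h_{n,t} = \sigma\left( W_n x_t + U_n h_{n,t-1} + b_n \right)
```

The first line is the layer-wise update with the Hadamard product, and the second is the same update for the $n$-th neuron alone, where $U_n$ is the $n$-th entry of the recurrent weight vector $u$. Because each neuron depends only on its own previous state, the recurrence is independent across neurons within a layer.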
The basic IndRNN structure is shown in Fig. 3.

Schematic diagram of IndRNN structure.
Here, “Weight” and “Recurrent+ReLU” denote the processing of the input at each time step, and ReLU is the activation function.
The basic idea of BiIndRNN is to present each sequence forward and backward to two separate hidden states in order to capture past and future information. The forward IndRNN (IndRNNForward, IndRNNF) processes the sequence from front to back, and the backward IndRNN (IndRNNBackward, IndRNNB) processes it from back to front. The two hidden states are then joined to form the final output, which provides more comprehensive information. The BiIndRNN formulas (7), (8), and (9) are as follows:
where $F_t$ is the output of IndRNNF, $F_t'$ is the output of IndRNNB, $L_t$ is the final output formed by joining the two hidden states, $M_1, \ldots, M_{n-1}, M_n, M_{n+1}, \ldots$ are the feature inputs, and an MLP is used for feature fusion. The bidirectional independently recurrent neural network is shown in Fig. 4, and the formula terms are explained in Table 4.

BiIndRNN schematic.
Formula terms
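The forward and backward passes can be sketched in plain Python as follows. This is a minimal illustration under assumptions: one IndRNN layer per direction, ReLU activation, zero initial hidden state, and simple concatenation of the two hidden states at each time step in place of the MLP fusion mentioned above. The function names and parameter values are hypothetical.

```python
def relu(x):
    return x if x > 0.0 else 0.0


def indrnn_pass(inputs, W, u, b):
    """Run one IndRNN layer over a sequence and return every hidden state."""
    hidden = [0.0] * len(u)
    states = []
    for x in inputs:
        hidden = [
            # Neuron n sees only its own previous state (element-wise recurrence).
            relu(sum(w * xi for w, xi in zip(W[n], x)) + u[n] * hidden[n] + b[n])
            for n in range(len(u))
        ]
        states.append(hidden)
    return states


def biindrnn(inputs, forward_params, backward_params):
    """Concatenate forward and backward hidden states at each time step."""
    forward = indrnn_pass(inputs, *forward_params)
    # Process the reversed sequence, then restore the original time order.
    backward = indrnn_pass(inputs[::-1], *backward_params)[::-1]
    return [f + bwd for f, bwd in zip(forward, backward)]


params = ([[1.0]], [0.5], [0.0])  # W, u, b for a single neuron (toy values)
sequence = [[1.0], [2.0]]
print(biindrnn(sequence, params, params))
```

At every time step the output contains one part that summarizes the past and one part that summarizes the future, which is what allows the model to capture information from both directions.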
This paper combines the CNN, the attention mechanism, and the BiIndRNN algorithm model into the CATTB parallel joint algorithm model for malicious web page detection. The component models are expected to complement one another and thereby achieve better results in detecting malicious web pages. The structure of the model is shown in Fig. 5 below:

Schematic diagram of CATTB algorithm model structure.
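The difference between a parallel and a series combination of sub-models can be sketched in Python. This is a structural illustration only: the three branch functions are trivial placeholders standing in for the trained CNN, attention, and BiIndRNN branches, and the final linear layer with softmax uses made-up weights. All names are hypothetical.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def parallel(*branches):
    """Feed the same input to every branch and concatenate the outputs."""
    def combined(x):
        out = []
        for branch in branches:
            out.extend(branch(x))
        return out
    return combined


def serial(*branches):
    """Feed the output of each branch into the next one."""
    def combined(x):
        for branch in branches:
            x = branch(x)
        return x
    return combined


# Placeholder branches standing in for the trained sub-networks.
def cnn_branch(x):
    return [max(x)]


def attention_branch(x):
    return [sum(x) / len(x)]


def biindrnn_branch(x):
    return [x[0], x[-1]]


catt = parallel(cnn_branch, attention_branch)  # CNN in parallel with attention
cattb = parallel(catt, biindrnn_branch)        # CATT in parallel with BiIndRNN


def classify(x, weights):
    """Linear layer (one weight row per class) followed by softmax."""
    fused = cattb(x)
    logits = [sum(w * f for w, f in zip(row, fused)) for row in weights]
    return softmax(logits)


x = [0.2, 0.8, 0.5]
print(cattb(x))
print(classify(x, weights=[[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]))
```

In the parallel form every branch sees the original fused feature, whereas in the series form a later branch sees only what the earlier branch passes on, which is one way to understand why a series combination may miss some features.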
Dataset
In this paper, the experimental data come from the public dataset PhishTank and from URLs collected by a crawler. A total of 13,652 phishing web page URLs were obtained from PhishTank. Using domain knowledge and related expertise, the crawler collected 10,000 legitimate web page URLs.
Experimental environment
We describe the experimental environment to improve the reproducibility of this paper. The experimental environment is shown in Table 5.
Experimental environment
This paper uses a unified statistical indicator to evaluate the performance of the models: the total prediction accuracy (Accuracy, Q), defined in formula (1):
$$Q = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$
In equation (1), TP represents the number of true positives, FP represents the number of false positives, TN represents the number of true negatives, and FN represents the number of false negatives.
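The accuracy measure can be computed directly from the four confusion-matrix counts, as in the following short Python sketch. The function name and the example counts are illustrative and are not taken from the experiments.

```python
def accuracy(tp, fp, tn, fn):
    """Total prediction accuracy: correct predictions over all predictions."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


# Toy counts: 90 of 100 URLs classified correctly.
print(accuracy(tp=50, fp=5, tn=40, fn=5))
```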
The setting of model parameters plays a crucial role in the effectiveness of malicious web page detection. The optimal parameters of this paper are set as shown in Table 6 below:
Optimal parameter setting
Effect of texture fingerprints on experimental results
In the table, filters is the number of convolution kernels in the first layer, filter_size is the size of the first-layer convolution kernels, Pool_size is the size of the pooling layer, filters2 is the number of convolution kernels in the second layer, filter_size2 is the size of the second-layer convolution kernels, batch_size is the batch size, and ep is the optimal number of training iterations for the model.
In this section, the experimental results are analyzed from several angles: the influence of different features on the experimental results, the comparison between the CATTB parallel joint algorithm and the CATTB series joint algorithm, the influence of the pooling layer size, the comparison with other researchers’ methods, and the experimental results of the six comparison algorithm models.
The effect of different characteristics on experimental results
In this section, experiments are carried out under different data volumes by removing the relocation feature, the texture fingerprint feature, or the URL word vector feature in turn, and the results are compared with the experiment that uses all the features. The accuracy of malicious web page detection is used to illustrate the importance of the different features. The processing time in the table below is measured with a data volume of 10,000 and 1 epoch.
The data in this table show that without the relocation feature, the highest accuracy obtained in the experiment is 98.17%; without the texture fingerprint feature, it is 97.82%; and without the URL word vector feature, it is 98.35%. When the relocation feature, the texture fingerprint feature, and the URL word vector feature are all included, the highest accuracy obtained is 98.8%, and accuracy is higher at every data volume, which demonstrates the validity of these features. The relocation feature, the texture fingerprint feature, and the URL word vector feature therefore each contribute to the accuracy of malicious web page classification.
CATTB parallel and series
To achieve better malicious web page detection results, more and more researchers choose series (tandem) joint algorithms when selecting models. Experimental results show that series joint algorithms significantly improve the accuracy of malicious web page detection compared with traditional single algorithm models. However, a series joint algorithm cannot fully learn all the key features. This paper therefore proposes the CATTB parallel joint algorithm. The processing time in the table below is measured with 1 epoch. In this section, the CATTB parallel joint algorithm is compared with the CATTB series joint algorithm from multiple angles.
Table 8 shows that, with the same features, experimental environment, parameters, and data volume, the CATTB series joint algorithm reaches an accuracy of at most 97.5% in malicious web page detection, whereas the CATTB parallel joint algorithm reaches 98.8%. Compared with the series algorithm, the parallel algorithm thus significantly improves the effectiveness of malicious web page detection. The reason is that the series algorithm cannot fully learn all the key features, whereas the parallel algorithm can.
Comparison of the CATTB parallel and series joint algorithms
The parameters have a critical influence on the experiment, but the degree of influence differs significantly between them. The optimal parameters for this paper are listed in Section 3.3. This section takes the pooling layer size as an example to show its effect on the experimental results and to explain how the optimal parameters were set, as shown in Table 9 below:
The size and impact of the pooling layer
Table 9 shows that, with the other parameters fixed at their optimal values, accuracy is highest for every data volume when the pooling layer size is 2. The table also shows that the pooling layer size has a great influence on accuracy. The best result, an accuracy of 98.8%, is obtained with a data volume of 10,000 and a pooling layer size of 2. The worst result, an accuracy of 97.4%, is obtained with a data volume of 10,000 and a pooling layer size of 3, so choosing the optimal size improves the result by 1.4 percentage points. The setting of the pooling layer size is therefore crucial.
In this section, the proposed method is compared with the methods of other researchers who have used the PhishTank dataset in recent years. On the same dataset, the differences in the extracted features are observed, the research methods of the other researchers are analyzed, and their malicious web page detection results are compared with those of this paper, as shown in Table 10 below:
Comparison with other methods
Liu G et al. [37] extracted Keyword, Rank, and Hyperlink features and used the DBSCAN clustering method to determine whether a given web page belongs to a cluster of attacked web pages, with an accuracy of 91.44%. Fatt et al. [38] proposed a method based on website icons to reveal the hidden identity of websites. Lexical features, host-based features, and Hyperlink features are used to study the URL, with an accuracy of up to 94.11%. Dewan et al. [39] proposed a broad feature set based on entity profile, text content, metadata, and URL features to identify malicious content on Facebook in real time and at zero hour, with an accuracy of 86.9%. Jain et al. [40] protect users from phishing attacks by automatically updating a whitelist of legitimate sites; the proposed method achieved an accuracy of 86.02%. Jain et al. [41] proposed a machine learning-based anti-phishing system that uses Uniform Resource Locator (URL) features to detect malicious web pages, with an accuracy of 91.28%. Compared with the above methods, the malicious web page detection proposed in this paper, which is based on the relocation feature, the URL word vector feature, the texture fingerprint feature, and static features, achieves a higher accuracy of 98.8%.
To verify that the proposed CATTB parallel joint algorithm model is better suited to the analysis and detection of malicious URLs, it is compared with the shallow machine learning models k-Nearest Neighbors (KNN), Gaussian naive Bayes (Gaussian NB), and random forest (rf), and with the deep learning models Attention, CNN, and BiIndRNN. Each model was tested under its own optimal parameters to ensure the validity of the experimental results. The processing time in the table below is measured with a data volume of 10,000 and 1 epoch. The detection accuracy is shown in Table 11.
Effect of different algorithm models on experimental results
The results of the above sets of experiments show that a model that combines the three models, so that their strengths and weaknesses complement one another, can effectively extract the intrinsic features of the URL and thereby improve the accuracy of malicious web page detection. With the same features, experimental environment, and parameters, the highest accuracy over the different data volumes is 98.2% for the rf algorithm, 86.52% for the KNN algorithm, 92.65% for Gaussian NB, 66.8% for the attention mechanism algorithm, 97.89% for the CNN algorithm, 96.98% for the BiIndRNN algorithm, and 98.8% for the CATTB parallel joint algorithm model. Regardless of the data volume, the CATTB parallel joint algorithm model achieves a higher accuracy of malicious web page detection than the random forest, KNN, Gaussian NB, Attention, CNN, and BiIndRNN models. The CATTB parallel joint algorithm model can therefore improve the accuracy of malicious web page detection, which demonstrates its effectiveness.
In contrast to previous research, this paper adds the relocation feature to the host information and URL information features of malicious web pages. It also uses the word vector model word2vec to train URL word vector features and extracts the “texture fingerprint” feature of malicious web pages. To address the inability of the series joint algorithm to learn all the key features, the CATTB parallel joint algorithm model is proposed and applied to malicious web page detection. In the same experimental environment and with the same parameters, experiments compare the features, the parallel and series joint models, the pooling layer sizes, the methods of other researchers, and other algorithm models. The results show that the CATTB parallel joint algorithm proposed in this paper significantly improves the accuracy of malicious web page detection. Because the parallel algorithm involves a large amount of computation, however, its running time is longer than that of the series algorithm. In the future, we expect to find ways to optimize this.
Acknowledgments
We would like to thank all the participants in our study who provided useful and detailed feedback. We also thank our tutors and our team for their support of this research.
This work is partially supported by the Science and Technology Talent Training Project of Xinjiang Uygur Autonomous Region (QN2016YX0051), the Scientific Research Innovation Project of Education Innovation Plan for Graduate Students in Xinjiang Uygur Autonomous Region (XJGRI2017007), the Cernet Next Generation Internet Technology Innovation Project (NGII20170420, NGII20190412), and the Tianshan Youth Program (2017Q011).
