Abstract
In the past few years, neural networks have gained widespread use in network security analysis. Network traffic data is typically nonlinear and highly correlated, and because of the immense volume of traffic, current models are prone to false alarms and poor detection rates. Deep-learning models can help security researchers identify and extract the data features that are related to an attack. They can also reduce the data’s dimensionality and detect intrusions. Unfortunately, the network structure and the number of hidden neurons of a deep-learning model are usually set by error-prone procedures. To improve the performance of deep learning models, this paper proposes a new algorithm, called Spark-DBN-SVM-GBR, that combines gradient boosting regression (GBR) with particle swarm optimization (PSO). Simulations of the proposed algorithm showed a better accuracy rate than other deep learning models, and experiments on the PSO-GBR algorithm showed that it performed better than the current optimization technique when detecting unauthorized attack activities.
Keywords
Introduction
One of the most critical issues that security management systems have to address is the protection of big data [1]. Due to the rapid growth of the Internet and the increasing number of sources of data, hackers have become more capable of launching attacks. This has led to the development of new tools and techniques that allow them to perform illegal activities [2]. Researchers are constantly developing new techniques and tools that can help them detect and prevent attacks before they can take over a network. One of the most important systems that is used in this field is intrusion detection. This type of security system is designed to monitor the activities of an attacker and prevent them from taking over a network [3].
An IDS is a software or hardware monitor that can analyze and detect attacks on a network. Big data analysis is often the reason why IDS techniques are not as efficient as they should be when dealing with security threats [4]. The complexity of the process involved in analyzing the data makes it difficult for the system to identify and prevent attacks. Big data analysis can be performed in a more efficient manner by using techniques and tools. This can help reduce training time and computation [5].
An IDS can use several methods to detect attacks. One is signature-based detection, which identifies known attacks by looking for their signatures. This method is useful for attacks that are already in the signature database, but it cannot detect new types of attacks because their signatures are not yet present [6]. For this reason, anomaly-based detection is used to compare the current activities of users against predefined profiles. Anomaly-based detection is usually effective against zero-day or otherwise unknown attacks, but it has high false positive rates. A third type of intrusion detection is hybrid-based, which combines two or more techniques to obtain the advantages of each [7].
When analyzing the data collected within a network, an IDS should take into account the nature of the information, because the data collected and stored in a network can be very complex. Big data is a vast amount of data that can be gathered and stored in the network. Due to the increasing number of computers and the continuous changes in the distribution of network data, it is very challenging to identify abnormal behaviors in the data. The various formats and structures of network data make it difficult to identify its nature [8]. Big data is also beneficial for analyzing network patterns, since it can provide insight into the activities of the network. Given the complexity of the data, it is very important that an IDS uses big data techniques to find out what is happening in the network. Unfortunately, most studies on the use of big data in analyzing network data have not examined the complexity of the data [9].
Many researchers are working on machine learning algorithms that can reduce the false positive rate and improve the accuracy of IDS. Unfortunately, running these techniques in an IDS takes a long time because of the complexity of the data. With the help of Big Data techniques, machine learning can overcome various computational time and speed issues. Because few deep learning frameworks work with big data analytics tools, the development of a deep learning framework for Apache Spark was very important [10]. This paper introduces the techniques used to build Spark-based Big Data systems for IDS. These techniques can help reduce the time taken by the classification process, as well as the computational time of model development. In this research, we will introduce a new IDS classification method called
Literature survey
Proposed intrusion detection based on DBN-SVM-GBR
When it comes to accessing a computer’s network, hackers consider the status of their network. This information could be used to carry out harmful attacks. On the other hand, networks and hosts have to deal with a huge amount of data. This is why it is important that they keep track of all the details that they collect. It is also important that hosts and networks implement effective measures to prevent unauthorized access. Big data techniques can also be utilized to develop effective systems for detecting intrusions. One of the crucial factors that must be considered is the availability of such information. In this study, we present a framework that combines the optimization and deep learning capabilities. The main advantage of this system is its ability to analyze the hidden layers of a network. In addition, it can improve the structure of a network by implementing a PSO-GBR and SVM algorithm. Big data techniques could also help in reducing the false alarms that the system produces. Deep learning systems can also benefit from this approach as it can improve their performance. Currently, these systems have a low throughput due to their high complexity. With the help of big data, they can achieve a faster execution rate.

Spark-DBN-SVM-GBR architecture for Intrusion Detection.
This paper presents a design-based method that can help reduce the false alarms that a system produces. The algorithm can also detect complex features and relationships between attacks. Deep learning models have low execution speed because of their complexity. Through the use of big data techniques and the Spark processing engine, we were able to increase the speed of model training and data processing, and the BigDL library was used to improve the performance of the deep learning components. This allows us to quickly detect attacks.
The KDD99 dataset, which is over 19 years old, is still widely used by academic researchers. The training data was processed into approximately five million connection records. The dataset has 41 features and 22 attack types, which fall into four categories. Denial of service attack (DoS): The goal of a DoS attack is to prevent a network or machine from being accessed by its intended users. It usually involves flooding a targeted system with requests so that it cannot process legitimate traffic. User to root attack (U2R): The attacker starts with access to a normal user account and exploits a vulnerability to gain unauthorized root access to the system. Remote to local attack (R2L): A remote attacker gains local access to a computer or system after finding a vulnerability in its software.
The primary goal of this attack is to steal or obtain data illegally, infect the victim with viruses, or cause damage. Probing Attack: The initial step in an attack is referred to as scanning or discovery. It involves gathering information about a targeted system. During this process, a probe is performed to identify known vulnerabilities in software or in the custom code used for the targeted application.
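As an illustration of the four categories described above, the following sketch maps the raw attack labels of the KDD Cup 99 training set to DoS, Probe, R2L, and U2R. The label names follow the public dataset; the helper name `categorize` and the "Unknown" fallback for labels that appear only in the test set are assumptions made for this example, not part of the proposed system.

```python
# Attack labels of the KDD Cup 99 training set, grouped by category.
ATTACK_CATEGORY = {
    # Denial of service
    "back": "DoS", "land": "DoS", "neptune": "DoS",
    "pod": "DoS", "smurf": "DoS", "teardrop": "DoS",
    # Probing
    "ipsweep": "Probe", "nmap": "Probe", "portsweep": "Probe", "satan": "Probe",
    # Remote to local
    "ftp_write": "R2L", "guess_passwd": "R2L", "imap": "R2L", "multihop": "R2L",
    "phf": "R2L", "spy": "R2L", "warezclient": "R2L", "warezmaster": "R2L",
    # User to root
    "buffer_overflow": "U2R", "loadmodule": "U2R", "perl": "U2R", "rootkit": "U2R",
}


def categorize(label: str) -> str:
    """Map a raw label such as 'smurf.' to its attack category."""
    name = label.strip().rstrip(".").lower()
    if name == "normal":
        return "Normal"
    # Labels seen only in the test set fall through to "Unknown" (assumption).
    return ATTACK_CATEGORY.get(name, "Unknown")
```

For example, `categorize("smurf.")` yields the DoS category, while a label that is absent from the training set is reported as "Unknown" so that it can be handled separately.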
Proposed Intrusion detection DBN model along with PSO-SVM-GBR
A network is equipped with an intrusion detection system to identify and prevent unauthorized access to its computers. This technology can also help prevent data from being leaked or damaged. Due to its effectiveness, it has attracted increasing interest from both the private and public sectors, since it can help improve the security of an organization’s information infrastructure.
In 2006, Hinton and colleagues [17] proposed the notion of deep learning. Deep neural networks are more complex than shallow ones and provide a better representation of complex problems. They emphasize the importance of a multi-layer structure and of learned features. Shallow neural networks, on the other hand, are often limited in their classification abilities. In a deep network, the essential information in the original data is retained even as the dimensionality of the representation changes. Deep learning models such as the CNN and the RBM are commonly used for analyzing complex problems.
The hierarchical structure of the DBN model makes it easier to process the collected data. Training proceeds layer by layer: the lowest RBM is trained on the raw data, and the output features of each lower layer are then used as the input for training the next RBM. The values of each layer are thus acquired by training the RBMs bottom-up. This procedure is repeated with the goal of learning progressively higher-level features of the data.
DBN models undergo a two-phase training process. The first phase is called the pre-training, while the second phase is known as the fine tuning phase. The first stage involves the use of the unsupervised algorithm to train the various layers of RBM. In the diagram below, the algorithm flow shows the steps involved in the training of each layer. The last layer of the RBM network is trained, which produces an output as a feature of the algorithm. The BP method ensures that errors are adjusted and sent back to the correct place. If the training goal isn’t met, the network will be retrained. The training duration of networks is also increased if they’re continuously trained. The goal is to perform network-level hierarchical learning.
The RBM is a two-layer model with symmetric connections between the layers and no self-feedback within a layer. The visible layer represents the observed data, while the hidden layer can be regarded as a feature extractor. For convenience, it is assumed that the hidden and visible units are binary variables.
The state of the j-th hidden unit is denoted by h_j and that of the i-th visible unit by v_i.
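The equation that accompanied this definition did not survive in the text. For reference, the standard energy function of a binary RBM and the resulting conditional activation probabilities are given below. The symbols $a_i$ (visible bias), $b_j$ (hidden bias), and $w_{ij}$ (connection weight) are notation assumed here, and the paper's own formulation may differ in detail.

```latex
E(v, h) = -\sum_{i} a_i v_i \;-\; \sum_{j} b_j h_j \;-\; \sum_{i}\sum_{j} v_i \, w_{ij} \, h_j

P(h_j = 1 \mid v) = \sigma\!\Big(b_j + \sum_{i} v_i \, w_{ij}\Big),
\qquad
P(v_i = 1 \mid h) = \sigma\!\Big(a_i + \sum_{j} w_{ij} \, h_j\Big),
\qquad
\sigma(x) = \frac{1}{1 + e^{-x}}
```

Because there are no connections within a layer, the hidden units are conditionally independent given the visible units and vice versa, which is what makes the alternating sampling between the two layers tractable.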
The RBM’s iteration procedure is as follows. Present the data at the visible layer v1 and initialize the bias and weight values of the network. Using the weights, compute the activation probability of each hidden neuron to obtain the hidden state h1. The visible layer is then treated as unknown and reconstructed from h1, which gives v2. The biases and weights are adjusted according to the difference between v1 and v2. Taking v2 as given, h2 is computed in the same way, and v3 is then reconstructed and compared with v2 to obtain the next error signal. As the number of iterations increases, the system approaches a state of equilibrium and eventually converges.
The iterative process of training RBM is shown in the following steps. Each iteration updates the weights and offset values. After the training layer has been trained, the last RBM output is extracted and used as a feature. The goal of the program is to provide the best possible result.
Training an RBM with Gibbs sampling requires many sampling steps, which becomes costly when the data dimension is high. The contrastive divergence (CD) algorithm is therefore used to train the system one layer at a time. The parameters obtained in this way are then fine-tuned.
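The CD training step mentioned above can be sketched as follows, assuming that CD denotes contrastive divergence with a single Gibbs step (CD-1) on a binary RBM. The names `W`, `a`, `b`, and `lr` are illustrative, and the reconstruction uses probabilities rather than sampled states, which is a common simplification rather than a detail confirmed by the paper.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def cd1_step(v1, W, a, b, lr=0.1, rng=random):
    """One CD-1 update on a single sample.

    v1 is the visible vector, W[i][j] the weights, a the visible biases,
    and b the hidden biases. W, a, and b are updated in place; the squared
    reconstruction error is returned.
    """
    n_v, n_h = len(a), len(b)
    # Positive phase: hidden probabilities given the data, then a sample.
    ph1 = [sigmoid(b[j] + sum(v1[i] * W[i][j] for i in range(n_v)))
           for j in range(n_h)]
    h1 = [1 if rng.random() < p else 0 for p in ph1]
    # Negative phase: reconstruct the visible layer, then the hidden layer.
    v2 = [sigmoid(a[i] + sum(h1[j] * W[i][j] for j in range(n_h)))
          for i in range(n_v)]
    ph2 = [sigmoid(b[j] + sum(v2[i] * W[i][j] for i in range(n_v)))
           for j in range(n_h)]
    # Move the parameters along the difference between the two phases.
    for i in range(n_v):
        for j in range(n_h):
            W[i][j] += lr * (v1[i] * ph1[j] - v2[i] * ph2[j])
        a[i] += lr * (v1[i] - v2[i])
    for j in range(n_h):
        b[j] += lr * (ph1[j] - ph2[j])
    return sum((x - y) ** 2 for x, y in zip(v1, v2))
```

Calling `cd1_step` repeatedly over the training samples plays the role of the iteration in which the weights and offsets are updated; in a DBN, the hidden probabilities of one trained RBM would then serve as the visible input of the next.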
In the second phase of the DBN training program, the model is fine-tuned. One of the most important factors that can be considered when it comes to achieving the weight of the RBM feature is the mapping of its feature vector. The BP network is also responsible for carrying out the error transmission. The DBN network’s error function indicates that the system has changed its weights and biases. This method can affect its efforts to achieve the ideal state. The training algorithm for this project aims to modify the network’s parameters to achieve the best result.
The number of hidden layers, the number of nodes, and the number of model iterations determine the capacity of the DBN framework to represent data attributes. The best values of these parameters for classification are not known in advance, so the optimal number of hidden layers and nodes for the training data set is determined through several experiments. One of the most important factors in choosing the number of nodes for a given classification task is the number of hidden layers. A batch training method can improve the efficiency of the process: batches are randomly sampled from the collected data set and trained on, and the network weights are continuously updated.
Large amounts of network data are required for machine learning to perform well. On the other hand, standard models can train well on small, high-dimensional datasets. One such efficient model is the SVM, which offers good classification accuracy and fast running speed. The PSO algorithm is widely used to select the SVM’s parameters. Here it is used to analyze the features produced by the DBN and find the most suitable combination of parameters. The SVM classifier is then trained and performs the final classification, and its results are compared with those of other models.
The partition and optimization methods utilized in the PSO-SVM frameworks are designed to solve original problems while maintaining their characteristics. To avoid the emergence of random distribution of particles in the initial swarm, the algorithm’s initial position is adjusted. To optimize a given domain, divide it into equal parts and distribute the particles evenly across each subdomain. Then, through the SVM algorithm, solve the problem iteratively and get the best result.
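One possible reading of the partitioned initialization described above is sketched below: each dimension of the search domain is divided into equal slices, and every slice receives exactly one particle before a standard PSO loop runs. The function names, the inertia and acceleration constants, and the quadratic `toy_error` (a hypothetical stand-in for an SVM validation error over log-scaled parameters) are all assumptions made for illustration.

```python
import random


def partitioned_init(n_particles, bounds, rng):
    """Place one particle in each equal-width slice of every dimension."""
    columns = []
    for lo, hi in bounds:
        width = (hi - lo) / n_particles
        slots = list(range(n_particles))
        rng.shuffle(slots)  # each slice of this dimension is used exactly once
        columns.append([lo + (s + rng.random()) * width for s in slots])
    return [list(p) for p in zip(*columns)]


def pso(objective, bounds, n_particles=12, n_iter=60,
        w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `objective` over the box `bounds` with a basic PSO loop."""
    rng = random.Random(seed)
    pos = partitioned_init(n_particles, bounds, rng)
    vel = [[0.0] * len(bounds) for _ in pos]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda k: pbest_val[k])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(n_iter):
        for k in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                vel[k][d] = (w * vel[k][d]
                             + c1 * r1 * (pbest[k][d] - pos[k][d])
                             + c2 * r2 * (gbest[d] - pos[k][d]))
                # Keep the particle inside the search domain.
                pos[k][d] = min(hi, max(lo, pos[k][d] + vel[k][d]))
            val = objective(pos[k])
            if val < pbest_val[k]:
                pbest[k], pbest_val[k] = pos[k][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[k][:], val
    return gbest, gbest_val


def toy_error(params):
    # Hypothetical stand-in for SVM validation error over (log10 C, log10 gamma).
    log_c, log_gamma = params
    return (log_c - 1.0) ** 2 + (log_gamma + 2.0) ** 2
```

In an actual PSO-SVM pipeline, `toy_error` would be replaced by a function that trains an SVM with the candidate parameters and returns its validation error, which is far more expensive per evaluation.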
This study aims to analyze the potential of deep learning to be used to attack various computer models. Besides this, it shows that attackers can also modify the operation of the proposed neural network by selecting its parameters, which makes it possible to alter its output. This will help in developing a better understanding of how these attacks can affect the real world. Since the design and structure of neural networks are intricate, it is crucial for researchers. Although other examples have been presented, the majority focus on these kinds of networks.
Analyzing large datasets using a sequential approach avoids breaking the neural network. Instead, GBR adds small trees with high bias to the dataset in order to focus on the document that caused the error. It performs gradient descent on an instance of space X1.
Learning rates are computed by taking into consideration the number of times a feature has been exposed to it. The value of the negative gradient is then calculated by the prediction of regression trees.
The GBR algorithm takes into account the zero square loss condition in a given document and transforms it into its residual from its previous iteration. It is based on the standard CART algorithm. Another important factor that an algorithm should take into account when it comes to developing its algorithm is its learning rate, as it determines how many iterations it needs to complete the task.
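The residual-fitting idea behind gradient boosting with a squared loss can be sketched in a few lines. To stay short, this example fits one-split regression stumps on a single feature instead of full CART trees, so it is a simplified stand-in for the GBR component rather than the implementation used in the paper. The function names and default values are assumptions, and the input needs at least two distinct feature values.

```python
def fit_stump(xs, residuals):
    """Fit the best single-split stump (least squared error) on one feature."""
    best = None
    order = sorted(set(xs))
    for lo, hi in zip(order, order[1:]):
        thr = (lo + hi) / 2.0
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    _, thr, lm, rm = best
    return lambda x: lm if x <= thr else rm


def gradient_boost(xs, ys, n_trees=50, learning_rate=0.1):
    """Return a predictor built by repeatedly fitting stumps to residuals."""
    base = sum(ys) / len(ys)  # initial constant prediction
    trees = []
    preds = [base] * len(ys)
    for _ in range(n_trees):
        # For the squared loss, the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(t(x) for t in trees)
```

The learning rate scales the contribution of each tree: a smaller value needs more boosting rounds to reach the same training error, which is the trade-off between learning rate and iteration count noted above.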
There are three initial assumptions that are aimed at ensuring that deep learning models perform well. These include the ability to layer-by-layer process, feature transformation, and the sufficient complexity model. Deep learning techniques are typically processed by several layers. Since they lack the necessary features to make them more complex, they tend to be relatively easy to implement. The ensemble method, on the other hand, can make them more complex. This paper aims to analyze the advantages of implementing cascading structures and feature segmentation in deep learning. One of the main advantages of this approach is that it can improve the performance of the support vector machine. Another hypothesis suggests that the implementation of Spark-PSO could reduce the time it takes to detect and train models. The experiment involved converting character attributes into training models for different deep learning approaches. The results of the study can be found in the following sections.
Result and discussion
Dataset description
The data sets are from the KDD CUP 99 benchmark, which was used by the AIPRS to analyze intrusion detection systems in 1999. The 494,021 records in the dataset have both categorical and numerical attributes. The benchmark’s four attack types are Denial of Service, Probe, Remote to Local, and User to Root.
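Because the records mix symbolic and numerical attributes, the symbolic ones must be converted to numbers before they can be fed to a neural network. The sketch below uses one-hot encoding, which is an assumed choice since the text does not state the exact conversion scheme. The function name and the sample records, which are shortened to five fields, are illustrative only.

```python
def one_hot_encode(records, symbolic_columns):
    """Replace each symbolic column by one 0/1 column per observed value.

    records is a list of equal-length tuples, and symbolic_columns lists
    the indices of the non-numeric fields. Returns the encoded rows and
    the vocabulary that was used for each symbolic column.
    """
    vocab = {c: sorted({rec[c] for rec in records}) for c in symbolic_columns}
    encoded = []
    for rec in records:
        row = []
        for c, value in enumerate(rec):
            if c in vocab:
                row.extend(1.0 if value == v else 0.0 for v in vocab[c])
            else:
                row.append(float(value))
        encoded.append(row)
    return encoded, vocab


# Shortened illustrative records: duration, protocol, service, flag, bytes.
sample = [
    (0, "tcp", "http", "SF", 181),
    (0, "udp", "domain_u", "SF", 105),
    (2, "icmp", "ecr_i", "SF", 1032),
]
rows, vocab = one_hot_encode(sample, symbolic_columns=[1, 2, 3])
```

The vocabulary should be built from the training data and reused unchanged for the test data, so that both are encoded with the same columns.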
Algorithm implementation
The implementation pre-processes the 10% subset of the KDD CUP 99 training data and the corrected KDD CUP 99 test data. After the feature dimension is reduced, the proposed model is trained on the reduced features. The parameters of the training model are then adjusted to find the optimal ones, and the result is verified on the test data. In network training, the model is insufficiently trained if the number of training iterations is too small, and overfitting can occur if it is too large. The number of iterations was initially set at 200; after about a hundred iterations the cost of the network flattened out, so the number of iterations was set at 100. The learning rate, the initialization threshold, and the parameters of the gradient boost regression algorithm were also set.
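The choice of iteration count described above amounts to finding the point at which the training cost stops improving. A minimal sketch of such a check follows; the function name, the window length, and the tolerance are assumptions, and the cost values are made up for the example.

```python
def plateau_iteration(costs, window=10, tol=1e-3):
    """Return the first iteration at which the cost has improved by less
    than tol over the preceding window iterations, or None otherwise."""
    for t in range(window, len(costs)):
        if costs[t - window] - costs[t] < tol:
            return t
    return None


# Made-up cost curve that flattens after a few iterations.
costs = [10.0, 5.0, 2.0, 1.0, 1.0, 1.0, 1.0]
stop_at = plateau_iteration(costs, window=2, tol=0.1)
```

Capping training at the detected iteration avoids spending time on iterations that no longer reduce the cost and limits the risk of overfitting.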
Simulation results
Due to the importance of scientific papers in today’s society, organizations have to implement suitable security measures. An intrusion detection solution is utilized to identify and monitor anomalous activities. This system can be divided into two components: the control and surveillance domains. The latter is responsible for monitoring corporate networks and various applications. Figures 2 to 11 show the implementation results of the proposed framework.

Model architecture for Proposed DBN.
In this study, we present a deep learning model called Spark DBN-PSO, which is designed to classify various types of data. The architecture of the model is shown in Fig. 2. After fine-tuning, the model consists of five RBMs with (visible, hidden) node counts of (49, 128), (128, 256), (256, 128), (128, 128), and (128, 64), respectively. The output of the last RBM is connected to a layer of 6 nodes, which is used for multi-class classification.
Table 1. Model design and parameters
The table above shows the architecture of the DBN, which is implemented using five RBMs. After fine-tuning, each RBM has its own numbers of visible and hidden nodes. The output of the last RBM is then connected to a layer of 6 nodes that performs multi-class classification using an SVM instead of a softmax function.
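A quick consistency check of the reported layer sizes can be written as follows. It verifies that the hidden size of each RBM matches the visible size of the next one and counts the connection weights in the stack. The constant and function names are hypothetical, and biases and the final classification layer are left out of the count.

```python
# (visible, hidden) sizes of the five stacked RBMs as reported in the text.
RBM_SHAPES = [(49, 128), (128, 256), (256, 128), (128, 128), (128, 64)]
N_CLASSES = 6  # size of the final classification layer


def count_stack_weights(shapes):
    """Check that consecutive RBMs fit together and count their weights."""
    for (_, hidden), (visible, _) in zip(shapes, shapes[1:]):
        if hidden != visible:
            raise ValueError(
                f"hidden size {hidden} does not match next visible size {visible}"
            )
    return sum(v * h for v, h in shapes)


n_weights = count_stack_weights(RBM_SHAPES)
feature_dim = RBM_SHAPES[-1][1]  # dimension of the features passed to the SVM
```

With these sizes, the classifier receives a 64-dimensional feature vector for each connection record.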
Precision
Precision is a metric for how much of the test data that is flagged as an attack is truly from one of the attack types.
$$\text{Precision} = \frac{TP}{TP + FP}$$
where TP is the number of true positives and FP is the number of false positives.
Recall measures the percentage of attacks that are accurately detected.
$$\text{Recall} = \frac{TP}{TP + FN}$$
where TP is the number of true positives and FN is the number of false negatives.
The F-measure is a test accuracy metric that assesses the balance between precision and recall.
$$\text{F-measure} = \frac{2 \, P \, R}{P + R}$$
where P is the precision and R is the recall.
Accuracy is the ratio of accurately classified botnet attacks to the total number of botnet attacks.
$$\text{Accuracy} = \frac{I_{cB}}{T_B}$$
where $I_{cB}$ is the number of correctly identified botnet attacks and $T_B$ is the total number of botnet attacks. Table 2 shows the comparison analysis of different frameworks among different evaluation parameters.
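The four evaluation metrics defined above can be computed directly from the counts. The sketch below uses the standard definitions; the function names and the convention of returning 0.0 when a denominator is zero are assumptions, and the counts in the example are made up.

```python
def precision(tp, fp):
    """Share of flagged records that are true attacks."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of actual attacks that were detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def accuracy(correct, total):
    """Share of attacks that were classified correctly."""
    return correct / total if total else 0.0


# Made-up counts for illustration only.
p = precision(tp=90, fp=10)
r = recall(tp=90, fn=30)
f = f_measure(p, r)
```

For multi-class results such as the four attack categories, these quantities are usually computed per class, treating each class in turn as the positive one.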
Comparison Analysis with different Parameters
Table 3 shows the specific types of attacks. The data set contains 41 feature dimensions and 1 label dimension. It features four attack modes: DoS, R2L, U2R, and probe. The training set has 21 attack categories, while the test set has 18 that were not included in the training set. These new intrusion attacks can be used to test the algorithm’s ability to detect unauthorized access.
Specific types of attacks
To verify the performance on different data sets, we ran comparative experiments on the NSL-KDD data set and the KDD Cup 99 data set (Table 2). Since these two data sets are too large, we randomly selected 5,000 records. Of these, 70% were used as training data and 30% as test data, and the testing time and accuracy of the different algorithms on the two data sets were compared, as shown in Table 2. There is not much difference in testing time between the two data sets for the three algorithms (all were around 50 s). Compared with the other algorithms, the proposed algorithm has the highest average accuracy, which confirms its good performance.
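The sampling and splitting procedure described above can be sketched as follows. The function name, the fixed seed, and the use of integers in place of real connection records are assumptions made so that the example is reproducible.

```python
import random


def sample_and_split(records, n_sample=5000, train_fraction=0.7, seed=42):
    """Draw n_sample records at random and split them into train and test."""
    rng = random.Random(seed)
    subset = rng.sample(records, n_sample)  # sampling without replacement
    cut = int(round(n_sample * train_fraction))
    return subset[:cut], subset[cut:]


# Integers stand in for connection records in this illustration.
records = list(range(20000))
train, test = sample_and_split(records)
```

Because the subset is drawn without replacement, the training and test parts share no records, which keeps the reported test accuracy from being inflated by duplicates.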
Intrusion detection is an important aspect of medical security. The use of deep learning techniques is very effective in detecting unauthorized access to medical data. Through the use of unsupervised learning and supervised learning, the Spark-DBN-SVM-GBR can effectively perform intrusion detection tasks on large, complex, and nonlinear data sets. The ability of the proposed technique to extract high-dimensional feature vectors and classify them efficiently makes it an ideal tool for medical security. This can be utilized to enhance the network topology by reducing the number of hidden nodes. The results of the study show that the PSO-DBN algorithm can achieve a good accuracy rate, which is higher than the accuracy of other deep learning techniques such as. It can perform various tasks such as extracting high-dimensional feature vectors and performing intrusion detection. In the future, we will investigate the performance improvement of intrusion detection using the homogeneity metric, and look into the use of feature selection schemes that are more suitable for the environment. We also plan on analyzing the performance of distributed Spark processing with varying cluster counts.
