Abstract
Artificial intelligence methods have often been applied to carry out specific functions or tasks in the cyber-defense realm. However, as adversary methods become more complex and difficult to anticipate, piecemeal efforts to understand cyber-attacks, and malware-based attacks in particular, do not give malware analysts sufficient means to understand the past, present and future characteristics of malware. This is because most malware communication takes place over services that are completely anonymous, and monitoring such services is a hard task. To address this issue, this paper proposes a novel traffic analysis scheme using correlation methods (a non-parametric approach). Experiments are performed to validate the proposed approach on real-time traffic data collected over a period of one week. The experimental results confirm that the proposed method outperforms existing state-of-the-art traffic analysis schemes. The results also report the traffic classification performance, which is analyzed using the decade-old nearest neighbor method.

Introduction
Malware traffic can be of any kind that completely changes the functionality of a system. Network traffic includes sensitive traffic, which carries quality-dependent services such as gaming, browsing and social media data, and undesired traffic, such as spam generated by worms. In general, malware is malicious software that affects the system into which it is injected; Trojan horses, ransomware and spyware are prime examples. It is used by both private parties and governments for specific purposes, such as stealing user data. Infected systems are used for email spam and for distributing illegal content such as child pornography. Malware can also enter a system through a backdoor that bypasses all authentication procedures; once the system is compromised, any number of further backdoors can be installed, providing access that is invisible to the user. This type of malware can generate unnecessary traffic on a system. Modern malware is even stronger, as it evades anti-malware software and even IT engineers, and one hundred percent protection against malware cannot be provided. While that research continues, this work analyzes the traffic from which advanced malware can be predicted. Grayware is a term used to identify unwanted applications or malfunctioning units in a system. Many applications claim to protect the user's system from attacks, but they provide no support when the attack comes from advanced malware. The only way to prevent attacks entirely is to use applications without an Internet connection, which is not practical today. Every piece of information about a person matters, yet some users are not careful with their personal information. This paper also aims to raise awareness about malware. People leave hints about their psychological state in their online updates, and whoever injected the malware can use these hints against them.
A novel traffic analysis system is proposed to detect malware based on malicious traffic. The proposed system is capable of detecting both anomalies and intrusions based on the traffic. It is modeled with the Karl Pearson correlation method in order to increase detection accuracy, and profiling is used to characterize host and network features. The performance of the proposed system is evaluated on a real-time test bed and considers additional metrics that were not previously used in [4]. The system is designed to reduce the false positive rate and can be used with any security system. The remainder of this paper is organized as follows. Section 2 reviews the literature on malware and its classification techniques. Section 3 describes the proposed methodology. Section 4 describes the experimentation. Section 5 presents the results and performance analysis, and the final section concludes with the impact of the results and future work.
Literature survey
Yi et al. [1] presented a machine learning based approach to detect domain generation algorithm (DGA) attacks in a network. Traditional techniques such as blacklisting and port scanning are insufficient against such attacks. They monitored and processed threat data over a period of one year. Their model consists of two levels: one classifies DGA domains from normal domains, and the other applies a clustering algorithm to the DGA domains. The prediction model is based on time series and uses a hidden Markov model (HMM). They achieved an accuracy of 95.5% for classification and 97% for prediction. Hsu et al. [2] proposed a hybrid method for fast-flux domain detection. Fast-flux services are used as proxies for phishing websites. They used reverse DNS traffic to detect fast-flux services by combining real-time and long-term monitoring. The method achieved significantly higher accuracy than state-of-the-art fast-flux detection algorithms. The performance evaluation, carried out in their university laboratory, successfully identified fast-flux services.
Sajeev and Nair [3] presented a flow-based classifier to distinguish peer-to-peer (P2P) from non-P2P network traffic. P2P classification is essential for malware detection and network traffic management. Existing P2P classification methods are port-based, signature-based, pattern-based or statistical. The proposed flow-based method classifies whether Internet traffic is P2P or non-P2P, and its results are better than those of the traditional methods. Stevanovic and Pedersen [4] presented a traffic classification method for botnet detection. They proposed three traffic classification methods based on random forest classification. The performance was evaluated using a sample of 40 botnets and malicious applications. The experimental results showed that the proposed method is far better than traditional methods for classifying botnet traffic.
Lashkari et al. [5] demonstrated an Android malware detection method using machine learning classifiers. Existing Android malware detection methods report high detection rates and fast prediction on fixed datasets. Unfortunately, most fixed datasets are not suitable for real-world problems, so they created their own dataset, CICAndMal2017, which overcomes the disadvantages of the earlier fixed datasets. They offer 80 features for traffic classification. The experimental results showed 85% precision and 87% recall for the classifiers they used. Niu et al. [6] presented a heuristic statistical approach, called HST, to detect encrypted traffic used by malware with strong concealment. Existing methods for identifying encrypted malware traffic are either statistical or machine learning approaches, and each has its own shortcomings. HST combines both: small payloads are extracted and subjected to four randomness tests, and the results are fed to machine learning to improve performance. They also presented HST-R, a simple handshaking method that increases classification accuracy. Experiments were performed on test datasets containing various traffic patterns and cryptographic protocols. The results showed that HST-R outperforms traditional entropy-based, statistical, coding-based and ML-based approaches. A comparison between different algorithms showed that the simple handshaking method had higher accuracy for Secure Sockets Layer (SSL) and Secure Shell (SSH) traffic.
AlAhmadi and Martinovic [7] presented MalClassifier, a novel privacy-preserving system for automatic malware analysis and classification using network flow mining. It identifies the malware family and malicious activity over the network by observing the flow sequences and behaviors of malware. Through mining, it extracts the network flows and generates behavioral profiles, which are used as input features for malware classifiers. The performance evaluation was carried out on ransomware traffic, achieving an F-measure of 96% for family classification. Jiang et al. [8] demonstrated a comprehensive behavioral model for profiling new malware samples. They also proposed a malware classification method based on this behavioral profiling. The results were compared with state-of-the-art classification algorithms and showed better accuracy.
Lim et al. [9] presented a malware traffic classification method based on the similarity of network activity. Much malware needs network activity to communicate with a command-and-control server, and its payload can be used for signature matching. Much malware also relies on URLs for request communication. To classify malware, network activity and changes in network behavior must be observed. The most important factors in analyzing the behavioral pattern of malware are the sequence of flows and their correlations. They analyzed the sequences of flows generated by malware and presented a classification method based on the similarity of network activity. Yeo et al. [10] suggested a malware traffic classification system that automatically detects malware. The method uses a convolutional neural network and other machine learning algorithms with 35 features extracted from packet flows. The dataset was taken from the Stratosphere IPS project, which contains normal and malware packets. The method achieved better accuracy than traditional classification algorithms.
Rahul et al. [11] explored deep learning based classification on different malware datasets, benchmarking a model across datasets collected over several years. The proposed model is used to build an effective hybrid IDS to combat future cyberattacks. Because the behavioral patterns of malware change dynamically, the research community needs a dataset for building IDSs. Benchmarking was carried out for different datasets in a deep learning environment, with each dataset run for 1000 epochs and the parameters extracted across the datasets. This type of study helps to discover the best algorithm for detecting future cyberattacks. The features are extracted by a deep learning model that passes high-dimensional data through many hidden layers. Finally, they propose a highly scalable hybrid DNN framework called scale-hybrid-IDS-AlertNet, which can be used in real time to monitor network traffic and host-level events and to proactively alert on possible cyberattacks.
Proposed method – Correlation based method
The logs recorded on the server are taken into consideration, and the features listed in the MCFP dataset are extracted using a PowerShell agent running on the connected hosts. The extracted features are sent to the analysis module, and correlation probabilities are estimated for all the features. Profiles are created based on the deviation in the feature probabilities. In addition, the deviation threshold for each profile is estimated using the Pearson correlation.
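In its standard form, the Pearson correlation coefficient between two records A and B is
$$ r_{AB} = \frac{\sum_{i=1}^{n} (a_i - \bar{a})(b_i - \bar{b})}{\sqrt{\sum_{i=1}^{n} (a_i - \bar{a})^2}\,\sqrt{\sum_{i=1}^{n} (b_i - \bar{b})^2}} $$
where $a_i$ and $b_i$ are the values of the $i$-th feature in records A and B, $\bar{a}$ and $\bar{b}$ are the corresponding means, and $n$ is the number of features.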
Case 1: A sample profile generation
Percentage difference and correlation between the Records A and B
Consider two records, Normal (A) and Attack (B), with six features as listed in Table 2; a sample computation of the profile categorization follows. The Karl Pearson correlation used in the scheme for the records (A, B) is 0.159199126453477, which indicates a weak correlation. The bandwidth h is 156.48 for Profile A and 81.3 for Profile B, which confirms that Record B belongs to the category Profile 1. All other profiles are generated and categorized in the same way, based on their correlation. In addition, the deviation percentage is estimated for each profile feature in order to monitor significant changes in the features. Table 1 shows the statistics for the features DNS, SPAM, UDP flow, TCP flow, HTTPS flow and frame rate.
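The sample computation can be illustrated with a minimal sketch, assuming that the correlation is taken across the six feature values of the two records and that the percentage difference is measured relative to record A. Both are assumptions, since the text does not define them. The feature values below are hypothetical placeholders, not the values of Table 2, and the bandwidth h is not computed because its definition is not given.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length records."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("records must have the same length (at least 2)")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant record")
    return cov / math.sqrt(var_a * var_b)

def percent_difference(a, b):
    """Per-feature deviation of record b from record a, in percent of a.

    Assumption: the deviation is relative to a; None marks a zero baseline.
    """
    return [100.0 * (y - x) / x if x != 0 else None for x, y in zip(a, b)]

# Feature order follows the text; the numbers are hypothetical.
FEATURES = ["DNS", "SPAM", "UDP flow", "TCP flow", "HTTPS flow", "Frame rate"]
record_a = [12.0, 3.0, 40.0, 55.0, 30.0, 20.0]   # hypothetical "normal" record
record_b = [48.0, 30.0, 35.0, 60.0, 10.0, 90.0]  # hypothetical "attack" record

r = pearson(record_a, record_b)
print(f"correlation(A, B) = {r:.4f}")
for name, dev in zip(FEATURES, percent_difference(record_a, record_b)):
    print(f"{name}: {dev:+.1f}%")
```

A coefficient close to zero, as reported for records A and B in the text, means the attack record does not follow the feature pattern of the normal record.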
The proposed model is experimentally tested in real time using a secured sandbox, Cuckoo. The sandbox is modified to tap all the traffic generated from the machine. All collected traffic is stored as pcap files for further analysis and research. Three additional machines, each configured with a separate NAT, are connected to the sandbox so that all the machines form a network.
Data collection
Machines 1, 2 and 3, which run different operating systems, are used for data collection. Among all the machines, the sandbox machine acts as a promiscuous gateway for monitoring and redirecting all the traffic. The experimentation is carried out on both wired and wireless networks.
Client configuration
All client machines are configured with 4 GB RAM, an Intel i5 processor, an Atheros QCNFA335 WLAN Wi-Fi Bluetooth 4.0 NGFF wireless card (for wireless) and a Realtek PCIe Gigabit Ethernet adapter (for wired). All traffic is redirected to the sandbox machine for further analysis and monitoring. The sandbox machine emulates the entire infrastructure so that the malware it executes behaves as it would in a physical environment. This is necessary because of the sophistication incorporated in malware, such as execution environment identification.
Security operation centre (SoC) configuration
A dedicated server machine is configured with 20 GB RAM, an Intel Xeon processor and the Elasticsearch stack to store all the logs. Different Beats agents are installed in the sandbox for deep logging and analysis. The server machine is also used for data collection and acts as the SoC for monitoring the traffic. The SoC is configured with Kibana to visualize all real-time events in the network and on the hosts.
Sandbox configuration
Beats agents such as Metricbeat, Auditbeat, Heartbeat and Packetbeat are installed on the sandbox machine. An application-level software isolation program, Sandboxie, is also installed in the sandbox. Furthermore, the network adapters are placed in separate NATs so that variants such as worms, which propagate through the network, can be contained.
Dataset
Current IDS datasets such as KDD 99 and the DARPA series hold information only about old, conventional attacks. Hence, a new dataset is collected: a small Python script is executed to automate the agents and scan the network, and the TShark utility is used to collect all the traffic traces. The values are stored in comma-separated value (.csv) format with the class labels “normal” and “abnormal”.
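As an illustration of how such a labeled file might be consumed, the following minimal sketch parses records and summarizes each class by its per-feature mean. The column names (`dns`, `udp_flow`, `tcp_flow`, `label`) and the sample rows are hypothetical; the actual layout of the collected file is not given in the text.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample standing in for the collected .csv file.
SAMPLE_CSV = """dns,udp_flow,tcp_flow,label
10,40,55,normal
14,44,53,normal
60,35,90,abnormal
"""

def load_records(handle, label_column="label"):
    """Parse rows into (features, label) pairs; features map name -> float."""
    records = []
    for row in csv.DictReader(handle):
        label = row.pop(label_column).strip().lower()
        if label not in ("normal", "abnormal"):
            raise ValueError(f"unexpected class label: {label!r}")
        records.append(({name: float(value) for name, value in row.items()}, label))
    return records

def class_means(records):
    """Average every feature separately for each class label."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for features, label in records:
        counts[label] += 1
        for name, value in features.items():
            sums[label][name] += value
    return {
        label: {name: total / counts[label] for name, total in feats.items()}
        for label, feats in sums.items()
    }

records = load_records(io.StringIO(SAMPLE_CSV))
means = class_means(records)
for label, feats in means.items():
    print(label, feats)
```

To read a file on disk instead of the inline sample, pass an open file handle, for example `open(path, newline="")`, to `load_records`.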
Results
The responses observed for the features DNS, SPAM, UDP flow, TCP flow, HTTPS flow and frame rate are plotted in Fig. 1. Figs. 1–3 clearly show that the fluctuation in these features is much higher during an attack scenario than under normal conditions. The blue spline in Fig. 1 shows the normal traffic, whereas the red spline shows the detected attack traffic.
Traffic classification – UDP flow (Normal vs abnormal traffic).
Traffic classification was also carried out on the client side, on a laptop with a minimal configuration (2 GB RAM, Core i3 processor, Ubuntu 14.04 LTS, with the proposed traffic analysis system preinstalled), and the results are plotted in Fig. 2.
Performance comparison – Proposed correlation scheme vs Entropy schemes
Performance comparison – Proposed correlation scheme vs Cross Entropy schemes
Traffic classification – TCP flow (Normal vs abnormal traffic).
The traffic classification results confirm that the proposed traffic analysis system outperforms the state-of-the-art techniques [1, 2, 3], with better accuracy in attack traffic classification.
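The abstract states that classification performance is analyzed with the nearest neighbor method. A minimal sketch of k-nearest-neighbor classification over labeled feature vectors is given here, assuming Euclidean distance and a majority vote with k = 3; the text specifies neither. The training vectors are hypothetical.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Label a query vector by majority vote among its k nearest neighbors.

    train: list of (feature_vector, label) pairs.
    Assumption: Euclidean distance; an odd k avoids ties with two classes.
    """
    if not 1 <= k <= len(train):
        raise ValueError("k must be between 1 and the number of training records")
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical (DNS, UDP flow, TCP flow) vectors with class labels.
train = [
    ((10.0, 40.0, 55.0), "normal"),
    ((14.0, 44.0, 53.0), "normal"),
    ((12.0, 41.0, 57.0), "normal"),
    ((60.0, 35.0, 90.0), "abnormal"),
    ((65.0, 30.0, 95.0), "abnormal"),
    ((58.0, 38.0, 88.0), "abnormal"),
]

print(knn_classify(train, (13.0, 42.0, 54.0)))
print(knn_classify(train, (62.0, 33.0, 91.0)))
```

Because the features are on different scales in practice, they would normally be normalized before distances are computed; that step is omitted here for brevity.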
The performance of the proposed scheme is experimentally tested on a real-time test bed against the state-of-the-art techniques [1, 2, 3]. All schemes used in the test bed are trained according to their own training procedures and configured accordingly. Real-time test data collected in the sandbox is given as input to the proposed scheme and to the other techniques. Furthermore, a few samples of ransomware variants are downloaded from Malwarebytes and executed inside the sandbox; this traffic is also collected and given as input to the proposed scheme. The proposed scheme is used alongside Snort, a real-time NIDPS, in order to detect and prevent the malware.
The experimental results in Tables 1 and 2 confirm that detection accuracy for all the collected traffic is near 100%, with a low False Alarm Rate (FAR). The performance of the proposed traffic analysis model is compared with state-of-the-art models such as entropy-based schemes [1, 2, 3]. The experimental results in Tables 2 and 3, along with the comparison results shown in the Figure, confirm that the proposed model is better than the state-of-the-art traffic classification techniques. The reduced FAR is an added advantage of the proposed scheme.
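For reference, the two reported measures can be computed from confusion-matrix counts as in the sketch below. It assumes the common definitions, accuracy = (TP + TN) / total and FAR = FP / (FP + TN), which the text does not state explicitly. The counts are hypothetical and are not the paper's results.

```python
def detection_metrics(tp, tn, fp, fn):
    """Return accuracy and false alarm rate from confusion-matrix counts.

    Assumed definitions: accuracy = (TP + TN) / total, FAR = FP / (FP + TN),
    where "positive" means traffic flagged as attack.
    """
    total = tp + tn + fp + fn
    if total == 0 or (fp + tn) == 0:
        raise ValueError("counts must include at least one normal sample")
    return {
        "accuracy": (tp + tn) / total,
        "far": fp / (fp + tn),
    }

# Hypothetical counts for illustration only.
metrics = detection_metrics(tp=90, tn=95, fp=5, fn=10)
print(f"accuracy = {metrics['accuracy']:.3f}, FAR = {metrics['far']:.3f}")
```

A low FAR matters when the scheme runs alongside an intrusion prevention system, because every false alarm can block legitimate traffic.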
Conclusion
Traffic classification – DNS flow (Normal vs abnormal traffic).
The experimental validation shows that the proposed traffic analysis system is robust and significantly outperforms the state-of-the-art traffic analysis techniques in terms of detection accuracy and FAR. It can be used alongside any security system without compromising the performance of the network.
