A multi-phased statistical learning based classification for network traffic

Abstract

Application traffic identification is an essential tool for network management, as it is the most widely used approach to distinguish and characterize the network traffic generated by different applications. Classification using conventional port-based and payload-based techniques has become ineffective because of their inconsistent results. In recent times, approaches based on machine learning and statistical techniques have delivered higher accuracy. However, learning techniques struggle with the time and memory complexity of vast datasets. Hence, this paper presents a novel statistical traffic classification scheme, named the Multi-Phased Statistical Based Classification methodology, that applies semi-supervised learning with advanced K-medoid clustering and the C5.0 classification algorithm. The proposed system identifies both known and unknown application flows through a statistical feature utilization scheme that improves classification precision. Further, the experimental results show that the proposed work outperforms previous approaches, achieving an accuracy of 98–99% while reducing complexity. The proposed work is evaluated on our campus traffic traces (AU-IDS). The results indicate that the proposed approach achieves a higher accuracy rate, which encourages its implementation in real time.

Keywords

Communication networks, machine learning, clustering methods, semi-supervised learning, statistical learning

1 Introduction

With the rise of the World Wide Web, Internet usage and the network traffic it generates are growing exponentially. Traffic classification has become an active research area because new encryption and tunneling techniques emerge steadily, making traffic difficult to identify. This has created a pressing need to classify traffic, as classification helps the network manager impose security policies and prioritize applications across limited bandwidth. Network traffic classification is the initial phase of network management, followed by monitoring and control, and is performed to enhance network performance [1, 2]. In previous decades, port-based techniques were used for classification, but they could not cope with the dynamic nature of ports [3–5]. They were superseded by payload-based techniques, which suffer from heavy overhead and fall short in classifying encrypted traffic [6, 7]. More recently, analyzing the statistical parameters of network flows has gained importance because of its higher classification accuracy, adaptability, and light weight. The statistical parameters include packet size measurements (min/max packet length), inter-arrival time, flow duration, bytes transferred, byte counts, the number of packets, and so forth [15, 16]. Initially, traffic classification based on flow statistics relied on unsupervised or supervised machine learning algorithms. The unsupervised algorithm operates by grouping data samples with similar behavior into clusters, with no prior knowledge. However, mapping numerous clusters to a few applications [17] with no prior knowledge is a problematic task [18, 19].

The supervised algorithm trains the classifier by examining a set of labeled training samples and classifies the traffic based on the pre-classified classes built in the training phase [15, 16].

The multi-level cluster analysis (unsupervised), when used in conjunction with compound classification (supervised), promises higher accuracy and completeness in application identification [49–51]. To improve flow classification accuracy, cascaded classification techniques using a combination of algorithms, as well as semi-supervised ML approaches, have recently been explored. Using multiple classifiers and choosing the best decision for each traffic flow through voting, or even combining the results for the final decision, still does not explicitly consider refining the ground-truth data to fully represent the different flow classes (per application) and their subsequent identification. Moreover, combining multiple classifier instances raises scalability issues with respect to their real-time operation.

Current traffic classification techniques perform poorly in the critical situation where supervised information is insufficient and a significant number of unknown flows are present. In particular, more and more new or unknown applications are emerging in cloud-computing-based environments. Robust traffic classification is a significant challenge in real-world complex networks. For example, as new applications emerge rapidly, only an incomplete training data set can be collected and analyzed. Additionally, if the emerging applications are encrypted, it is practically difficult to obtain sufficient training samples through deep inspection in a limited time. These observations motivate our work. With the rapid evolution of modern Internet technologies, new applications have come into existence [26, 28], which has degraded the performance of traffic classification using the above approaches. In this paper, we aim to tackle the problem of unknown flows with a semi-supervised methodology [53–55]. This work considers only a few labeled training samples and examines flow correlation in a real-world network environment, which distinguishes it from previous works.

Therefore, to address these challenges, we have developed a semi-supervised methodology that first clusters the data samples and then feeds the resulting clusters to a classifier as training data, so that new unlabeled data can be classified efficiently.

The principal contributions of our work are as follows:

The offline phase applies advanced k-medoid clustering, which adds one medoid at a time and considers each traffic flow as a candidate for the k-th medoid. The representative objects, known as medoids, are the most centrally located objects in a cluster. An innovative function is defined to locate the next optimal medoid at every iteration. Once the clusters are formed, models are constructed for each application traffic flow. These pre-labeled application models are then retained as prior knowledge for online classification. The advanced clustering algorithm considerably reduces the computational effort for large datasets.

In online classification, the features are extracted, and flows (sessions) are grouped based on 5-tuple information. The clustered flows are fed to the Advanced C5.0 classifier to attain the maximum classification accuracy. If the classifier fails to identify new (unknown) applications, they can be recognized by their 5-tuple information, which provides more accuracy.

In addition, experiments were carried out to evaluate the performance of the proposed work against five other machine learning algorithms; they demonstrate that our work achieves higher accuracy. The remainder of this manuscript is organized as follows. Section 2 reviews the background work on traffic classification. A new technique for application classification is described in Section 3. Section 4 presents the experimental results and discusses the proposed approach alongside the existing ML algorithms. Section 5 presents the conclusion and future extensions. Table 1 lists the applications involved in our work.

Table 1
Traffic Collection in the Web Backbone

SI. NO	Traffic group	Traffic identified	Transport layer protocol utilized
1	Chat	Messenger (MSN, Yahoo)	TCP
2	Bulk Traffic	Database, FTP, AFTP	TCP
3	Web Traffic	Google, Yahoo, Bing, Ask, Facebook, Twitter, Amazon, Flipkart, eBay, LinkedIn, Google+, Blogspot	TCP
4	Mail Traffic	SMTP, POP3, NNTP, IMAP	TCP
5	P2P traffic	BitTorrent, eDonkey, Gnutella	TCP/UDP
6	Attack	IP address scans, Port scans	TCP/UDP
7	Streaming Traffic	QuickTime, Streaming, Games	TCP/UDP
8	Encrypted Traffic	SSL, OpenVPN, SSH, TOR	TCP/UDP
9	VoIP	Skype, Viber	TCP/UDP
10	Others	P2P, Internet TV	TCP/UDP

2 Related work

A few strategies have been proposed to classify network traffic. The research community’s focus has moved towards comprehensive surveys on traffic classification [15, 16], which indicates the interest in this domain. The following subsections present the state-of-the-art approaches in the field of traffic classification.

2.1 Port-based traffic classification

The traditional method of traffic classification used port numbers to identify the network application [3]. Though many applications use well-known ports defined by IANA, not all applications have registered port numbers. Hence, the accuracy of this approach has declined seriously with the evolution of applications such as P2P and online gaming, which don’t have registered port numbers but use random ones, making port-based categorization impossible. Consequently, port-based classification techniques are now considered obsolete, given the common obfuscation techniques and the dynamic range of ports used by applications.
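The limitation described above can be illustrated with a minimal sketch, assuming a small hand-written port table. The table `WELL_KNOWN_PORTS` and the function `classify_by_port` are hypothetical names introduced here for illustration; the table is a small subset of well-known ports, not a complete registry.

```python
# Illustrative subset of well-known ports (an assumption, not a full registry).
WELL_KNOWN_PORTS = {
    21: "FTP",
    22: "SSH",
    25: "SMTP",
    53: "DNS",
    80: "HTTP",
    110: "POP3",
    143: "IMAP",
    443: "HTTPS",
}


def classify_by_port(src_port: int, dst_port: int) -> str:
    """Return the application registered for either port, else 'unknown'."""
    # Check the destination port first, since servers usually listen there.
    for port in (dst_port, src_port):
        if port in WELL_KNOWN_PORTS:
            return WELL_KNOWN_PORTS[port]
    return "unknown"


web_flow = classify_by_port(51514, 80)        # flow to a registered port
random_flow = classify_by_port(51514, 51413)  # flow between two dynamic ports
print(web_flow, random_flow)
```

Any application that picks its ports dynamically falls through to `unknown`, which is the failure mode that motivates the payload-based and statistical approaches discussed next.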

2.2 Payload-based traffic classification

It is an alternative to the port-based technique that examines packet payload signatures to identify the network traffic [6–9]. It is implemented using Deep Packet Inspection (DPI), which promises high accuracy. However, it requires the signatures to be updated with every update of the application, and it does not work well with encrypted traffic.

2.3 User-behavior based traffic classification

User-behavior techniques are highly promising and offer good classification accuracy with reduced overhead compared with payload inspection techniques. However, they focus on end-point activity and require parameters from multiple flows to be collected and analyzed before an application can be identified. The user-behavior-based technique analyzes network traffic based on host behavior. It works on three main levels, (1) social level, (2) functional level, and (3) application level, to recognize the host’s behavior [10–12]. This method associates host behavior with one or more traffic applications and refines the association through heuristics and behavior stratification. However, it requires numerous flows to classify the traffic accurately, and it cannot handle traffic in different groups with similar behavior.

2.4 IP-level traffic classification

Following the idea of the port-based technique, the IP-level traffic classification approach uses the information provided by well-known IP addresses to classify network traffic. It learns the well-known IP addresses belonging to popular Internet applications in an offline classification phase [13].

2.5 Service-based technique

This technique [14] examines the triplet features <IP address, port number, protocol> of the traffic flows and matches them against the flows of a labeled data set in offline mode to classify the network traffic. However, periodic updates are required to maintain high accuracy.

2.6 Statistical flow feature-based approach

The statistical flow-based approach has gained momentum owing to its efficiency and scalability in classification. It exploits application diversity and the inherent traffic footprints (flow parameters) to characterize traffic, and it derives classification benchmarks through data mining techniques to identify individual applications. Initially, it extracts the statistical features from a labeled dataset and generates different structures (e.g., clusters, classifiers, decision trees) as output, which are used to classify new traffic flows. Conventional statistical flow-level classification can be further divided by the type of machine learning algorithm used, that is, supervised or unsupervised.

2.6.1 Unsupervised machine learning approach

Unsupervised ML performs clustering that detects hidden structures or patterns in applications. It categorizes the unlabeled traffic records and constructs clusters to learn more about their similar and dissimilar behaviors. Zander et al. [17] suggested the AutoClass clustering method for network traffic classification. L. Yingqiu et al. [18] evaluated traffic flows by statistical features for traffic identification. McGregor et al. [19] suggested the Expectation-Maximization (EM) algorithm to cluster the traffic flows into a few groups and assigned the application names to the traffic groups manually. Bernaille et al. [20] and J. Erman et al. [25] proposed the K-Means clustering algorithm for grouping the traffic and assigning application names with a payload analysis tool. J. Erman et al. [21] proposed the DBSCAN clustering algorithm, and D. Liu et al. [22] applied the Fuzzy C-Means clustering algorithm for application identification. Y. Wang et al. [23] proposed statistical features for flow clustering and matched the resulting clusters with applications by the payload signature method. Finamore et al. [24] proposed combining statistical feature clustering with payload statistical feature clustering. These approaches uncovered interesting patterns hidden in the traffic.

2.6.2 Supervised machine learning approach

Supervised ML requires labeled training samples to build the traffic classifier. Moore et al. [26] suggested a Naive Bayes algorithm that identifies the application names of testing data sets from pre-labeled training samples. Kim et al. [27] and Este et al. [32] suggested the support vector machine (SVM) algorithm for labeling network traffic applications based on flow-based statistical features. Auld et al. [28] proposed a Bayesian Neural Network that used several statistical flow features for network traffic classification. Haffner et al. [29] proposed a machine learning technique for automated application signatures that combined different statistical characteristics. Bernaille et al. [30] proposed a supervised classification scheme that uses the statistical features of the first few packets. Bonfiglio et al. [31] suggested Naive Bayes and Pearson’s Chi-Square test for classifying network traffic. Crotti et al. [33] proposed protocol fingerprints based on the probability density function (PDF), and Valenti et al. [34] used the packet mean over small time windows to assign application names to network traffic. J. But et al. [35] used five machine learning algorithms to construct the classification model and examine its performance. In Chun-Nan Lu et al. [36], the offline phase of the traffic classification scheme is used to pre-process and generate training data for online classifiers as test cases. T. Bujlow et al. [37] suggested the C4.5 supervised algorithm, which relates packets and flow records to application usage. The supervised machine learning approach achieves high classification accuracy for known applications. However, it performs poorly on new applications.

2.6.3 Semi-supervised machine learning approach

The multi-level cluster analysis (unsupervised), when used in conjunction with compound classification (supervised), promises higher accuracy and completeness in application identification. Valentin et al. [38] proposed Deep Packet Inspection (DPI) techniques for the system model and used the C5.0 classifier for network traffic classification. Erman et al. [39] examined the effectiveness of supervised Naive Bayes and the AutoClass clustering algorithm. In A. B. Mohammad et al. [40], the offline stage of the traffic classification system is used to pre-process and produce training data for online classifiers as test cases. J. Zhang et al. [42] proposed a semi-supervised approach focusing only on new applications. Erman et al. [43] proposed a probabilistic assignment in the unsupervised clustering algorithm for assigning application names to clusters. M. Hall et al. [44] applied the data mining tool WEKA, combining unsupervised and supervised algorithms to estimate the efficiency of each machine learning algorithm.

2.7 Our proposed multi-phased statistical based classification approach

Our work is designed to handle even new application traffic in the university campus network with reduced complexity. This is achieved through the proposed semi-supervised ML approach. Our scheme contains two main phases: the Offline Pre-Learning Phase and the Online Classification Phase. Figure 1 portrays the overall idea of the proposed work. The main aim of the offline classification is to discover application models that are distinct from those of other applications. Consecutive traffic packets with the same 5-tuple information are collected, and all the Statistical Disseminations (Table 3) are counted to find the differences between the applications. A valid number of Statistical Disseminations is taken for each application to improve the classification accuracy. Invalid packets are omitted based on payload size, and only legitimate packet counts are taken for each application to reduce the time and space complexity. Using the Statistical Dissemination and its proportion, the advanced K-Medoid clustering algorithm clusters similar flows with an innovative function (distance matrix), and models (representatives) are created for each application. In the online classification phase, the continuous traffic flows are captured, and flows of the same session are grouped based on 5-tuple information. The grouped application flows are then sent to the Advanced C5.0 classifier, which is trained to build a decision tree capable of classifying the network traffic accurately based on prior knowledge of the application models. Each module of our proposed work is described in the following sections.

Fig. 1

Framework architecture of multi-phased statistical classification scheme.

Algorithm 1 Offline Pre-Classification Phase
Input: Training Network Traffic Flows
Output: Clustering Results
procedure Traffic Application Protocol Thumbprint(T, A_ai)
    for i ← 1, n do
        T ← {A_ai}
        for j ← 1, m do
            $A_{ai} \leftarrow \underset{A}{argmax} {P (SD [j], SDP [j])}$
        end for
    end for
end procedure
procedure Advanced K-Medoid Clustering(SD, SDP, D_ij, AM_i)
    for a ← 1, EOF do
        for i ← 1, n do
            for j ← 1, m do
                Distance metric D_ij is
                    $\sqrt{({SD}_{ia} - {SD}_{ja})^{2} + ({SDP}_{ia} - {SDP}_{ja})^{2}} + \sqrt{({SD}_{ia})^{2} + ({SDP}_{ja})^{2}}$
                Choose Initial Medoid
                k ← Most Centered Medoid
                if k > Initial Medoid
                    Renovate Medoid
                    k ← New Medoid (Min Dist.)
                else
                    Allocate items to Medoid
                    k ← Nearest Medoid
                end if
                Application Model AM_i is
                    ${AM}_{i} \leftarrow \frac{D_{ij}}{D_{ja}}$
                Sort AM_i in ascending order
            end for
        end for
    end for
end procedure

Algorithm 2 Online Classification Phase
Input: Testing Network Traffic Flows
Output: Classification Results
procedure Application Individual Flow Cluster Grouping(T_e, A_ai)
    for a ← 1, EOF do
        for i ← 1, n do
            each app A_ai contains x_n flows
            A_ai ← x_i
            T_e contains 5-Tuple information
            T_e ← y_li
            Group the correlated flows by 5-tuple information
            A_ai ← A_ai ∪ {x_i : x_i+1 ∈ A_ai} ∃ y_li ∈ T_e is correlated to y_li+1
        end for
    end for
end procedure
procedure Advanced C5.0 Classifier(CSI_stat, FS_i, AM_i, A_ai)
    for i ← 1, n do
        for r ← 1, n do
            CSI_stat ← FS_i
            for P_r ∈ FS_i, n do
                FS_i ← P_r
            end for
        end for
    end for
    Construct Advanced C5.0 Classifier
    Classifier ← Application Model AM_i + Application Individual Flow Clusters A_ai
end procedure

3 Multi-phased statistical classification scheme

Traffic classification begins with the initial offline pre-classification phase, which comprises two main modules: Traffic Application Protocol Thumbprint and Advanced K-Medoid Clustering. The online classification phase includes session grouping and application classification, using prior knowledge of pre-labeled traffic flows. The detailed process of the Multi-Phased Statistical Classification Scheme is described in Algorithms 1 and 2.

3.1 Offline pre-classification phase

The following subsections describe the procedure involved in offline pre-classification.

3.1.1 Traffic Application Protocol Thumbprint

By observing the behavior of traffic, each application is distinguished by its Statistical Dissemination. Flows belonging to the same application have similar Statistical Dissemination, whereas flows belonging to different applications differ. Each steady traffic flow is labeled with its Statistical Dissemination and its proportion (the protocol Thumbprint). Only the maximum availability of Statistical Dissemination and its proportion is considered for each traffic flow, to avoid large memory consumption. The following equations express the maximum availability of Statistical Dissemination and its proportion for the traffic flows.

$T = \{A_{a1}, A_{a2}, \dots, A_{an}\}$ (1)

$A_{ai} = \underset{A}{\arg\max} \sum_{j = 1}^{m} P(SD[j], SDP[j])$ (2)

In (1), T represents the collection of different application traffic A_ai in the network measurement, and in (2), SD and SDP denote the maximum availability of Statistical Dissemination and its proportion. For better performance, however, the protocol Thumbprint has to be updated frequently.

3.1.2 Advanced K-Medoid Clustering

In our work, a new form of the advanced k-Medoid algorithm is proposed, because the K-means algorithm makes the selection of starting points difficult and can only be used on trivial datasets. To address this problem, an innovative function is used to find the starting points for the next optimal cluster center. The proposed clustering algorithm tends to choose the k most centrally positioned objects as the initial medoids. The distance matrix is calculated once and is used for finding new clusters at each iterative step.

Suppose that n traffic flows having m Statistical Dissemination features are to be grouped into k (k < n) clusters, where k is assumed to be given. Let j index the Statistical Dissemination of traffic flow i (i = 1...n; j = 1...m). The distance metric (3) used to measure the similarity between traffic flows i and j is given by

$D_{ij} = \sum_{j = 1}^{m} \sqrt{({SD}_{ia} - {SD}_{ja})^{2} + ({SDP}_{ia} - {SDP}_{ja})^{2}} + \sum_{j = 1}^{m} ({SD}_{ia})^{2} + ({SDP}_{ja})^{2}$ (3)

The algorithm then consists of the following three steps.

Step 1: Choose Initial Medoid. Calculate the distance between every pair of objects based on the chosen similarity measure. Calculate the Application Model AM_i (4) for traffic flow i as follows:

${AM}_{i} = \sum_{j = 1}^{m} \frac{D_{ij}}{\sum_{a = 1}^{n} D_{ja}}, \quad j = 1 \dots m$ (4)

Sort AM_i in ascending order and select the k objects with the k smallest values as the initial medoids. Obtain the initial cluster result by assigning each object to the nearest medoid. Calculate the sum of distances from all objects to their medoids. In this way, the k most central objects are chosen as the initial medoids.

Step 2: Renovate Medoid. Find a new medoid for each cluster, namely the object that minimizes the total distance to the other objects in its cluster. Update the current medoid in each cluster by replacing it with the new medoid.

Step 3: Allocate items to Medoid. Assign each object to the nearest medoid and obtain the clustering result. Calculate the sum of distances from all objects to their medoids. If the sum is equal to the previous one, stop the algorithm; otherwise, return to Step 2. The Advanced k-Medoid algorithm is preferred over other clustering methods because of its simplicity, lower sensitivity to outliers, and computational efficiency.
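The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: plain Euclidean distance stands in for the distance metric (3), each flow is reduced to a two-value feature vector, and all function names (`pairwise_distances`, `initial_medoids`, `assign`, `update_medoids`, `k_medoids`) are hypothetical.

```python
import math


def pairwise_distances(points):
    """Distance matrix, computed once and reused (Euclidean is an assumption)."""
    return [[math.dist(p, q) for q in points] for p in points]


def initial_medoids(dist, k):
    """Step 1: score each object in the spirit of Eq. (4), keep the k smallest."""
    n = len(dist)
    row_sums = [sum(row) for row in dist]
    scores = [
        sum(dist[i][j] / row_sums[j] for j in range(n) if row_sums[j] > 0)
        for i in range(n)
    ]
    return sorted(range(n), key=lambda i: scores[i])[:k]


def assign(dist, medoids):
    """Step 3: attach each object to its nearest medoid and total the distances."""
    labels = [min(medoids, key=lambda m: dist[i][m]) for i in range(len(dist))]
    cost = sum(dist[i][m] for i, m in enumerate(labels))
    return labels, cost


def update_medoids(dist, labels, medoids):
    """Step 2: in each cluster, pick the member with the least total distance."""
    updated = []
    for m in medoids:
        members = [i for i, lab in enumerate(labels) if lab == m]
        if not members:  # empty cluster: keep the old medoid
            updated.append(m)
            continue
        updated.append(min(members, key=lambda c: sum(dist[c][o] for o in members)))
    return updated


def k_medoids(points, k, max_iter=100):
    dist = pairwise_distances(points)
    medoids = initial_medoids(dist, k)
    labels, cost = assign(dist, medoids)
    for _ in range(max_iter):
        medoids = update_medoids(dist, labels, medoids)
        labels, new_cost = assign(dist, medoids)
        if math.isclose(new_cost, cost):  # total distance unchanged: stop
            break
        cost = new_cost
    return medoids, labels, cost


# Six toy flows forming two well-separated groups (illustrative values only).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoids, labels, cost = k_medoids(points, k=2)
print(medoids, labels, cost)
```

In this sketch each label is the index of the medoid that represents the flow's cluster. Because the distance matrix is built once, every later step only performs lookups, which is the source of the computational saving claimed for the method.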

3.1.3 Application model

The application model A_ai for each application is stored in a table for online classification. Using different application models may produce entirely different classification results. It should be noted that application models may vary across stages.

3.2 Online classification phase

The following subsections describe the methodology involved in the Online Classification.

3.2.1 Application individual flow cluster grouping

Initially, statistical features are used to group the individual traffic flows per application. The statistical flow features are extracted from the packet header instead of the packet payload, which avoids deep packet inspection and its massive overhead. Each flow is identified by its 5-tuple data: source IP address, destination IP address, source port number, destination port number, and protocol. The individual traffic flows are grouped by the 5-tuple data captured from the IP packet header. If the source and destination IP addresses of two different flows are the same and their source port numbers are consecutive (the operating system assigns consecutive port numbers when connecting to a remote host), the correlated flows are grouped into a cluster. The following equation defines the associated flows of an application.

$A_{ai} = \{x_{1}, x_{2}, x_{3}, \dots, x_{n}\}; \quad 1 \leq i \leq n$ (5)

In (5), each application A_ai contains the flows x_1 to x_n. The following equation gives the tuple information T_e of the different flows of an application.

$T_{e} = \{y_{l1}, y_{l2}, \dots, y_{ln}\}$ (6)

In (6), y_li represents the tuple information of an individual flow. An automatic process then groups the correlated traffic flows of application A_ai by their tuple information.

$A_{ai} = A_{ai} \cup \{x_{i} : x_{i+1} \in A_{ai}\} \;\; \exists\, y_{li} \in T_{e} \text{ is correlated to } y_{l(i+1)}$ (7)

In (7), each correlated traffic flow is clustered based on 5-tuple information and grouped with similar individual flows. Table 2 describes the experimental results of correlated individual flows.

Table 2

Grouping of individual flows using 5-Tuple Data

Source IP Address	Destination IP Address	Source Port Number	Protocol	Grouping of Flows (Flow_Id)
10.1.173.146	192.168.2.51	55862	TCP	X
10.1.173.146	192.168.2.51	50006	TCP	X
10.1.173.146	192.168.2.51	58831	TCP	X
10.1.173.146	192.168.2.51	50013	TCP	X
10.1.173.146	192.168.2.51	52568	TCP	X
10.1.173.146	192.168.2.51	53475	TCP	X
10.1.173.146	192.168.2.51	49992	TCP	X+1
10.1.173.146	192.168.2.51	49985	TCP	X+1
10.1.173.146	192.168.2.51	45163	TCP	X+1
10.1.173.146	192.168.2.51	40170	TCP	X+1
10.1.173.146	74.125.236.181	30172	TCP	X+2
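The grouping rule described in this subsection can be sketched as follows. This is an illustrative reading rather than the authors' procedure: flows sharing source IP, destination IP, and protocol are sorted by source port, and a new group starts whenever the port gap exceeds `max_port_gap`. The function name, the `max_port_gap` parameter, and the port values in the example are assumptions made for illustration.

```python
from collections import defaultdict


def group_flows(flows, max_port_gap=1):
    """Group flows given as (src_ip, dst_ip, src_port, dst_port, protocol)."""
    by_endpoints = defaultdict(list)
    for flow in flows:
        src_ip, dst_ip, _src_port, _dst_port, proto = flow
        by_endpoints[(src_ip, dst_ip, proto)].append(flow)

    groups = []
    for same_hosts in by_endpoints.values():
        ordered = sorted(same_hosts, key=lambda f: f[2])  # sort by source port
        current = [ordered[0]]
        for flow in ordered[1:]:
            # Consecutive source ports suggest the same session burst.
            if flow[2] - current[-1][2] <= max_port_gap:
                current.append(flow)
            else:
                groups.append(current)
                current = [flow]
        groups.append(current)
    return groups


# Hypothetical flows; the port numbers are made up for the example.
flows = [
    ("10.1.173.146", "192.168.2.51", 50006, 80, "TCP"),
    ("10.1.173.146", "192.168.2.51", 50007, 80, "TCP"),
    ("10.1.173.146", "192.168.2.51", 50008, 80, "TCP"),
    ("10.1.173.146", "192.168.2.51", 52568, 80, "TCP"),
    ("10.1.173.146", "74.125.236.181", 30172, 443, "TCP"),
]
groups = group_flows(flows)
print([len(g) for g in groups])
```

Raising `max_port_gap` loosens the notion of "consecutive", which may be useful when other processes on the host consume port numbers in between.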

3.2.2 Advanced C5.0 classifier

At this stage, the Advanced C5.0 classifier builds a decision tree from the pre-trained data samples and the application flow cluster grouping. C4.5 [40] is the precursor of the C5.0 algorithm, which predicts the dependent feature set to produce the most accurate traffic classification. Initially, the Cardinality Set Information (CSI) maintains the statistics of the independent feature sets extracted and recorded from the campus network, which is useful in building a distinct decision tree. The clustered individual flow records are then supplied to the Advanced C5.0 classifier as supervised knowledge to construct a decision tree. Table 3 lists the independent flow feature set information (CSI). The following equations represent the cardinality feature set given to the Advanced C5.0 classifier.

${CSI}_{stat} = \{{FS}_{1}, {FS}_{2}, {FS}_{3}, \dots, {FS}_{n}\}$ (8)

${FS}_{i} = \sum_{r = 1}^{n} \sum_{P_{r} \in {FS}_{i}}^{n} P_{r}$ (9)

In (8), FS_i denotes the independent feature set data, and in (9), P_r denotes the properties of each independent feature set. The classifier then uses the independent flow feature set FS_i together with the pre-trained data samples to build the decision tree. The Advanced C5.0 classifier focuses on finding the dominant feature set that identifies each network traffic class at every iteration. Finally, the decision tree is built by separating the different application traffic from the entire real-time data set. The decision tree is used to classify subsequent test cases by the generated rules. The C5.0 classifier provides a single command-line interface used to create the decision tree and rules and then to test the classifier with new sets of test cases. To enhance efficiency and accuracy, Advanced C5.0 offers advanced options for boosting, pruning, and winnowing [36]. The techniques available in the Advanced C5.0 classifier are:

3.2.3 Boosting method

A combined set of classifiers is used to predict the final traffic class by tallying each classifier’s vote.
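The vote tally can be illustrated with a minimal sketch, assuming each classifier's vote carries a numeric weight (equal weights reduce to a simple majority vote). The function `weighted_vote` and the example weights are hypothetical and do not reproduce the internals of C5.0.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Sum the weight behind each predicted class and return the winner."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# Three hypothetical classifiers voting on one flow.
winner = weighted_vote(["HTTP", "FTP", "HTTP"], [0.2, 0.5, 0.4])
print(winner)
```

Two weaker classifiers that agree can outvote a single stronger one, which is how a boosted ensemble corrects the mistakes of an individual tree.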

3.2.4 Misclassification costs

The branches of the decision tree that cause errors are pruned at each iteration.

3.2.5 Winnowing method

In the sampling and cross-validation phase, feature sets that have a low predictive ability in classification are identified and removed.
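The idea of discarding weakly predictive features can be illustrated with a simple filter. Note the swapped-in technique: this sketch ranks features by information gain and drops those below a threshold, which is a stand-in chosen for illustration and not the winnowing procedure that C5.0 itself uses. The names `entropy`, `information_gain`, `winnow`, and the `threshold` value are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy reduction obtained by splitting the labels on a categorical feature."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [lab for x, lab in zip(values, labels) if x == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def winnow(features, labels, threshold=0.1):
    """Keep only the features whose information gain reaches the threshold."""
    return [
        name
        for name, values in features.items()
        if information_gain(values, labels) >= threshold
    ]


# Toy data: the protocol is constant, so it cannot separate the classes.
labels = ["HTTP", "HTTP", "FTP", "FTP"]
features = {
    "protocol": ["TCP", "TCP", "TCP", "TCP"],
    "size_bin": ["small", "small", "large", "large"],
}
kept = winnow(features, labels)
print(kept)
```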

The following section presents the experimental results of the proposed work and the comparison with other machine learning algorithms for traffic classification.

4 Experimental results and discussion

This section covers the experimental setup of the proposed scheme and discusses its results. The performance is then compared with existing approaches. The system requirements for the experiments include a standard system with an Intel Core 3 Duo Processor 2.20GHz, 4.00 GB RAM, Microsoft Windows 13, and the Linux Ubuntu 20.04 operating system to run the proposed scheme. We used the R data mining tool to implement the proposed approach.

4.1 Dataset collection

In this stage, traffic traces are collected from 17 well-known web applications such as HTTP, FTP, DNS, eDonkey, Skype, and SMTP, as shown in Figure 5. A traffic filter is used to extract the traffic flows automatically based on standard configurations that filter out irrelevant traffic [45–47]. The network traces are captured from AU-IDS using the Wireshark packet sniffer, producing a total of 35GB of data. Equal proportions of the network traffic flows are used for the training and testing phases. A few feature sets with 20 Statistical Disseminations from each traffic flow, listed in Table 3, are taken for evaluation; the 20 Statistical Dissemination features are extracted using our GCC program.

Table 3
Feature sets used by the Advanced C5.0 classification technique

Sl. No	Feature set	Approach	Parameters used by the technique
1	Set1	IP-Based Traffic Identification	1 Source IP address, Destination IP address.
2	Set2	Service-Based Traffic Identification	1 Source IP address, Destination IP address.
			2 Source Port (No, Labels), Destination Port (No, Labels)
			3 Transport Layer Protocol (TCP, UDP)
3	Set3	Statistical Feature-Based Information	1 Data Speed and Packet Speed.
			2 Inter-arrival time Statistics (Minimum, Mean, Maximum and Standard deviation).
			3 Packet size Statistics (Minimum, Mean, Maximum and Standard deviation)
			4 Total Packet Calculation
			5 Maximum Bytes Transmitted
			6 Flow Length.
			7 Flow size packets and bytes.
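To make the Set 3 parameters in Table 3 concrete, the following minimal sketch computes comparable statistics for a single flow. It is an illustration only, not the GCC-built extractor used in this work; the input layout (a list of `(timestamp, size)` tuples) and all field names are assumptions.

```python
import statistics


def flow_features(packets):
    """Compute Set 3-style statistics for one flow.

    packets: list of (timestamp_seconds, size_bytes) tuples in arrival order.
    """
    if len(packets) < 2:
        raise ValueError("need at least two packets to compute inter-arrival times")

    times = [t for t, _ in packets]
    sizes = [s for _, s in packets]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    duration = times[-1] - times[0]
    total_bytes = sum(sizes)

    def summary(values):
        # Minimum, mean, maximum and (population) standard deviation.
        return {
            "min": min(values),
            "mean": statistics.mean(values),
            "max": max(values),
            "sd": statistics.pstdev(values),
        }

    return {
        "packet_size": summary(sizes),
        "inter_arrival": summary(gaps),
        "total_packets": len(packets),
        "total_bytes": total_bytes,
        "duration": duration,
        # Rates are undefined for a zero-length flow, so report None.
        "bytes_per_second": total_bytes / duration if duration > 0 else None,
        "packets_per_second": len(packets) / duration if duration > 0 else None,
    }
```

For example, a flow with packets `[(0.0, 100), (1.0, 300), (3.0, 200)]` has three packets, 600 bytes in total, a duration of 3 seconds, and inter-arrival times of 1 and 2 seconds.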

4.2 Implementation of training phase

The real-time Anna University campus traffic traces (AU-IDS) are used to evaluate the performance of the proposed algorithm. The AU-IDS dataset comprises 17 traffic applications with almost 10T traffic flows in each application. Each steady traffic flow is labeled with its Statistical Dissemination and its proportion (protocol thumbprint). Only the most frequently occurring Statistical Dissemination values and their proportions are kept for each traffic flow, to avoid consuming a large amount of memory. The steady traffic flows of each application are examined and grouped into their respective traffic class by the clustering algorithm.

4.2.1 Selection of initial medoid

The proposed selection of the initial medoids is compared with the following alternative methods.

4.2.2 Systematic selection (SS) method

Sort all objects in order of their Statistical Dissemination values. Divide the range of values into equal intervals and randomly select the K objects from these intervals.

4.2.3 Random selection (RS) method

Randomly select K objects from all traffic applications.

4.2.4 Outmost object (OO) selection method

Select the K objects that lie farthest from the cluster mean.

4.2.5 Sampling (SM) method

Randomly take a 25% sample from all applications and perform an initial clustering on the sampled applications using the Advanced K-Medoid clustering algorithm. The resulting k medoids are used as the initial medoids.

The accuracy rates of the different initial-medoid selection methods are compared with that of the Proposed Method (PM) in Fig. 2. Four test cases are used for each algorithm: 1 × 10⁵, 1.5 × 10⁵, 2.5 × 10⁵, and 3 × 10⁵ traffic flows. Figure 2 shows that the accuracy rates of SS and OO are inadequate compared with those of RS and SM. The accuracy rate of each test case indicates the proportion of correctly identified traffic applications. The accuracy rate increases with larger test cases for every method. The accuracy rates of RS and SM are roughly equal but lower than that of the Proposed Method. It may be concluded that the PM achieves higher accuracy than the other methods for selecting the initial medoids.

Fig. 2

Accuracy rate of test cases for different selection methods.
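Three of the baseline selection methods compared above (RS, SS, and OO) can be sketched as follows for a single numeric feature such as mean packet size. This is a simplified illustration under stated assumptions: the objects are one-dimensional values, and SS is interpreted as drawing one object from each of K equal-width intervals. The function names are hypothetical, and the Proposed Method itself is not shown.

```python
import random


def random_selection(values, k, rng):
    """RS: pick k objects uniformly at random."""
    return rng.sample(values, k)


def systematic_selection(values, k, rng):
    """SS: sort the objects, split the value range into k equal-width
    intervals, and pick one object at random from each non-empty interval."""
    ordered = sorted(values)
    lo, hi = ordered[0], ordered[-1]
    width = (hi - lo) / k or 1.0  # avoid a zero width when all values are equal
    buckets = [[] for _ in range(k)]
    for v in ordered:
        idx = min(int((v - lo) / width), k - 1)  # the maximum falls in the last bucket
        buckets[idx].append(v)
    return [rng.choice(b) for b in buckets if b]


def outmost_selection(values, k):
    """OO: pick the k objects farthest from the mean."""
    mean = sum(values) / len(values)
    return sorted(values, key=lambda v: abs(v - mean), reverse=True)[:k]
```

Passing an explicit `random.Random(seed)` instance as `rng` keeps the random methods reproducible across runs, which matters when accuracy rates of different selection methods are compared.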

4.2.6 Evaluation of Advanced K-Medoid Clustering

We applied the proposed method with K = 17 to the real-time data with known traffic labels. The Set 3 attributes listed in Table 3 are used for clustering. A vector representation of the traffic classes is specified for each application. The Advanced K-Medoid algorithm, implemented in the R programming language, is applied to the individual flows to map them to their respective traffic classes. Flows belonging to the same application have similar Statistical Dissemination, whereas flows belonging to different applications diverge, as shown in Fig. 3 for packet size and in Fig. 4 for inter-arrival time. Additionally, Fig. 5 shows that different applications have different packet sizes. The individual traffic flows are grouped into 17 (k) clusters, one per class, and the clustering accuracy against the actual classes is shown in Fig. 6. The accuracy is computed by dividing the number of correctly identified traffic classes by the total number of traffic classes. Refining the clusters generated within the different flow classes is left for future work. After each flow is grouped into its traffic class, each application is stored in a CSV file for online classification. The Advanced C5.0 classifier is fed all 17 Internet traffic classes, split in equal proportions (half each) between the training and testing stages.

Fig. 3

Packet size distribution of SKYPE, FACEBOOK AND YOUTUBE traffic classes.

Fig. 4

Inter arrival distribution of YOUTUBE AND SKYPE.

Fig. 5

Average packet size for various applications.

Fig. 6

Accuracy Rate of Each Application by Advanced K-Medoid Algorithm.
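The clustering and accuracy computation described in this subsection can be approximated with a plain k-medoids loop. The sketch below is not the Advanced K-Medoid algorithm of this work: it uses a textbook alternating update with Manhattan distance, and the function names and the choice of distance are assumptions made for illustration.

```python
def manhattan(a, b):
    """Manhattan distance between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def assign(points, medoids):
    """Return, for each point, the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda m: manhattan(p, medoids[m]))
            for p in points]


def k_medoids(points, initial_medoids, max_iter=100):
    """Alternate between assigning points and re-selecting medoids."""
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        labels = assign(points, medoids)
        new_medoids = []
        for m in range(len(medoids)):
            members = [p for p, lab in zip(points, labels) if lab == m]
            if not members:
                new_medoids.append(medoids[m])  # keep the medoid of an empty cluster
                continue
            # The medoid is the member with the smallest total distance to the rest.
            new_medoids.append(
                min(members, key=lambda c: sum(manhattan(c, q) for q in members)))
        if new_medoids == medoids:
            break  # converged: no medoid changed
        medoids = new_medoids
    return medoids, assign(points, medoids)


def accuracy(predicted, actual):
    """Fraction of items whose predicted class equals the actual class."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)
```

In the setting of this paper the points would be Set 3 feature vectors and `initial_medoids` would hold 17 entries; the cluster indices would still need to be mapped to application labels before `accuracy` is meaningful.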

4.3 Implementation of testing phase

Correlated traffic flows are clustered based on their 5-tuple information and grouped as similar individual flows. Because operating systems (Microsoft Windows 13, Linux Ubuntu 20.04, Microsoft Windows XP, Microsoft Windows, and Vista) allocate resources to requests quickly, they assign consecutive port numbers to the individual flows of an application, as shown in Fig. 7. This observation is used only to identify the relationship among large numbers of traffic flows. The aim of our scheme is to identify the similar flow records of each application, so that the classifier is trained with a complete flow footprint of each application and can therefore classify network traffic efficiently.

Fig. 7

Identifying Consecutive Port Numbers used by the application.
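A minimal sketch of the grouping idea: packets are first grouped into flows by their 5-tuple, and runs of consecutive source ports between the same pair of hosts are then reported as candidate flows of one application. The record layout (dictionaries with `src_ip`, `src_port`, `dst_ip`, `dst_port`, and `proto` keys) and the function names are assumptions for illustration, not the implementation used in this work.

```python
from collections import defaultdict


def group_by_five_tuple(packets):
    """Group packet records into flows keyed by the 5-tuple."""
    flows = defaultdict(list)
    for pkt in packets:
        key = (pkt["src_ip"], pkt["src_port"],
               pkt["dst_ip"], pkt["dst_port"], pkt["proto"])
        flows[key].append(pkt)
    return dict(flows)


def consecutive_port_runs(flow_keys):
    """For each (src_ip, dst_ip, proto), list runs of consecutive source ports."""
    ports = defaultdict(set)
    for src_ip, src_port, dst_ip, _dst_port, proto in flow_keys:
        ports[(src_ip, dst_ip, proto)].add(src_port)

    runs = {}
    for host_pair, port_set in ports.items():
        ordered = sorted(port_set)
        current = [ordered[0]]
        found = []
        for port in ordered[1:]:
            if port == current[-1] + 1:
                current.append(port)  # extend the current run
            else:
                if len(current) > 1:
                    found.append(current)
                current = [port]  # start a new run
        if len(current) > 1:
            found.append(current)
        runs[host_pair] = found
    return runs
```

Only runs of at least two ports are reported, since a single port carries no evidence of consecutive allocation.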

4.3.1 Evaluation of advanced C5.0 classifier

The traffic flows labeled with suitable traffic classes by Advanced K-Medoid clustering are fed to the classifier for training. Through repeated boosting and pruning, Advanced C5.0, implemented in R, builds a decision tree that classifies online traffic accurately during the testing phase. The attributes selected for classifier learning should be closely related both to the individual flows, so that dissimilar traffic flows can be distinguished, and to the clustering algorithm that separates the traffic classes. As listed in Table 3, Set 1 holds IP-level traffic classification attributes (source and destination IP addresses), Set 2 holds Service-Based Traffic Identification attributes (source and destination IP addresses, source port number, destination port number (0–1023 known, >1023 unknown), and protocol (TCP/UDP)), and Set 3 holds statistical feature-based information (statistical flow properties). The Advanced C5.0 machine learning algorithm was trained and tested with feature sets 1 to 3. Feature Set 1, which holds only IP addresses, yields an accuracy of 79.5% and an error rate of 20.5%. Feature Set 2 holds three different flow characteristics and yields an accuracy of 82.13%, higher than that of Set 1, with an overall error rate of 17.86%. Feature Set 3 holds 15 different flow characteristics and yields a considerably higher accuracy of 98% compared with the other feature sets, with an error rate of 1.9%. The evaluation time for the experiments is around 5.0 to 7.9 seconds for each feature selection. The accuracy rate can be improved by using several cases for the training and testing stages. Figure 9 shows the classification table obtained with Feature Set 3 for training and classifying online traffic in R. The attribute usage, error rate, and accuracy rate for each feature set are given in Table 4.

Fig. 8

Comparison of Proposed Scheme With Other Algorithms.

Fig. 9

Classification table for the collected traffic.

Table 4

Error rate and attribute usage in the Advanced C5.0 classifier

Feature set	Technique	Parameters used by the Technique	Attribute usage (%)	Errors (%)	Accuracy Rate (%) Trials=100
Set1	IP Level Traffic Classification	1 Source IP address, Destination IP address.	87.28%	20.5%	79.5%
Set2	Service-Based Traffic Identification	1 Source IP address, Destination IP address.	87.28%	20.5%	79.5%
		2 Source Port (No, Labels), Destination Port (No, labels)	100%	20.0%	80.0%
		3 Transport Layer Protocol (TCP, UDP)	84.45%	14.2%	85.8%
Set3	Statistical Feature-Based Information	1 Data Speed and Packet Speed.	98%	2.0%	98.0%
		2 Inter-arrival time Statistics (Min, Mean, Max and SD).	100%	1.5%	98.5%
		3 Packet size Statistics (Min, Mean, Max and SD)	100%	1.5%	98.5%
		4 Total Packet Calculation	100%	2.5%	97.5%
		5 Maximum Bytes Transmitted	100%	2.0%	98.0%
		6 Flow Length.	99.7%	2.0%	98.0%
		7 Flow size packets and bytes.	100%	2.0%	98.0%

4.4 Comparison with other proposals

Our experimental study used 17 distinct mainstream Internet traffic classes, whereas previous approaches deal with only a few. We used the Waikato Environment for Knowledge Analysis (WEKA) tool to compare the proposed approach with existing ones. Our multi-phased scheme works well because it groups all the distinct statistical behaviors of a traffic class with a high accuracy rate, whereas other algorithms identify only one or a few behaviors of a traffic class. The proposed traffic classification scheme is compared with Naive Bayes+Decision Tree (NB-DT), Finest-first Decision Tree (FFDT), kNN+Bayesian Network (kNN-BayesNet), k-Means+C5.0 (kM-C5.0), and K-Medoid+J48/C4.5 (kMD-C4.5DT).

To evaluate the Multi-Phased ML approach thoroughly, we considered alternative ML classifiers and assessed their suitability for per-flow traffic classification relative to the proposed method. The Weka suite was used to evaluate the five most commonly used machine learning algorithms in comparison with the proposed approach [48–55]. All classifiers used the same proportion of training and testing data pools (labeled with the individual application class). Half of the flows were used to train the respective classifier, and the remaining half were used for testing. We briefly describe the evaluated machine learning algorithms as follows.

4.4.1 Naive Bayes+Decision Tree(NB-DT)

NB-DT is a hybrid classifier that combines decision tables with naive Bayes and evaluates the benefit of dividing the available features into disjoint sets, one used by each algorithm. Using a forward selection search, the selected attributes are modeled using NB and a decision table (conditional probability table), and at each step unnecessary attributes are removed from the final model. The combined model reportedly performed better than naive Bayes and decision tables individually and was run with default parameters.

4.4.2 Finest-first Decision Tree(FFDT)

The Finest-first (best-first) decision tree (FFDT) uses binary splitting for nominal as well as numeric attributes and follows a top-down decision tree induction approach in which the best split is added at each step. In contrast to depth-first order, at every iterative tree-generation step the algorithm expands nodes in best-first order rather than in a fixed order. Both information gain and the Gini index are used to determine the best node in the tree-growing phase. The algorithm was run with post-pruning enabled and with the default value of 5 folds in pruning to improve the resulting classifier.
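The two split criteria named above can be illustrated with a short sketch that scores a candidate binary split by its reduction in impurity. The function names are hypothetical, and how the Weka implementation combines or selects between the two measures is not reproduced here.

```python
import math
from collections import Counter


def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def split_gain(labels, left, right, impurity):
    """Impurity of the parent minus the size-weighted impurity of the children.

    With impurity=entropy this is the information gain of the split;
    with impurity=gini it is the reduction in Gini index.
    """
    n = len(labels)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(labels) - weighted
```

A best-first tree builder would evaluate `split_gain` for every candidate split of every open node and expand the node whose best split gives the largest reduction.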

4.4.3 k-Nearest Neighbours+Bayesian Network(kNN-BayesNet)

The k-nearest neighbors (kNN) algorithm computes the Euclidean distance from each test sample to its k nearest neighbors in the n-dimensional feature space. The classifier selects the majority class label among the k nearest neighbors and assigns it to the test sample. A Bayesian network (BayesNet) is a directed acyclic graph that represents a set of features as its vertices and the probabilistic relationships among features as its edges. Because it uses Bayes' rule for probabilistic inference, BayesNet may outperform NB and yield better classification accuracy when the conditional independence assumption made in naive Bayes does not hold. The default parameters, that is, SimpleEstimator, were used to estimate the conditional probability tables of BayesNet in the Weka implementation on the training set.
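The kNN part of this description can be sketched in a few lines: compute the Euclidean distance from the test sample to every training sample, keep the k closest, and return the majority label. The function name and data layout are assumptions for illustration, the Bayesian network half of the hybrid is not shown, and `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples.

    train: list of (feature_vector, label) pairs.
    query: feature vector with the same number of dimensions.
    """
    if k < 1 or k > len(train):
        raise ValueError("k must be between 1 and the number of training samples")
    # Sort training samples by Euclidean distance to the query and keep k.
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```

Choosing an odd k avoids ties in two-class problems; with 17 classes, ties remain possible and this sketch resolves them by the order in which labels were first counted.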

4.4.4 k-Means+C5.0 Decision Tree(kM-C5.0)

The available labeled data is fed to K-Means clustering to group the content and auxiliary flows of each application. Each application is then labeled with the traffic that is genuinely application traffic and the other supplementary flows within it. These are the flows generated by the application, and together they represent the complete information about each application. C4.5 [40] is the predecessor of the C5.0 algorithm, which predicts the dependent feature set to produce the best traffic classification. Initially, the Cardinality Set Information (CSI) maintains the statistics of independent feature sets extracted and recorded from the college campus, which helps in building a distinct decision tree. Afterward, the clustered individual flow records are supplied to the C5.0 classifier for supervised learning to build the decision tree.

4.4.5 K-Medoid + J48 / C4.5 Decision Tree (kMD-C4.5DT)

Using the statistical features and their proportions, the K-Medoid clustering algorithm clusters similar flows with a novel function (distance matrix), and models (representatives) are created for each application. The J48/C4.5 decision tree builds a tree structure in which every node represents a test on a statistical feature, each branch represents an outcome of the test, and each leaf node represents a class label, that is, an application flow name in the current work. To use a decision tree for traffic classification, a given tuple of statistical features (which requires class prediction) walks through the decision tree from the root to a leaf. The label of the leaf node is the traffic classification result. The algorithm was run with default parameters in the Weka tool in the current experiment to improve the resulting decision tree.
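The root-to-leaf walk described above can be sketched as follows. The tree is represented as nested dictionaries, and the feature names, thresholds, and class labels in `EXAMPLE_TREE` are invented for illustration; they are not taken from the trained J48/C4.5 model.

```python
def classify(node, features):
    """Walk a decision tree from the root to a leaf and return the leaf's label.

    Internal nodes test one numeric feature against a threshold; leaves
    carry a "label" key with the predicted traffic class.
    """
    while "label" not in node:
        go_left = features[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]


# Hypothetical tree: feature names and thresholds are made up for this sketch.
EXAMPLE_TREE = {
    "feature": "mean_packet_size",
    "threshold": 200.0,
    "left": {"label": "DNS"},
    "right": {
        "feature": "mean_inter_arrival",
        "threshold": 0.05,
        "left": {"label": "YouTube"},
        "right": {"label": "HTTP"},
    },
}
```

Each call touches at most one node per tree level, so classifying a flow is inexpensive once the tree has been built.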

Moreover, our proposed scheme outperforms the other classification algorithms when new traffic classes exist. The other classification algorithms fail to recognize newer applications and therefore show lower performance in the training stage. The accuracy rates of the different machine learning algorithms are compared in Fig. 8. Four test cases are used for each algorithm: 1 × 10⁵, 1.5 × 10⁵, 2.5 × 10⁵, and 3 × 10⁵ traffic flows. The accuracy rate increases with larger test cases for every classification algorithm.

Overall, the Multi-Phased methodology achieved better per-flow classification than the alternative techniques, while for a few applications (flow types) the classification accuracy was almost equal. For game setup flows, the accuracy is the highest. For game control flows, alternative approaches such as kNN-BayesNet and kM-C5.0 give a better percentage of correctly identified flows. This was observed earlier when assessing the sensitivity of the Multi-Phased classifier and was mostly due to misclassification errors (of game control) with the web browsing flows. kNN-BayesNet and kM-C5.0, however, give lower accuracy than Multi-Phased ML for browsing and streaming flows. Also, for the streaming application tier, the kMD-C4.5DT based approach yielded highly accurate results comparable to those of the Multi-Phased machine learning approach, while it yielded minimal accuracy when the email tier was examined. For the communication application flows, almost all classifiers except FFDT (83%) gave correct classification results (93%). This was mainly due to the predictive ability of the flow parameters for this set of applications. For torrent-based flows, NB-DT gave almost 90.99% flow identification ability for torrent control flows because of confusion with game control and browsing flows. Hence, while one approach might be suitable for identifying particular traffic flows, similarly high accuracy will probably not be achieved for a different application using the same classifier. In terms of overall accuracy, Multi-Phased ML gave a much more coherent and applicable result, with 98.466% correctly classified records.

The NB-DT and FFDT algorithms achieve accuracies of 90.9% and 93.38%, respectively. The kNN-BayesNet and kM-C5.0 algorithms (SA) show higher accuracies than these two, at 95% and 95.94%. Finally, the kMD-C4.5DT algorithm achieves an accuracy of 96.12%, and the proposed work (MSC) outperforms all of the above with a 98.46% accuracy rate in traffic classification.

4.5 Discussions

We empirically studied the performance of the proposed method through a large number of experiments on the AU-IDS traffic datasets. Some important observations and analyses of the experimental results are as follows.

The proposed method outperforms previous supervised traffic classification methods when unknown applications are present in real-world traffic datasets. Previous traffic classification methods would classify unknown flows into predefined known classes, which leads to low classification performance, especially when the number of unknown applications is large. In contrast, the proposed method can identify unknown flows and classify known flows accurately.

The proposed approach uses Application Individual Flow Cluster Grouping, which is essentially an automatic labeling technique based on the 5-tuple. This grouping is one reason why the proposed approach works so well when only a few supervised samples are available. The other reason is compound classification, which can jointly classify correlated flows more accurately. Standard supervised methods perform poorly even without unknown traffic if the supervised training set is too small. Since flow label propagation is independent of the classification algorithm, it can later be used as a preprocessing step with any supervised method to increase the size of the supervised training set. However, it should be pointed out that the key concern of this paper is unknown traffic. Application Individual Flow Cluster Grouping alone cannot deal with unknown traffic. We therefore proposed a semi-supervised scheme that combines the offline and online phases to handle unknown traffic effectively. It should also be pointed out that this paper does not address traffic classification across networks. In this work, all methods are intended to deal with unknown applications on a single network.

4.5.1 Application Background

The present work relates to the classification of network traffic for the purposes of analysis, reporting, and control and, more particularly, to methods, apparatuses, and systems that facilitate the identification and classification of web services network traffic.

Enterprises have become increasingly dependent on computer network infrastructures to provide services and accomplish mission-critical tasks. Indeed, the performance, security, and efficiency of these network infrastructures have become critical as enterprises increase their reliance on distributed computing environments and wide area computer networks.

Web services networks are a rapidly evolving technology architecture that allows applications to tap into a variety of services in an extremely efficient and cost-effective manner. Web services enable cost-effective and efficient collaboration among entities within an enterprise or across enterprises. Web services are URL- or IP-addressable resources that exchange data and execute processes. They are applications exposed as services over a computer network and used by other applications through Internet standard technologies such as XML, SOAP, and WSDL. Accordingly, Web applications can be quickly and efficiently assembled from services available within an enterprise WAN or from external services available over open computer networks such as the Internet.

Indeed, an increasing number of network applications use data compression, encryption technology, and/or proprietary protocols that obscure or prevent identification of other application-specific attributes, often leaving well-known port numbers as the only basis for classification. In fact, as networked applications become increasingly complex, data encryption and/or compression has been adopted for security or optimization. Data encryption addresses security and privacy concerns but makes it much more difficult for intermediate network devices to identify the applications that use it.

In light of the foregoing, there is a need to increase the efficiency and performance of network traffic classification. There is also a need to reduce the resource requirements associated with network traffic classification. The present work substantially addresses these needs.

6 Conclusion

In this manuscript, we presented a representative solution for identifying and classifying various types of application traffic on the University campus. The proposed technique is a semi-supervised approach that combines unsupervised cluster analysis (the offline Pre-Learning Phase) with a compound supervised method (the Online Classification Phase). It aims to classify network traffic accurately and to handle newer applications. Traffic data from 17 well-known Internet applications were collected from the different servers of AU-IDS. Invalid packets were omitted on the basis of payload size, and only valid packet counts were taken for each application to reduce time and space complexity. The offline Pre-Learning Phase finds application models that are distinct from those of other applications. The steady traffic flows are separated by Advanced K-Medoid clustering into their respective application traffic classes, which are then used to train the Advanced C5.0 classifier to classify new online traffic accurately. The experimental results show that our proposed strategy (MSC) can classify the 17 popular real-world traffic classes captured on our University campus with an accuracy of 98%. The performance evaluation revealed that the proposed work outperforms classification methods such as Naive Bayes, Semi-Supervised, and C4.5. Our future work is to extend the proposed procedure to improve the ground truth data in real-world situations.

References

1. Henderson, Kotz and Abyzov, “The changing usage of a mature campus-wide wireless network,” Computer Networks 52(14) (2008), 2690–2712.

2. Zink, Suh, Gu and Kurose, “Characteristics of YouTube network traffic at a campus network: measurements, models and implications,” Computer Networks 53(4) (2009), 501–514.

3. Service Name and Transport Protocol Port Number Registry (IANA), [Online]. Available: http://www.iana.org/assignments/portnumbers as of October, 2014.

4. Fraleigh, Moon, Lyles, Cotton, Khan, Moll, Rockell, Seely and Diot, “Packet-level traffic measurements from the Sprint IP backbone,” IEEE Network (2003).

5. Karagiannis, Broido, Brownlee, Claffy K.C. and Faloutsos, “Is P2P dying or just hiding?” in: IEEE GLOBECOM, November 2004.

6. Karagiannis, Broido, Faloutsos and Claffy K.C., “Transport layer identification of P2P traffic,” in: Internet Measurement Conference (IMC), October 2004.

7. Sen, Spatscheck and Wang, “Accurate, scalable in-network identification of P2P traffic using application signatures,” in: WWW2004, May 2004.

8. Roesch, “SNORT: Lightweight intrusion detection for networks,” in: LISA ’99: Proceedings of the 13th USENIX Conference on Systems Administration, November 1999.

9. Paxson, “Bro: a system for detecting network intruders in real time,” Computer Networks, 1999.

10. Karagiannis, Papagiannaki and Faloutsos, “BLINC: multilevel traffic classification in the dark,” SIGCOMM Comput. Commun. Rev. 35 (2005), pp. 229–240, Aug. 2005.

11. Zhang Z.-L. and Bhattacharyya, “Profiling Internet Backbone Traffic: Behavior models and applications,” SIGCOMM Comput. Commun. Rev., Aug. 2005.

12. Karagiannis, Papagiannaki, Taft and Faloutsos, “Profiling the end host,” in Proceedings of the 8th International Conference on Passive and Active Network Measurement, Springer, 2007.

13. Mori, Kawahara, Hasegawa and Shimogawa, “Characterizing traffic flows originating from large-scale video sharing services,” in Proceedings of the Second International Conference on Traffic Monitoring and Analysis, Springer, 2010.

14. Yoon S.-H., Park J.-W., Park J.-S., Oh Y.-S. and Kim M.-S., “Internet application traffic classification using fixed IP-port,” in Proceedings of the 12th Asia-Pacific Network Operations and Management Conference on Management Enabling the Future Internet for Changing Business and New Computing Services, Springer, 2009.

15. Nguyen T.T. and Armitage, “A survey of techniques for internet traffic classification using machine learning,” IEEE Commun. Surveys & Tutorials 10(4), pp. 56–76, Fourth Quarter 2008.

16. Callado, Kamienski, Stênio Fernandes, Djamel Sadok, Géza Szabó and Balázs Péter Gerő, “A Survey on Internet Traffic Identification and Classification,” IEEE, 2007.

17. Zander, Nguyen and Armitage, “Automated traffic classification and application identification using machine learning,” in Proc. 2005 IEEE Conference on Local Computer Networks, pp. 250–257.

18. Yingqiu, Wei and Yunchun, “Network traffic classification using K-means clustering,” in Proceedings of the 2nd International Multi-Symposiums on Computer and Computational Sciences (IMSCCS ’07), pp. 360–365, August 2007.

19. McGregor, Hall, Brunskill and Lorier, “Flow clustering using machine learning techniques,” in Proc. 2004 Passive and Active Measurement Workshop, pp. 205–214.

20. Bernaille, Akodkenou, Teixeira, Soule and Salamatian, “Traffic classification on the fly,” SIGCOMM Comput. Commun. Rev. 36, pp. 23–26, Apr. 2006.

21. Erman, Arlitt and Mahanti, “Traffic classification using clustering algorithms,” in Proc. 2006 SIGCOMM Workshop on Mining Network Data, pp. 281–286.

22. Liu and Lung, “P2P traffic identification and optimization using fuzzy c-means clustering,” in IEEE International Conference on Fuzzy Systems, 2011, pp. 2245–2252.

23. Wang, Xiang and Yu S.-Z., “An automatic application signature construction system for unknown traffic,” Concurrency Computat.: Pract. Exper. 22 (2010), pp. 1927–1944.

24. Finamore, Mellia and Meo, “Mining unclassified traffic using automatic clustering techniques,” in Proc. 2011 TMA International Workshop on Traffic Monitoring and Analysis, pp. 150–163.

25. Erman, Mahanti, Arlitt and Williamson, “Identifying and discriminating between web and peer-to-peer traffic in the network core,” in WWW ’07: Proceedings of the 16th International Conference on World Wide Web, Banff, Alberta, Canada: ACM Press, May 2007, pp. 883–892.

26. Moore A.W. and Zuev, “Internet traffic classification using Bayesian analysis techniques,” SIGMETRICS Perform. Eval. Rev. 33, pp. 50–60, June 2005.

27. Kim, Claffy, Fomenkov, Barman, Faloutsos and Lee, “Internet traffic classification demystified: myths, caveats, and the best practices,” in Proc. 2008 ACM CoNEXT Conference, pp. 1–12.

28. Auld, Moore A.W. and Gull S.F., “Bayesian neural networks for Internet traffic classification,” IEEE Trans. Neural Netw. 18(1), pp. 223–239, Jan. 2007.

29. Haffner, Sen, Spatscheck and Wang, “ACAS: automated construction of application signatures,” in MineNet ’05: Proceedings of the 2005 ACM SIGCOMM Workshop on Mining Network Data, Philadelphia, Pennsylvania, USA: ACM Press, August 2005, pp. 197–202.

30. Bernaille and Teixeira, “Early recognition of encrypted applications,” in Proc. 2007 International Conference on Passive and Active Network Measurement, pp. 165–175.

31. Bonfiglio, Mellia, Meo, Rossi and Tofanelli, “Revealing Skype traffic: when randomness plays with you,” in Proc. 2007 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, pp. 37–48.

32. Este, Gringoli and Salgarelli, “Support vector machines for TCP traffic classification,” Computer Networks 53(14), pp. 2476–2490, Sept. 2009.

33. Crotti, Dusi, Gringoli and Salgarelli, “Traffic classification through simple statistical fingerprinting,” SIGCOMM Comput. Commun. Rev. 37, pp. 5–16, Jan. 2007.

34. Valenti, Rossi, Meo, Mellia and Bermolen, “Accurate, fine grained classification of P2P-TV applications by simply counting packets,” in Proc. 2009 International Workshop on Traffic Monitoring and Analysis, pp. 84–92.

35. But, Williams, Zander, Stewart and Armitage, “ANGEL - Automated Network Games Enhancement Layer,” in Proceedings of 5th Annual Workshop on Network and Systems Support for Games (Netgames), Singapore, October 2006.

36. C-N., Huang C-Y., Lin Y.-D. and Lai Y.-C., “Session Level Flow Classification by Packet Size Distribution and Session Grouping,” Computer Networks, 2011.

37. Bujlow, Riaz and Pedersen J.M., “A method for classification of network traffic based on C5.0 machine learning algorithm,” in Proceedings of the International Conference on Computing, Networking and Communications (ICNC ’12), pp. 237–241, Maui, Hawaii, USA, February 2012.

38. Carela-Espanol, Barlet-Ros, Mula-Valls and Sole-Pareta, “An Automatic Traffic Classification System for Network Operation and Management,” Springer, October 2013.

39. Erman, Mahanti and Arlitt, “Internet traffic identification using machine learning,” in Proc. 2006 IEEE Global Telecommunications Conference, pp. 1–6.

40. Mohammed A.B. and Nor S.M., “Near real time online flow based internet traffic classification using machine learning (C4.5),” International Journal of Engineering 3(4) (2009), 370–379.

41. Bakhshi and Ghita, “On Internet Traffic Classification: A Two-Phased Machine Learning Approach,” Journal of Computer Networks and Communications, vol. 2016.

42. Zhang, Chen, Xiang, Zhou and Wu, “Robust network traffic classification,” IEEE/ACM Transactions on Networking 23(4) (2015), pp. 1257–1270.

43. Erman, Mahanti, Arlitt, Cohen and Williamson, “Semisupervised network traffic classification,” in Proceedings of the ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS ’07), San Diego, Calif., USA, June 2007, Performance Evaluation Review 35(1) (2007), 369–370.

44. Hall, Frank, Holmes, Pfahringer, Reutemann and Witten I.H., “The WEKA data mining software: an update,” ACM SIGKDD Explorations Newsletter 11(1) (2009), 10–18.

45. The CAIDA UCSD, Anonymized OC48 Internet Traces Dataset, http://www.caida.org/data/dataset-URL

46. Dusi, Gringoli and Salgarelli, “Quantifying the accuracy of the ground truth associated with Internet traffic traces,” Elsevier Computer Networks 55(5), pp. 1158–1167, April 2011.

47. Gringoli, Salgarelli, Dusi, Cascarano, Risso and Claffy K.C., “GT: picking up the truth from the ground for Internet traffic,” ACM SIGCOMM Computer Communication Review 39(5), pp. 13–18, Oct. 2009.

48. Baris, Amac Guvensan, Gokhan Yavuz, Elif Karsligil, “Application identification via network traffic classification,” in 2017 International Conference on Computing, Networking and Communications (ICNC), pp. 843–848. IEEE, 2017.

49. Jenefa, Vinodh S.E., “Application Identification using Supervised Clustering Method,” International Journal of Engineering Research and Applications (IJERA), ISSN 2248-9622 (2013).

50. Jenefa and BalaSingh Moses, “Multi level statistical classification of network traffic,” in Inventive Computing and Informatics (ICICI), International Conference on, pp. 564–569. IEEE, 2017.

51. Jenefa and BalaSingh Moses, “An Upgraded C5.0 Algorithm for Network Application Identification,” in 2018 2nd International Conference on Trends in Electronics and Informatics (ICOEI), pp. 789–794. IEEE, 2018.

52. Coulter, Han Q-L., Pan, Zhang, Xiang, “Data-driven cyber security in perspective–intelligent traffic analysis,” IEEE Transactions on Cybernetics (2019).

53. Qin, Lei, Bai, Zhang, “Towards a profiling view for unsupervised traffic classification by exploring the statistic features and link patterns,” in Proceedings of the 2019 Workshop on Network Meets AI & ML (2019), pp. 50–56.

54. Coulter, Han Q-L., Pan, Zhang, Xiang, “Data-driven cyber security in perspective–intelligent traffic analysis,” IEEE Transactions on Cybernetics (2019).

55. Zhao, Cai, Yu, Xu and Meng, “A novel network traffic classification approach via discriminative feature learning,” in Proceedings of the 35th Annual ACM Symposium on Applied Computing (2020), pp. 1026–1033.

Sl. No	Feature set	Approach	Parameters used by the technique
1	Set1	IP-based traffic identification
			1 Source IP address, destination IP address
2	Set2	Service-based traffic identification
			1 Source IP address, destination IP address
			2 Source port (number, label), destination port (number, label)
			3 Transport-layer protocol (TCP, UDP)
3	Set3	Statistical feature-based information
			1 Data speed and packet speed
			2 Inter-arrival time statistics (minimum, mean, maximum and standard deviation)
			3 Packet size statistics (minimum, mean, maximum and standard deviation)
			4 Total packet count
			5 Maximum bytes transmitted
			6 Flow length
			7 Flow size (packets and bytes)
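
To make the Set3 parameters concrete, the following Python sketch computes most of them for a single flow. It is an illustration only, not the paper's implementation. The input format (a list of `(timestamp_seconds, size_bytes)` tuples) and the names `summarize` and `flow_features` are assumptions introduced here. Flow length is taken to be the time between the first and last packet, and the standard deviation is the population form. "Maximum bytes transmitted" is left out because the table does not define how it is measured.

```python
from statistics import mean, pstdev


def summarize(values):
    """Return min, mean, max and population standard deviation of a sequence."""
    if not values:
        return {"min": 0.0, "mean": 0.0, "max": 0.0, "std": 0.0}
    return {
        "min": float(min(values)),
        "mean": float(mean(values)),
        "max": float(max(values)),
        "std": float(pstdev(values)),
    }


def flow_features(packets):
    """Compute Set3-style statistics for one flow.

    `packets` is a hypothetical input format: a list of
    (timestamp_seconds, size_bytes) tuples belonging to the same flow.
    """
    if not packets:
        raise ValueError("flow has no packets")
    packets = sorted(packets)  # order by timestamp
    times = [t for t, _ in packets]
    sizes = [s for _, s in packets]
    duration = times[-1] - times[0]  # assumed meaning of "flow length"
    gaps = [b - a for a, b in zip(times, times[1:])]  # inter-arrival times
    total_bytes = sum(sizes)
    return {
        "packet_count": len(packets),
        "byte_count": total_bytes,
        "duration": duration,
        # Rates are reported as 0.0 for single-packet or zero-length flows.
        "bytes_per_second": total_bytes / duration if duration > 0 else 0.0,
        "packets_per_second": len(packets) / duration if duration > 0 else 0.0,
        "inter_arrival": summarize(gaps),
        "packet_size": summarize(sizes),
    }


# Three packets observed at t = 0 s, 1 s and 3 s.
example = [(0.0, 100), (1.0, 300), (3.0, 200)]
features = flow_features(example)
```

Each flow then yields one feature vector that can be passed to a clustering or classification stage. Guarding the rate calculations against a zero duration keeps single-packet flows from raising a division error.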