Abstract
Network Traffic Classification (NTC) is an important technology for network management, traffic control, security detection and related tasks. With the development of high-speed, large-scale complex networks, NTC faces challenges in storing and processing massive volumes of network traffic. Although a few NTC approaches are based on cloud computing, their parallel computing models have not received enough attention. In this paper, based on Selective Ensemble and Diversity Measures, we propose a novel Parallelized Network Traffic Classification framework (PNTC-SE-DM), which processes large-scale network traffic data in parallel on the MapReduce architecture. In particular, PNTC-SE-DM uses a new method for selecting the classifiers for ensemble classification that takes into account both the prediction accuracy of each single classifier and the diversity among the classifiers. The experimental results demonstrate that the new approach can handle large-scale network traffic data and performs favorably in terms of speedup, sizeup and accuracy.
Introduction
Network Traffic Classification (NTC) [10] has a number of benefits: it enhances network controllability, helps researchers understand the flow distribution on the network, assists Internet Service Providers (ISPs) in improving Quality of Service (QoS), helps ensure network security, and so on. However, high-speed networks and large-scale Internet data pose a series of challenges for NTC. In particular, the gap between the rapidly growing volume of network traffic data and the capacity of a single-node system is becoming increasingly severe.
Recently, some alternative methods based on statistics and Machine Learning (ML) were proposed [16,20], which make classification more flexible and effective in practical network environments. However, the majority of these methods rely on a single classifier to perform traffic classification. Ensemble methods [13] with multiple classifiers show higher accuracy than those with a single classifier. Several ensemble learning algorithms exist, such as bagging [18] and boosting [11]. Related work shows, however, that building the ensemble from only part of the classifiers, called selective ensemble [17], yields better generalization performance than using all of them. Researchers therefore focus in particular on the selection of base classifiers.
MapReduce [6] is a parallel programming model proposed by Google for handling massive data. It shields the underlying implementation details and thereby reduces the difficulty of parallel programming. Thus, it is interesting to use the MapReduce framework to perform classification and ensemble integration on massive network traffic data.
In this paper, a novel Parallelized Network Traffic Classification framework based on Selective Ensemble and Diversity Measures (PNTC-SE-DM) is proposed. It handles massive network traffic data on a cloud platform effectively. Experimental results show that the new method processes data more efficiently in a clustered environment than in a stand-alone environment, and achieves good generalization ability, speedup and sizeup.
The rest of the paper is organized as follows. Section 2 reviews previous studies related to our work. Section 3 introduces the concept of classifier diversity measures. Section 4 presents the novel Parallelized Network Traffic Classification framework based on Selective Ensemble and Diversity Measures (PNTC-SE-DM). Experimental results are reported in Section 5. Finally, Section 6 concludes the paper.
Related work
NTC is considered the most fundamental functionality in modern network management and security systems. Classifier ensemble-based techniques have been discussed for NTC because single-classifier techniques rarely achieve good overall performance. A new ensemble for fast prediction, constructed from all base classifiers, was investigated in [22]; it adapted well to new trends and patterns underlying data streams. Later, Zhang et al. described an aggregate ensemble learning framework for building an ensemble model [23], and their results showed that the proposed method was suitable for accurate classification in noisy data streams.
In fact, even if classifiers are trained accurately at a given time, their accuracy may still degrade when the characteristics of the network traffic change. Moreover, based on dynamic traffic detection, Wang et al. [19] and Farid et al. [8] each built traffic classification systems whose core technique is ensemble classification.
Note that classifier ensemble-based techniques usually use all base classifiers for classification. It has been reported that selecting a subset of the base classifiers gives better classification performance than using all of them. Moreover, it is widely accepted that diversity among base classifiers is pivotal to the success of an ensemble learning system. Li et al. [14] introduced a novel diversity regularized ensemble pruning method (called DREP), which greatly improved the generalization capability of ensemble classifiers. Subsequently, Yin et al. [21] discussed a novel ensemble construction technique (called RotEasy), which enhances the diversity between component classifiers.
On the other hand, with the rapid development of the Internet, performance becomes a major challenge when classifying such huge volumes of network traffic. Many integrated parallel frameworks have been applied to NTC. In particular, using packet-level properties of network traffic flows, Mu et al. [15] implemented a parallelized network traffic classification method. Moreover, Cai et al. [3] described a Hadoop-based traffic analysis system. At Cloud 2012, a novel class-based micro-classifier ensemble classification technique (MCE) for classifying data streams was described in [1], which sketches a cloud-based solution of class-based ensembles to handle a large number of classes. Later, an intelligent approach for building a number of AdaBoost ensembles with MapReduce was presented in [9].
Hadoop is an open-source framework for large-scale data-intensive computing applications [2]. Inspired by Google’s work, it mainly consists of the Hadoop Distributed File System (HDFS) and the MapReduce architecture. HDFS splits a large file into several blocks and stores these blocks on multiple nodes. In contrast to HDFS, MapReduce is a distributed programming paradigm for cloud computing environments introduced by Dean and Ghemawat [5], which divides a job into several tasks and tracks the execution of each task on the computing nodes. MapReduce has two primary functions: Map and Reduce. The Map phase generates key-value pairs, which are passed on to the Reduce phase, where the values are grouped by key and processed by the Reducer. That is, the MapReduce framework operates exclusively on <key, value> pairs: the input to the framework is a set of <key, value> pairs, and the output produced by the job is another set of <key, value> pairs.
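The <key, value> flow described above can be illustrated with a minimal single-process sketch. It is not Hadoop code: the flow records, the function names (map_fn, reduce_fn, run_job) and the class labels are hypothetical and serve only to show how Map emits pairs, how values are grouped by key, and how Reduce produces the output pairs.

```python
from collections import defaultdict

# Hypothetical flow records as <flow_id, application_class> pairs (not from the paper).
FLOWS = [(1, "WWW"), (2, "MAIL"), (3, "WWW"), (4, "P2P"), (5, "WWW")]


def map_fn(key, value):
    """Map: emit one <class, 1> pair for each flow record."""
    yield value, 1


def reduce_fn(key, values):
    """Reduce: sum all counts that share the same key."""
    yield key, sum(values)


def run_job(records, mapper, reducer):
    """Single-process stand-in for the map, shuffle and reduce phases."""
    groups = defaultdict(list)
    for k, v in records:
        for out_k, out_v in mapper(k, v):
            groups[out_k].append(out_v)  # shuffle: group values by key
    result = {}
    for k in sorted(groups):
        for out_k, out_v in reducer(k, groups[k]):
            result[out_k] = out_v
    return result


CLASS_COUNTS = run_job(FLOWS, map_fn, reduce_fn)
```

In a real Hadoop job the grouping step is performed by the framework between the Map and Reduce phases, and the tasks run on different nodes.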
Classifier diversity measures
Dietterich [7] first proposed the concept of classifier diversity measures and indicated that classification performance can be improved when there is diversity among the classifiers. Diversity measures have since become a hot topic in multi-classifier systems. Kuncheva [12] summarized 10 diversity measures, including the Q statistic, the correlation, the disagreement and the double fault. According to studies of these measures, they yield similar accuracy, and diversity measures are widely used to quantify the diversity among classifiers because of their good understandability and stability.
According to the diversity measures, the diversity between two classifiers
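As an illustration of the pairwise measures named above, the following sketch computes the disagreement, the double fault and the Q statistic from two correctness vectors (1 if the classifier labels a test sample correctly, 0 otherwise), using their standard textbook definitions. The function names and the example vectors are hypothetical, and returning 0.0 when the Q statistic's denominator is zero is a convention assumed here.

```python
def contingency(correct_i, correct_j):
    """Count samples by whether classifiers i and j are correct (1) or wrong (0)."""
    n11 = n10 = n01 = n00 = 0
    for a, b in zip(correct_i, correct_j):
        if a and b:
            n11 += 1      # both correct
        elif a:
            n10 += 1      # only i correct
        elif b:
            n01 += 1      # only j correct
        else:
            n00 += 1      # both wrong
    return n11, n10, n01, n00


def disagreement(correct_i, correct_j):
    """Fraction of samples on which exactly one of the two classifiers is correct."""
    n11, n10, n01, n00 = contingency(correct_i, correct_j)
    return (n10 + n01) / (n11 + n10 + n01 + n00)


def double_fault(correct_i, correct_j):
    """Fraction of samples that both classifiers get wrong."""
    n11, n10, n01, n00 = contingency(correct_i, correct_j)
    return n00 / (n11 + n10 + n01 + n00)


def q_statistic(correct_i, correct_j):
    """Yule's Q; 0.0 is returned for a zero denominator (assumed convention)."""
    n11, n10, n01, n00 = contingency(correct_i, correct_j)
    denominator = n11 * n00 + n01 * n10
    return (n11 * n00 - n01 * n10) / denominator if denominator else 0.0


# Hypothetical correctness vectors for two classifiers on eight test samples.
A = [1, 1, 1, 0, 0, 1, 0, 1]
B = [1, 0, 1, 1, 0, 1, 0, 0]
```

For these vectors the contingency counts are 3 (both correct), 2 (only A), 1 (only B) and 2 (both wrong). A higher disagreement indicates a more diverse pair, whereas lower Q and double-fault values indicate more diversity.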
In Eq. (1),
To choose suitable classifiers, each classifier must be compared with the others to obtain the diversity values. The matrix D is obtained by measuring the diversity between m (
Each diversity value in matrix D is compared with the average value, and the classifiers with higher diversity values are selected to participate in the ensemble. The number of comparisons is
To reduce the number of comparisons, we further improve this method.
The matrix D can be optimized as follows:
The strategy of selecting classifiers is: selecting classifier
PNTC-SE-DM with MapReduce
PNTC-SE-DM combines the ideas of selecting the best and eliminating the poor. First, base classifiers with high classification accuracy are selected by discarding those that have a negative impact on ensemble learning. Then the correlated or redundant base classifiers are removed, so that the remaining classifiers used for the ensemble have better generalization ability and greater diversity. Finally, the prediction result is obtained by majority vote. The classifiers show high reliability and scalability under the distributed framework of Hadoop. In addition, the Map stage selects the base classifiers with high accuracy, while the Reduce stage selects the base classifiers with great diversity. The MapReduce architecture of PNTC-SE-DM is shown in Fig. 1.

Fig. 1. The MapReduce architecture of PNTC-SE-DM.
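The final majority-vote step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name majority_vote and the example labels are hypothetical, and ties are resolved in favor of the label that appears first among the votes, which is an assumption.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one voted label per sample.

    predictions: one list of predicted labels per selected base classifier,
    all of the same length. Ties go to the label encountered first.
    """
    voted = []
    for sample_votes in zip(*predictions):
        voted.append(Counter(sample_votes).most_common(1)[0][0])
    return voted


# Hypothetical predictions of three selected base classifiers on three flows.
PREDS = [
    ["WWW", "MAIL", "P2P"],
    ["WWW", "P2P", "P2P"],
    ["MAIL", "MAIL", "P2P"],
]
VOTED = majority_vote(PREDS)
```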
The architecture of PNTC-SE-DM contains two stages, the Map stage and the Reduce stage, which are described as follows.
In the Map stage, the Map function first receives the <key, value> pairs provided by HDFS, trains the base classifiers with a decision tree algorithm, tests them on a test set, and derives the corresponding prediction vectors. Second, it sets the accuracy threshold λ and then selects the base classifiers

Map_Algorithm
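The accuracy filter of the Map stage can be sketched as below. Training the decision trees is omitted: the prediction vectors are hypothetical stand-ins for the output of trained base classifiers, and the names map_select and lam (for the threshold λ) are introduced for illustration. Keeping a classifier when its accuracy is greater than or equal to λ is an assumption, since the exact comparison is not stated here.

```python
def accuracy(predicted, actual):
    """Fraction of test samples whose predicted label matches the true label."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)


def map_select(classifier_id, predicted, actual, lam):
    """Emit <classifier_id, prediction vector> only if accuracy reaches lam."""
    if accuracy(predicted, actual) >= lam:
        yield classifier_id, list(predicted)


# Hypothetical test labels and prediction vectors of three base classifiers.
TEST_LABELS = ["WWW", "MAIL", "WWW", "P2P", "WWW"]
PREDICTIONS = {
    "c1": ["WWW", "MAIL", "WWW", "P2P", "MAIL"],  # 4 of 5 correct
    "c2": ["WWW", "WWW", "WWW", "WWW", "WWW"],    # 3 of 5 correct
    "c3": ["WWW", "MAIL", "WWW", "P2P", "WWW"],   # 5 of 5 correct
}

KEPT = dict(
    pair
    for cid, vec in PREDICTIONS.items()
    for pair in map_select(cid, vec, TEST_LABELS, lam=0.8)
)
```

With λ = 0.8, classifier c2 is discarded and the prediction vectors of c1 and c3 are passed on to the Reduce stage.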
In the Reduce stage, the Reduce function aggregates the outputs generated by all Map tasks. First, it calculates the diversity between classifiers from their prediction vectors. Second, it obtains the improved diversity matrix

Reduce_Algorithm
After the Reduce job, we obtain base classifiers with high accuracy and great diversity for integration into the ensemble.
In this section, we first describe the experimental environment and the dataset used in our experiment, and then present the experimental method and our analysis.
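The diversity-based selection of the Reduce stage can be sketched under stated assumptions. The sketch uses the pairwise disagreement measure on correctness vectors and keeps every classifier whose mean disagreement with the others is at least the overall mean. This is one plausible reading of the comparison with the average value and does not reproduce the improved matrix of the paper. All names and vectors are hypothetical, and at least two classifiers are assumed.

```python
from itertools import combinations


def pair_disagreement(correct_i, correct_j):
    """Fraction of samples on which exactly one of two classifiers is correct."""
    return sum(a != b for a, b in zip(correct_i, correct_j)) / len(correct_i)


def diversity_matrix(vectors):
    """Symmetric matrix D of pairwise disagreement, stored as nested dicts."""
    ids = sorted(vectors)
    d = {i: {j: 0.0 for j in ids} for i in ids}
    for i, j in combinations(ids, 2):  # each unordered pair is measured once
        d[i][j] = d[j][i] = pair_disagreement(vectors[i], vectors[j])
    return d


def select_diverse(vectors):
    """Keep classifiers whose mean diversity is at least the overall mean (assumed rule)."""
    d = diversity_matrix(vectors)
    ids = sorted(vectors)
    m = len(ids)
    mean_per = {i: sum(d[i][j] for j in ids if j != i) / (m - 1) for i in ids}
    overall = sum(mean_per.values()) / m
    return [i for i in ids if mean_per[i] >= overall]


# Hypothetical correctness vectors (1 = correct) of four classifiers on six samples.
VECTORS = {
    "c1": [1, 1, 1, 0, 1, 0],
    "c2": [1, 1, 1, 0, 1, 1],
    "c3": [0, 1, 1, 1, 0, 1],
    "c4": [1, 0, 1, 1, 1, 0],
}
SELECTED = select_diverse(VECTORS)
```

Because D is symmetric, only the m(m-1)/2 pairs above the diagonal need to be measured, which is why the sketch iterates over unordered pairs.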
The experimental environment and dataset
Experimental environment: The Hadoop cluster on our cloud platform is composed of four nodes. Two of the nodes serve as Namenode/Datanode and Secondary Namenode/Datanode, respectively, while the others serve as Datanodes. The Namenode and Secondary Namenode maintain the file system tree and the metadata for all files. They are also responsible for the task scheduling of MapReduce. A Datanode is a worker node of the file system, which is mainly involved in storage and computation. The configuration of the nodes is shown in Table 1.
Node configuration
Dataset: The dataset of [4], called Moore-set, is used in the experiment. The dataset is expanded by a factor of six, to a total of 2,265,156 network flow samples, and it is divided into 10 classes.
In this section, we evaluate the performance of the proposed algorithm with respect to speedup, sizeup and accuracy; the following subsections discuss these measurements and the results. To demonstrate that the MapReduce programming model is suitable for processing massive data, this paper compares the performance of PNTC-SE-DM in a clustered environment with that in a stand-alone environment.
Speedup. To measure the improvement in time performance of the proposed algorithm more precisely, speedup is used to describe the performance gained from the shorter runtime of the parallel algorithm. Speedup is expressed as follows:
Figure 2 shows the speedup for different numbers of cores (8/16/24/32). The results show that the speedup improves as the number of cores increases. In addition, the speedup is unsatisfactory for small datasets. Moreover, the speedup is best when the dataset contains 0.9 million flows. It is clear that the MapReduce parallel programming model remarkably improves the efficiency of classifying massive network traffic.
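In its conventional form, which is assumed here, speedup on p cores is the ratio of the runtime on a single core to the runtime on p cores for the same dataset. The symbols T_1, T_p and p are notation introduced for this sketch:

```latex
\mathrm{Speedup}(p) = \frac{T_{1}}{T_{p}}
```

Under this definition, ideal (linear) speedup corresponds to Speedup(p) = p.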

Fig. 2. The curve of speedup.
Sizeup. Sizeup measures the capability of a parallel algorithm to handle data growth. It evaluates how much longer it takes to execute the parallel tasks when the data size is increased by a factor of m. Sizeup can be expressed as follows:
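In its conventional form, which is assumed here, sizeup is the ratio of the runtime on a dataset m times larger to the runtime on the original dataset D, with the number of cores held fixed. The symbols T_D and T_{m×D} are notation introduced for this sketch:

```latex
\mathrm{Sizeup}(D, m) = \frac{T_{m \times D}}{T_{D}}
```

Under this definition, a sizeup value below m means that the runtime grows more slowly than the data size.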

Fig. 3. The curve of sizeup.
Accuracy. To verify the generalization ability of PNTC-SE-DM, five different data sets (testset1 to testset5) selected from the test set serve as test objects, and a comparison of PNTC, PNTC-SE and PNTC-SE-DM is shown in Fig. 4. Note that the classification accuracy of PNTC is poor because all base classifiers are included in the ensemble, so that some redundant base classifiers have a negative effect on the classification results. In contrast, PNTC-SE removes some redundant classifiers and improves the classification accuracy significantly. The PNTC-SE-DM curve is more stable and better than those of the other two (PNTC and PNTC-SE). Compared with PNTC or PNTC-SE, PNTC-SE-DM adapts better to a variety of datasets and achieves better generalization ability.

Fig. 4. The curve of accuracy.
In this paper, a new parallel implementation scheme, PNTC-SE-DM, is discussed, which is designed with the MapReduce programming model and implemented on Hadoop. The experimental results show that, for a fixed data size, the execution efficiency of the algorithm is directly related to the cluster scale. Moreover, while accuracy is maintained, the more cores are used, the better the speedup and sizeup. In addition, the results show that PNTC-SE-DM achieves better generalization ability than PNTC and PNTC-SE. An interesting topic for future research is real-time classification with PNTC-SE-DM in high-speed complex network environments.
Acknowledgements
This work is supported by the National Natural Science Foundation of China (Nos. 61163058 and 61363006), Guangxi Key Laboratory of Trusted Software (No. KX201306), and Guangxi Colleges and Universities Key Laboratory of Cloud Computing and Complex Systems (No. 14104).
