Abstract
Dynamic environment data generators are very often in real-world that produce data streams. A data source of a dynamic environment generates data streams in which the underlying data distribution changes very frequently with respect to time and hence results in concept drifts. As compared to the stationary environment, learning in the dynamic environment is very difficult due to the presence of concept drifts. Learning in dynamic environment requires evolutionary and adaptive approaches to be accommodated with the learning algorithms. Ensemble methods are commonly used to build classifiers for learning in a dynamic environment. The ensemble methods of learning are generally described at three very crucial aspects, namely, the learning and testing method employed, result integration method and forgetting mechanism for old concepts. In this paper, we propose a novel approach called Age Decay Accuracy Weighted (ADAW) ensemble architecture for learning in concept drifting data streams. The ADAW method assigned weights to the component classifiers based on its accuracy and its remaining life-time in the ensemble is such a way that ensures maximum accuracy. We empirically evaluated ADAW on benchmark artificial drifting data stream generators and real datasets and compared its performance with ten well-known state-of-the-art existing methods. The experimental results show that ADAW outperforms over the existing methods.
Introduction
In real-world applications, many data sources generate data streams, which changes continuously with time due to the change in their underlying data distribution [1, 2, 3, 4, 5, 6]. The dynamicity in the real-world data sources is very common due to the impact of seasonality or periodicity on the sensors responsible for generating data streams [1, 2, 3, 4, 5, 7, 8, 9]. The data streams generated in the dynamics environment are generally characterized by the properties of 3Vs in big data, i.e. volume, velocity, and variety [10]. The high-speed data generation and the requirement of quick real-time response to the data streams classification make the classical single classifier approaches very inappropriate in this situation [11]. In contrast to the classical classification approaches used so far for classification in a static environment, where the classification algorithms are rarely restricted on time of learning and testing instead of accuracy, the data streams mining imposes many new necessities on algorithms such as high-speed learning and limited memory usage [12]. Furthermore, in data streams, the non-stationary underlying data distribution often causes the evolution of new concepts from time to time. Such evolution of new concept in the data stream is also called concept drift [7, 13, 14]. The use of a classical single classifier approach for the classification of data streams with drifting concepts may lead to poor classification accuracy with time [15, 16].
Based on, how the feature values demonstrating the occurrence of concept drift changes in the data streams, one can categorize the concept drifts in various types, for example sudden, gradual, recurring and incremental [17, 18]. Figure 1 depicts some common kind of concept drifts occur in data streams.
Types of concept drifts.
A good data stream classification algorithm is supposed to perform equally well in the presence of any type of drifts. In recent years, many approaches have been proposed that are exclusively designed to deal with the classification of concept drifting data streams [9, 19, 20, 21, 22, 23, 24]. Some recently proposed methods for dealing with the mining of concept drifting data streams have utilized some common approaches such as windowing (both sliding and adaptive), ensemble methods [25] and drift detection strategies [9, 26].
Possible points of modifications in concept drifting data stream mining algorithms.
The desired properties of a good classification system for mining high-speed, voluminous and time-varying, open-ended data streams have been discussed in [12], which includes the properties such as constant time constraints, limited primary memory requirement, one scan learning, and adaptive towards the changes in the underlying data distribution.
In general, the ensemble-based approaches designed for mining data streams with concept drift mainly targets the following five possible points of modifications: ensemble architecture, learning, and testing strategy, drift detection methods, forgetting mechanism and result integration rules as shown in Fig. 2. In this paper, we propose an ensemble-based architecture and algorithm to organize the components of the ensemble at all five levels. We empirically evaluated our designed model on some benchmark datasets and compared its performance with some well-known existing ensemble methods. The results of our ensemble model outperform the existing methods.
The rest of the paper is organized as follows. Section 2 describes the concepts related to data streams mining and concept drifts. Section 3 presents the literature survey of some pioneering work done so far for mining of concept drifting data streams. In Section 4, we present our proposed novel approach for handling concept drifting data streams. Section 5 provide a description of data collection and experiment evaluation of our model with some existing algorithms over some synthetic and real datasets. Finally, Section 6 and 7 present the conclusion and future scope of the work.
The problem of a standard non-streaming dataset classification task can be explained as follows: let
In contrast to the non-streaming dataset, a data stream
Generally, an incremental learning strategy is recommended for data stream mining [27, 28]. In incremental learning approach an incremental learner
Concept drifts
The occurrence of concept drifts in a data stream can be better understand by using Bayesian interpretation or Bayesian decision theory as presented by Kelly et al. [15] and explained using Eq. (1). Kelly et al. explained that the change in population distribution could occur in three ways, which may result to concept drift.
(i) The class priors
(ii) The distributions of one or several classes
(iii) The posterior distributions of the class memberships
Abstractly, the mining process of concept drifting data streams can be considered as a two-step process: (i) drift detection method and (ii) taking corrective actions to neutralize the adverse effects on the classification accuracy due to the drifts. Mandatorily, both these steps should qualify the time and space constraints generally required for data stream mining. Some commonly utilized approach for drift detection in data streams includes the monitoring of change in the underlying change in the data distribution, monitoring the change in the class conditional probabilities or monitoring the average accuracy drop with time. After drift has been detected, certain corrective actions are required to cope up with the increased misclassification errors. These corrective actions include updating, rebuilding or dropping of the classifiers depending on the used strategy.
Based on the various approaches proposed in some recent pioneer work done [12, 25, 26, 29, 30], we can abstractly categorized the methods for handling concept drifting data streams into three major categories as (i) windowing-based methods [12, 30] (ii) methods based on separate drift detection strategy [26, 29] and (iii) ensemble-based methods [25, 31, 32]. While focusing on these major identified categorizations, in this section, we present discussions on some related work under these categories.
Windowbased strategies
Window-based strategies are very common in incremental learning of data streams [27]. Mainly sliding window-based approaches are very popular because of their inherited tendency of always keeping recent training instances limited to the window size [33]. However, the deciding of window size is a kind of trade-off. Small window sizes are fast in the change detection but they may suffer from false change detection. Similarly, large window sizes are slow in the change detection but are more accurate to change detection. To overcome the slow response of a large window approach toward the change in data distribution, various adaptive windowing strategies are proposed [30, 34]. Depending on the size selection strategies, the window-based strategies can be further categorized into fixed-sized and adaptive window-based strategies. In an adaptive windowing strategy, the size of the window is adjusted automatically depending upon certain measures like rate of change or any other statistical measures. Kifer et al. [35] proposed a fixed-sized window-based approach for the detection and estimation of changes in data streams. In this work, a novel distance measures between distributions has been introduced to compute the change in distribution between reference and current window. Klinkenberg et al. [36] proposed the Support Vector Machine (SVM) based concept drifts detection method for the data stream. This approach uses adaptive windowing to reduce the estimated generalization error. The estimated error is computed using
The window-based learning strategy can be used in either of two ways for data stream mining: As an external mechanism in a learning model for detecting the change or as an integral part of the learning algorithm. The Very Fast Decision Tree (VFDT) of Hoeffding Tree (HT) [12] an incremental learning anytime decision tree induction algorithm. VFDT is a decision tree based on Hoeffding bounds. The VFDT can learn from a large data stream. It does not store any example in the main memory and the memory requirement of VFDT is proportional to the size of the decision tree. It performs one pass learning that’s why it does not requires to storing the example ever. To achieve one-pass learning and low memory utilization, the VFDT utilizes the Hoeffding bound. The Hoeffding bound is stated as follows. Let
The Hoeffding bound is independent of the probability distribution, which is generating the observation. The major shortfall of VFDT is that it is designed for stationary massive data streams. Later, to cope up with the concept drifting data streams, the VFDT was modified to CVFDT (Concept-adapting Very Fast Decision Tree learner) algorithm [37]. The CVFDT is a window-based algorithm that is capable of mining time-changing data streams. To handle drifting data streams, CVFDT grows alternate subtree with the recent best attribute and replaces the old subtree when the alternate subtree one becomes more accurate. Based on the accuracy of the alternate subtrees computed on some examples, the low accuracy subtrees are pruned to maintain the low requirement of memory. The CVFDT dynamically shrinks the window size when many nodes get questionable or data rate changes rapidly and increase the window size when only few nodes are questionable. Both, the VFDT and the CVFDT attain the same accuracy.
Adaptive Windowing (ADWIN) [30] provides an approach that deals with drifting data streams. The ADWIN is also a sliding window-based algorithm that dynamically changes the window size depending on the online computation of the rate of change observed from instance within the window itself. The algorithm automatically increases the window size when no change in data is observed and shrinks window when data changes. By adapting the window size, the ADWIN manages an optimal trade-off between small and large window size dynamically.
Hoeffding Adaptive Tree (HAT) [38], is a variation of HWT (Hoeffding Window Tree) that adaptively learn from time-changing data streams with having constraints of the fixed sized sliding window concepts. The solution of the optimal window size is not an easy deal in streaming data due to the varying rate of change of frequency distribution in data stream. Instead of depending on a size parameter, the HAT manages certain statistics at the nodes of Hoeffding Tree (HT) to estimate the frequency statistics at every node. Three variations of HAT have been proposed in [38], which include HAT-INC, HAT-EWMA and HAT-ADWIN. The HAT-INC is based on based linear, HAT-EWMA is based on exponential weight moving average and HAT-ADWIN is based on ADWIN estimator. The HAT has two main advantages over HWT, first, in HAT, there is no need to calculate optimal window size and second HAT keeps all data in main memory instead of keeping it in secondary memory as HWT does. The Hoeffding Option Trees (HOT) [39] provides additional optional nodes on HT, which allows several tests on HT. This leads to many HTs as a separate path. HOT is consisting of single tree structure that can efficiently represents many trees.
There are two very fundamental approaches in mining time of varying data streams: (i) practicing regular periodic updates in the learning model without considering about the changes and (ii) first detect the change and then adapt the learning model accordingly. In the former approach, weighted examples and time windows of fixed size are utilized. For instance, the higher weights can be given to recent examples [36, 40]. In a time window-based approach only those instances are included for training which are included in the window. The later approach is based on certain measures like monitoring the change in accuracy or change in the underlying distribution, etc. Such concept changes are monitored regularly and if a concept drift is detected the learning model is adapted accordingly to cope up with the adverse effects of concept drifts. This approach is also called drift detection-based methods. The performance measures of drift detection methods include the following measures [32]:
Drift detection delay: The time delay between actual drift detection time and drift detection time by the detector.
Number of true positive drift detection: Total number of detected drifts, which are actual drifts.
Number of false positive drift detection: Total number of detected drifts, which are not actual drifts.
Drift detection accuracy: Ratio of actual drifts with a total number of drifts detected.
For good drift detectors, drift detection delay should be low and accuracy should be high. In the group of drift detectors, the Drift Detection Method (DDM) [29] and Early Drift Detection Method (EDDM) [26] are very popular. The DDM is based on the fact that the prediction error can be approximated through a binomial distribution. To track the error rate in DDM, certain alarm values are decided for detecting the changes. Once the alarm is reached, the older classifier is dropped and a new classifier is learned. The DDM is good in detecting sudden drifts but not well for gradual drifts. Furthermore, DDM is not so quick in drift detection. The EDDM is a kind of variant of DDM, in which the frequency of misclassified instance is considered for quick detection of drifts. The EDDM is comparatively well for gradual drifts as compared to DDM.
A variety of machine learning tasks like classification [41], clustering [42, 43] and reinforcement learning [44] have been approached by ensemble methods. A large group of researchers has shown their affinity in the ensemble-based methods for data stream mining with concept drifts [9, 19, 25, 31]. Test-then-train incremental methods and bagging [45] are commonly used for learning in data streams. Online bagging is an online version of incremental learning in which the member classifiers are learned incrementally and the decision are integrated using majority weighting. Leveraging Bagging (Lev) [46] is a variant of bagging that adds randomization to the input and output of the classifiers with a simple bagging strategy. An online version of bagging and boosting is presented by Oza (OzaBag and OzaBoost) [47] that is based on one pass learning. The Weighted Majority Algorithm (WMA) [48] method is based on the pool of classifiers, where each classifier are called an expert. Initially, each classifier is assigned a weight of one and whenever any expert makes the wrong prediction, its weight is multiplied by fixed
Where, the probability of classifier
Where
ADAW architecture.
The architecture of an ensemble method of classification explains the interaction among the component classifiers within the ensemble. Various ensemble architectures can be categorized based on their topologies. Heitor et al. [52] attempt to classify the various ensemble architecture under the following major categories: (i) Flat (ii) meta-learner (iii) hierarchical and (iv) network. The architecture of the proposed ADAW method is given in Fig. 3. Our proposed ensemble architecture and its organization is based on the certain modifications of the three very important aspects of the commonly used strategy of ensemble building [9, 13, 22, 53, 54, 55], that is, how component classifiers are virtually arranged, how the component classifiers are participating in decision making, and how the learning is performed. These modification phases are depicted in Fig. 2. Based on the standard practices involved in the designing of an ensemble of classifiers for online learning in drifting data stream, we illustrate our approach under the following categories: (i) The structure of ensemble ADAW (ii) learning and testing strategy used in ADAW (iii) method of weight assignment to the component classifiers and decision integration rule and (iv) forgetting mechanism. The complete ADAW algorithm is also illustrated in Algorithm A.
Array-based structure and decision integration rule of ADAW.
Let
Learning and testing strategy of ADAW
Batch setting approaches such as holdout and K-fold-cross-validation are commonly used for the supervised learning and testing of classifiers using static datasets [56]. In batch learning, the training dataset is divided into two sets, one part is used for training the model and the other is used for testing the model. The K-fold cross-validation offers certain modifications to the holdout method. In K-fold cross-validation the labelled dataset is divided into
The one-pass learning requirements of data streams limit the scope of these approaches in learning and testing of classifiers used for data stream classification. Instead of using holdout or k-folds, in ADAW, we followed two separate procedures, one for the training and testing of every newly inserted component classifier and one for only testing of the already existing component classifiers in the ensemble
Test-then-train: All newly introduced component classifiers are trained in test-then-train fashion. In this approach of learning each coming instance is first tested on the newly inserted component classifier and if the classifier made the wrong prediction for the instance then only learning is performed [57].
Test-only: Instead of performing training, this procedure merely evaluates the accuracy of the existing component classifiers on the recent data chunk. Therefore, instead of learning, all existing component classifiers in the ensemble only perform prediction for the coming new instances and their accuracies are updated accordingly for every data chunk.
The learning and testing strategy of ADAW makes it different from AUE, in AUE the test-then-train has been applied to all existing component classifiers for all recent data chunks. The intention of having two different procedures of learning and testing strategy for newly created component classifier and the existing old component classifiers in ADAW is to preserve some outdated concepts with the ensemble. By doing this, the ADAW maintains a kind of inherited diversity within the ensemble due to which it performs well in the presence of recurring drifts.
The weight assignment to the component classifiers in an ensemble of classifiers is a very crucial step because weights of the component classifiers play a very important role in the decision integration rule of the ensemble. The weight assignment function of ADAW is based on the two very simple concepts. The first concept is based on the consideration of the temporal and spatial locality of reference and the second is based on the observation of the Wang et al. [51]. According to Wang et al., if an ensemble of classifiers
The temporal and spatial locality of reference consideration is one of the most commonly used strategies for estimating the distribution of newly arriving instances. To ensure this consideration in ADAW, the newly introduced classifiers always assigned the highest index value
The weight
Where the accuracy
The ADAW simply utilizes the weighted sum as given in Eq. (10) for the decision integration of all the component classifiers in the ensemble for predicting the class of an instance. The weighted sum approach used in ADAW is depicted in Fig. 4, where
The forgetting of outdated concepts is a very common practice in online data stream classification for limiting the memory requirements. Generally, in the single classifier approach, the forgetting of the old concepts is achieved by simply dropping the old classifier and rebuilding a new classifier trained on the recent data chunk. However, the ensemble methods deal with forgetting by dropping of the least accurate classifier component from the ensemble [19, 51]. Based on the same principle, the ADAW also drops the one component classifier from the ensemble to maintain low memory requirements whenever the ensemble attain its maximum size (i.e.
Data collection and experimental setup
The proposed algorithm ADAW is evaluated for many experiments comprised of different types of drifting data streams. Furthermore, the performance of ADAW is also compared with ten well-known state-of-the-art existing methods. In the following sections, first, we present a discussion on datasets and then we present the experimental setup.
Dataset characteristics
Dataset characteristics
The performance of the proposed algorithm is empirically evaluated on a few artificial and real datasets for accuracy and CPU time elapsed. For the analysis of the proposed algorithm on various types of concept drifts, we prepared a group of artificial datasets, which includes sudden drift, recurring drift, incremental drift and gradual drift as shown in Table 1. We used Massive Online Analysis (MOA) [58] framework for generating synthetic real-time data streams. A brief description of the theses generated data stream is given below.
Rotating Hyperplane (Hyp): Hyperplane 3̧7 is the data stream generator that generates data streams with incremental drifts by simply rotating the decision boundary with every successive instance. The hyperplane generator is one of the most popular generators used for the analysis of data stream mining algorithms. There are 10 attributes in this dataset. We have utilized the hyperplane generator to generate 1,000,000 instances with 5% noise and contains incremental drifts with magnitude changing by weight of 0.001 for each instance.
Streaming Ensemble Algorithm (SEA) Concepts Generator: The SEA [50] Concept Generator generates points between 0 and 10. In this generator, three attributes are utilized for generating points, however, out of the three attributes, only the first two attributes are relevant. The SEA concept generator generates sudden drifts. The generated points are grouped into four different blocks, where each block represents a different concept upon a threshold value. The threshold value is the upper bound of the sum of the first two attributes. For performing the experiments, a total of 1,000,000 instances are generated with three introduced drifts at instances 250K, 500K, and 750K and with 10% noise.
Random Radial Basis Function (RBF): The Radial Basis Function generates a numerical dataset in which classes are represented by the drifting centroids of the hypersphere. Each center is designated by a random position, weight, class and a single standard deviation that generates gradual drifts. By using this generator, we generated a dataset with 1,000,000 instances having 10 attributes. This dataset contains two gradual concept drifts at instance position 400K and 800K, where each concept holding four decision classes.
Random Tree Generator (RTG): The Random Tree Generator generates a data stream with recurring drifts. For experiment purposes, we have generated RTG data stream of 1,000,000 instances that contains 5 numerical and 5 nominal features. Three drifts are introduced at instance positions 300K, 600K, and 900K.
Light Emitting Diode (LED) dataset: The LED generator [59] used for generating the LED dataset. The LED dataset is an artificial data stream that contains 24 binary attributes which are used to denote the digits in seven segment display constituting 10 classes. The LED data stream generator is used to generate two variants of LED type datasets, the LED dataset, and LED-Drift dataset. The LED dataset contains 1,000,000 instances with no drifts and with relatively high noise of 20%. The LED-Drift dataset also has 1,000,000 instances but with three introduced drifts at 300K, 600K, and 900K that constitute a mixed kind of drifts with 10% noise. A mixed kind of drifting data stream is challenging to learn.
Forest Cover Types (Cov): The Forest Cover Types (Cov) [60] dataset contains information on wilderness areas of the US obtained by US forest service. It is widely used for classification purposes. The Forest Cover Type dataset contains 7 classes of forest cover types and having 54 attributes constituting 12 measures that include 4 binary wilderness areas, 40 binary variables for soil type and 10 quantitative. The dataset is having 581,012 instances.
Poker Hand (Poker): The Poker Hand (Poker) [61] dataset is consisting of a hand containing five playing cards. Each card is described by two attributes rank and suit for 10 predictive attributes. The class attribute is described by “Poker Hand”. This dataset is consisting of 1,025,010 instances. It is widely associated with classification tasks.
Experimental setup
This section presents the comparative study of ADAW with ten well-known state-of-the-art existing methods of data stream mining. We utilized prequential accuracy and learning time for comparing the performance of all other methods with ADAW. We utilized MOA framework for comparing ADAW with the existing algorithms. All these algorithms are implemented in java and are part of MOA framework. We implemented our approach using JAVA.
The experiments were carried out on an octa-core i7 processor of frequency 1.8 GHz and 16 GB RAM. To make the comparative analysis, we utilized the same parameter values for all algorithms. For instance, we used training chunk size
Based on the crucial characteristics of streaming data, we evaluated the performance of the algorithms on their accuracy and the CPU time elapsed. We used a test-then-train strategy to train the newly inserted component classifiers by using a fixed-sized data chunk of size
The selection of the maximum number of component classifiers in an ensemble is a very crucial decision [62]. However, both the large and small number of component classifiers in an ensemble is having their own advantages and disadvantages. The ensembles that are having a large number of component classifiers are beneficial in the presence of recurring drifts in data streams; however, having the large number of component classifiers has high memory requirements and often contains outdated concepts. Similarly, for those data streams, which contain frequent sudden drifts with evolving new concepts, the large size ensemble is worthless. Although, the small size ensemble is good in terms of memory requirements, however, in this case, the data streams with recurring drifts the ensemble can produce low accuracy. The optimal situation must be like that a small number of component classifiers in an ensemble might be able to produce maximum accuracy for all concepts present in the data stream [62].
Comparative study and result analysis
We have conducted a set of experiments to compare the performance of ADAW against AUE, AWE, DDM, DWM, HOT, Leveraging Bagging (Lev), Naïve Bayesian (NB), Oza, VFDT and WMA. In order to have the representation of all major categories of approaches that we have discussed in related work section, we have intentionally selected such set of data stream classification methods, we selected AWE and AUE as we have considered them as a base for our proposed method, which is also based on ensemble approach and follows the accuracy-based weighting of the component classifiers. We have selected NB as it is a popular single classifier method, which does not exclusively employ any drift detection method. The HOT is selected as a prominent representative of the hybrid approach, which utilized both; the incremental learning and ensemble approach. The VFDT is selected as a potential representation of very fast incremental online learner. The WMA covers one of the best-known decision integration rules. Lev covers a variant of bagging strategy, whereas the Oza is a variant of online bagging strategy. Both Lev and Oza also follows the ensemble strategy. The proposed and the selected algorithms are evaluated on the two performance parameters: the average accuracies of the algorithms and the CPU time elapsed in performing the experiments.
Average classification accuracies in percentage [%]
Average classification accuracies in percentage [%]
CPU time elapsed [seconds]
Runtime accuracies of various classification methods on various datasets.
The average classifier accuracies and the CPU time elapsed in performing the experiments for all algorithms are shown in Tables 2 and 3 respectively. In addition to the average accuracy and the CPU time in tabular format, the run time prequential accuracies per chunk of the algorithms with processed instances is also depicted with help of graph plots as shown in Fig. 5a–h, for all utilized data stream generators. In the graphical representation, the X-axis represents the processed instances and the Y-axis represents the prequential accuracy achieved by the classifiers per data chunk.
The graphs in Fig. 5a depicts the prequential accuracies of all algorithms on hyperplane (HYP) data stream generator. The hyperplane data stream generator generates data streams with incremental drifts. From Table 2, it can be observed that the ADAW is leading over the rest of the classifiers with an average accuracy of 94.86 %. The ADAW is very closely followed by WMA, which has shown an average accuracy of 94.07%. The minimum average accuracy of 84.17% is shown by NB. As given in Table 3, the CPU time consumed by Lev is 1548.03 seconds which highest among all other evaluated methods. The ADAW consumed CPU time of 403.03 seconds, which is very large as compared to NB that has taken CPU time of 0.92 seconds only. In the presence of incremental concept drifts, the concepts keep changing very slowly within the data stream; therefore, each consecutive data chunk has high possibility of having a very low deviation in the feature values. Generally, such consideration of consecutiveness among the component classifiers can be inherently achieved by maintaining the order of insertion of new component classifiers in the ensemble. The ADAW maintains this generalization by utilizing the array-based ensemble structure. The array-based ensemble structure also maintains the temporal and spatial locality of reference for the underlying data distribution of the data stream. The obvious reason for the over the performance of ADAW on other methods is that the ADAW has utilized the remaining age factor
The graphs in Fig. 5b show the average accuracy of evaluated methods on the SEA data stream. The sea data stream contains sudden drifts. The ADAW performs well with an average accuracy of 89.32% as compared to other methods on the SEA data stream. For this dataset, the ADWA is followed by AUE and Lev, which gave an average accuracy of 88.26% and 88.31% respectively. The very often property of any classifier used for the classification of the data stream containing sudden drifts is that the accuracy of the classifier drops suddenly. This property can be easily verified by the graphs in Fig. 5b. An intuitive solution for caring for the presence of sudden drifts in the data stream can be obtained by always updating the classifiers with the recent data distribution of the data stream. In an ensemble approach, this requirement of routine updating can be satisfied by simply having the newly inserted component classifier trained over the recent data chunk. The ADAW fulfills this requirement by always adding the new component classier trained on the recent data chunk. In ADAW, the remaining age value
The graphs in Fig. 5c shows the runtime accuracies of all evaluated methods on the RBF data stream. The RBF dataset contains gradual drifts. For this dataset, two drifts are introduced at instance 400K and 800k. From the graph, it can observe that the accuracies of the evaluated classifiers can be partitioned in three fluctuating range of accuracies 50% to 75%, 80% to 90% and more than 90%. The AUE, Lev, and ADAW belong to the range having accuracies more than 90% with a highest average accuracy of 98.82% is given by ADAW that is sharply followed by AUE and Lev showing average accuracy of 98.03% and 98.72% respectively. A drop in average runtime accuracy can be observed at 300K for DWM, NB, AWE, DDM and at 800K for HOT and WMA. Similarly, a slight drop in prequential accuracies is also observed between 300K and 400K for AUE and Oza. However, the ADAW has shown quite a consistent accuracy throughout the stream. From Table 3, it can be observed that a very high CPU time of 1037.25 seconds is consumed by Lev as compared to NB which has only consumed the CPU time of 4.07 seconds. For the RBF dataset, the ADAW consumes a reasonable CPU time of 443.36 seconds and tries to maintain a kind of trade-off between accuracy and CPU time.
The graphs in Fig. 5d depicts the accuracy of all algorithms including ADAW on the data stream generated by the Random Tree Generator (RTG). The RTG contains recurring drifts. From Table 2 it can be observed that ADAW attained average accuracy of 93.49%, which is maximum for the given set of algorithms and slightly better than the average accuracy of Lev, which is 93.37%. The NB performed worst in the RTG data stream generator with an average accuracy of 58.37% only. As shown in Table 3, the largest CPU time of 1222.36 seconds is consumed by Lev which is very high, whereas ADAW consumed CPU time of 307.43 seconds. The least CPU time is consumed by NB which is about 0.81 seconds. From the graphs in Fig. 5d, it can be easily observed that ADAW is closely followed by AUE and Lev, however, the ADAW gained more accuracy over AUE at some points around 450K and 700K. This gain in the accuracy of ADAW is due to the property of ADAW that allows only newly inserted component classifier to be trained by the recent data chunk and restricts old component classifiers only for testing instead of learning on recent data chunk. By doing this, the ensemble preserves some outdated concepts with it.
The graphs in Fig. 5e depicts the accuracy of all algorithms on the LED data stream generator. For the LED dataset all methods show a relatively low accuracy as compared to other datasets. The accuracies for this dataset ranges between 70% to 75% only. Here, the ADAW outperforms with an average accuracy of 74.15%. The standard deviation of the average accuracies of all selected methods is very small. In LED, the ADAW is immediately followed by DDM and NB with an average accuracy of 74.13% for both. From Table 3, it can be observed that the DWM consumed the highest CPU time of 139.68 seconds and NB has consumed the lowest CPU time of 2.07 seconds as compared to all other evaluated algorithms. The proposed method, ADAW has consumed a considerably moderate amount of CPU time of 60.23 seconds.
The graphs in Fig. 5f are showing the runtime accuracies of the evaluated methods on the LED-Drift artificial dataset. The LED-Drift data stream generator generates data streams with mixed type concept drifts. With the mixed type of concept drifts, the LED-Drift simulates the real-world drifting data stream to a considerable degree. As it can be easily observed from the graphs that the evaluated algorithm can be classifies into two sets, the one set to include those methods which are performing relatively more accurately and consistent towards the introduced drifts at instances 300K, 600K, and 900K. This set includes DWM, Lev, ADAW, AWE, DDM, and AUE. The other set includes those methods which are relatively less consistent and accurate. This set include NB, WMA, HOT, VFDT, and Oza. From the graphs in Fig. 5f, a clear drop in accuracies can be observed for the algorithms belong to the later set. For the LED-Drift dataset, the best average accuracy is shown by ADAW which is 73.84%. Here, the ADAW performs slightly better than AUE, AWE, DDM, DWM, and Lev, which are showing the accuracy of 73.64%, 73.66%, 73.70%, 73.71% and 73.71% respectively. For this dataset, NB has shown the worst performance with 62.45% and it is most inconsistent too. From Table 3, it can observe that the highest CPU time of 132.46 seconds is consumed by DWM and the lowest CPU of 2.07 seconds is consumed by NB, however, the ADAW again consumed the moderate CPU time of 61.17 seconds.
The graphs in Fig. 5g show the prequential accuracies of selected methods and ADAW on the Poker-Hand (Poker) real dataset whose drift type is unknown. The selected methods for this dataset show very low average accuracies as compared to other datasets used for performing empirical evaluation in this work. The average accuracies ranged between 47% to about 50.5% only. Despite of low accuracies, the performance is quite consistent for all methods in the Poker dataset. The ADAW shows the best average accuracy with 51.39% and the worst average accuracy is shown by AWE with 47.61% however the standard deviation of the averages accuracies is very low for the Poker dataset. As given in Table 3, the Oza takes the highest CPU time of 54.94 seconds and the NB consumed only 1.90 seconds for training and testing, however, the takes only 19.37 seconds of CPU time and provides the best accuracy for Poker dataset.
The prequential accuracies of analyzed methods over the Forest Cover Type (Cov) dataset are shown by the graphs in Fig. 5h. The Cov dataset is among the prominent datasets used for evaluating massive data mining methods. For the Cov dataset, the best accuracy is shown by Lev with 92.06% which is slightly better than ADAW that has shown the average accuracy of 91.83%. The NB is the worst performer with an average accuracy of 61.07%, which also performs very inconsistently as its runtime accuracies range from 41% to 80%. From Table 3, it can be noted that the Oza consumed the highest CPU time of 159.07 seconds and for NB CPU time of 4.37 seconds is elapsed, whereas a moderate amount of 63.17 is only consumed by ADAW.
Statistical analysis
In order to extend our analysis, we performed statistical analysis to get clearer insights into the experimental results. We performed nonparametric related-samples Friedman’s two-way analysis of variance by ranks to observe the statistical significance of accuracies and the CPU time of the experimented algorithms over different datasets. We started with the null hypothesis
Statistical observation of related-samples friedman’s two-way analysis of variance by ranks
Mean ranks of evaluated algorithms
Table 4 shows the statistical observations of related-samples Friedman’s two-way analysis of variance by ranks. As given in Table 4, the test statistic (Chi-Square) values are 36.52 and 64.93 for accuracies and CPU Times of the different algorithms respectively. Considering the significance level
Table 5 shows the mean ranks of evaluated algorithms for average accuracies and CPU time. Note that for accuracies the higher mean rank is better and for CPU time the lower mean rank is better. From Table 5, it can be noted that ADAW holds 1
Related-samples friedman’s two-way analysis of variance by rank for accuracies of evaluated algorithms.
Pairwise comparisons of different algorithms for their accuracies.
In addition to Friedman’s test, we also applied the Bonferroni test to statistically perform the multiple comparison test among the evaluated algorithms. Figure 7 shows a graphical representation comparison table that we received after applying the Bonferroni test. This represents depicts the pairwise comparisons of different algorithms for their mean ranks with respect to the accuracies. In the graph of Fig. 7, each node shows the mean rank of evaluated algorithms, the blue edges represent those pairs of algorithms for which there are statistically significant differences among the pairs and the red edges represent those algorithms pairs for which the statistical difference are nonsignificant. On applying the Bonferroni test, we observed that the ADAW performs significantly better than OzaBag, VFDT, NB, and AWE, however, the statistical significances are not enough to draw the same conclusion for rest of the algorithms. We also performed the same kind of statistical analysis for CPU time elapsed by all the algorithms and observed that ADAW is significantly faster than Lev and Oza. For the rest of the methods, the differences are nonsignificant to reach the same conclusion.
In this paper, we proposed a novel method called Age Decay Accuracy Weighted (ADAW) for the classification of concept drifting data streams. The ADAW follows an array-based ensemble architecture, where the weights to the component classifiers are assigned and updated based on the remaining age and accuracies for the recent data chunk. The novelty of the ADAW is its architecture and its weight assignment mechanism for the component classifiers. We empirically evaluated the performance of ADAW with ten existing state-of-art classifiers over six large artificial datasets and two large real-world datasets comprised of recurring, gradual, sudden and mixed kinds of drifts. The results show the ADAW performs very well for all kind of drifts and over performs most of the algorithms that we have evaluated. While performing empirical evaluation for different algorithms and datasets, we observed that the ensemble-based methods suffer from accuracy drop in the presence of sudden concept drifts and the drift detection-based methods are less sensitive to the gradual drifts. We also observed that ensemble-based methods are more robust towards abrupt changes as compared to single classifiers. As future work, we are planning to explore the opportunity of adaptiveness in ADAW where we are used a fixed size data chunk. We believe that, by having an adaptive data chunk size, we can reduce the average learning time sufficiently.
