Abstract
With the rapid development of information technology, data streams in many fields are characterized by rapid arrival, complex structure and the need for timely processing. Complex types of data streams degrade classification performance, and ensemble classification, which generally outperforms traditional single classifiers, has become one of the main methods for processing them. This article presents, for the first time, a survey of ensemble classification algorithms for complex data streams. It analyzes the advantages and disadvantages of these algorithms for steady-state, concept drift, imbalanced, multi-label and multi-instance data streams. It also introduces the application fields of data streams and summarizes the ensemble algorithms for processing text, graph and big data streams. Moreover, it comprehensively summarizes the verification techniques, evaluation indicators and open source platforms for complex data stream mining algorithms. Finally, it presents the challenges and future research directions for ensemble learning algorithms dealing with uncertain, multi-type, delayed and multi-type concept drift data streams.
Overview
Traditional machine learning algorithms mainly deal with steady data. Data now arrives in the form of streams, which has changed the way people store, communicate and process data. In many new applications, such as web clickstream analysis, network logs, traffic data monitoring and management, and telecom data management, people are faced with constantly changing data streams [1]. As research on data streams has progressed, more complex types of data streams have emerged, and satisfactory performance cannot be obtained when processing imbalanced, high-dimensional or noisy data. In this context, how to construct an efficient knowledge discovery and mining model has become an important topic in the field of data mining.
Ensemble learning is a research hotspot that aims to integrate data fusion, data modeling and data mining into a unified framework. Classic data stream ensemble classification algorithms include the Online Accuracy Updated Ensemble (OAUE) [2] and Leveraging Bagging (LevBag) [3]. Ensemble algorithms that deal with concept drift include the Streaming Ensemble Algorithm (SEA) [4] and Dynamic Weighted Majority (DWM) [5]. Ensemble algorithms that deal with imbalanced data include SMOTEBoost [6] and the Resample-based Ensemble Framework for Drifting Imbalanced Stream (RE-DI) [7]. Ensemble algorithms that process big data streams include the Vertical Hoeffding Tree (VHT) [8] and HSMiner [9]. These are some classic ensemble algorithms for complex data streams.
This article carefully analyzes classification algorithms for complex data streams and compares them with previous surveys of data stream classification. Lemaire et al. [10] reviewed data stream learning methods, focusing mainly on processing data streams with a single classifier; the paper also briefly introduced common data sets and evaluation methods. The review of Gomes [11] is a very comprehensive review of data stream ensemble classification, covering classifier combination, diversity and dynamic updates, with a detailed introduction mainly to concept drift data streams. Krawczyk et al. [12] gave a detailed introduction that ranges from steady-state data to dynamic data streams and from ensemble classification to regression, and also raised some advanced issues of data streams for discussion. Although the above surveys introduce common data stream processing methods, most do not subdivide the types of data streams and still treat concept drift as the main research object. In addition, evaluation techniques and indicators for data streams are mentioned only briefly and are not summarized. The structure of this paper is shown in Fig. 1. The main contributions are as follows:
(1) For the first time, this article comprehensively summarizes the complex types of data streams studied in recent years and compares the proposed ensemble algorithms. At present, no survey summarizes the ensemble algorithms for the various types of data streams from the perspective of ensemble methods.
(2) The article also summarizes the recent development of ensemble algorithms for text, graph and big data streams.
(3) For the first time, this article comprehensively summarizes and compares the evaluation techniques and evaluation indicators used in the experimental analysis of complex data streams. Existing reviews do not provide a comprehensive overview of evaluation methods.

Comprehensive classification framework for complex data streams.
In recent years, great progress has been made in data stream mining in obtaining useful models from large amounts of rapidly generated data, and traditional static batch processing methods have gradually become unsuitable for non-static learning environments. The transition from traditional steady-state data to dynamic data means that processing data in a non-static environment imposes new requirements.
Data streams
A data stream is a potentially unbounded, ordered sequence of data items arriving over time. Under the framework of fully supervised learning, let $S = \{x_1, x_2, \ldots, x_{t-1}, x_t, x_{t+1}, \ldots\}$, where $x_t$ represents the attribute values at time t and $y_t$ is the label at time t. The goal of data stream classification is to train a classifier that establishes the mapping between feature vectors and class labels. Most existing data stream mining solutions assume that the data is stable. In the real world, however, data streams are usually generated in a non-stationary environment, which means that the underlying distribution of the data can change arbitrarily over time [13]. In a non-static environment, more complex data representations are encountered: imbalanced data, large-volume data, multi-instance data and multi-label data all appear in data streams. Next, we briefly discuss the challenges of these data streams.
(1) Concept Drift
Nowadays, data appears as a data stream rather than as a static data set observed in a static environment. In such a non-stationary environment, the concept underlying the data and the data distribution change over time; this phenomenon is called concept drift [14]. Concept drift can be defined from the perspective of probability: when the joint probability changes between two time points $t_0$ and $t_1$, concept drift occurs, namely Equation (1):
$$\exists\, x : P_{t_0}(x, y) \neq P_{t_1}(x, y) \tag{1}$$
According to the speed of change, drift can be divided into sudden, incremental, gradual and recurring drift, as shown in Fig. 2. If one source data distribution is replaced by another, the drift is called sudden. When a mixture of the two data distributions begins to be observed, the drift is called incremental. When instances of the old data distribution begin to decrease and instances of the new data distribution begin to increase, the drift is defined as gradual. Recurring drift refers to a situation where instances of new or old concepts begin to recur after a certain time [15].

Concept drift types.
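The drift types described above can be illustrated with a small generator of "which concept produced this instance" schedules. This is a minimal sketch under stated assumptions: the function name `concept_schedule`, its parameters and the linear mixing ramp are hypothetical choices for illustration, and only the sudden, gradual and recurring types are covered.

```python
import random

def concept_schedule(kind, n, change_at, width=0, seed=0):
    """Return n flags: 0 if the old concept generates instance t, 1 if the new one does."""
    rng = random.Random(seed)
    schedule = []
    for t in range(n):
        if kind == "sudden":
            # The old source is replaced by the new one at a single point.
            p_new = 0.0 if t < change_at else 1.0
        elif kind == "gradual":
            # Old instances become rarer and new ones more frequent over `width` steps
            # (assumed linear ramp; requires width > 0).
            p_new = min(1.0, max(0.0, (t - change_at) / width))
        elif kind == "recurring":
            # The old concept comes back after `width` steps of the new one.
            p_new = 1.0 if change_at <= t < change_at + width else 0.0
        else:
            raise ValueError(kind)
        schedule.append(1 if rng.random() < p_new else 0)
    return schedule

sudden = concept_schedule("sudden", 100, 50)
gradual = concept_schedule("gradual", 100, 30, width=40)
recurring = concept_schedule("recurring", 100, 40, width=20)
```

Each flag can then select one of two labelling functions when a synthetic stream is generated, which is a common way to test how quickly a classifier recovers after the change.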
(2) Other Challenges
Real data streams exhibit not only concept drift but also class imbalance. A typical example is technical diagnosis, where the probability of failure increases with time in use. Sometimes the relationship between the minority class and the majority class changes, and the former minority class becomes the majority class [16]; this occurs in social media applications, for example in the popularity of Twitter discussion topics. In practice, the minority classes are often the focus of research.
Non-standard data is discussed in [12]. In recent years, non-standard data and class structures have attracted more and more attention from the machine learning community. In data stream mining, multi-label and multi-instance learning is still a largely untapped field. Multi-label data streams arise in many real-world applications; a multi-label data stream is a stream whose instances have the same attributes as multi-label data. Typical multi-label data includes news articles and medical text [17]. Likewise, multi-instance data has received little attention in the stream mining setting. Most work on multi-instance learning is concentrated in bioinformatics, image retrieval, acoustic classification and online video processing.
In the era of big data, with the development of computer science and Internet technology, big data streams are growing at an exponential rate. According to statistics, Google processes hundreds of petabytes of data every day, Facebook generates more than 10 petabytes of log data every month and Taobao generates dozens of terabytes of online transaction data every day [18]. Moreover, most of the data presented to users, such as text and graph data, is unstructured. These complex data structures are difficult to represent and have gradually become difficult and actively studied problems in data stream mining.
From steady-state data to data in a non-static environment, the learning methods applied to the data keep changing, as shown in Fig. 3. Traditional learning generally uses batch learning, which can only process a relatively small amount of data. With the advent of the big data era, other learning methods are needed, such as incremental learning, online learning and block-based learning, which are commonly used and highly efficient methods for data streams.

The process of data streams learning.
(1) Incremental Learning
Incremental learning receives and integrates new instances without performing a complete learning phase from scratch. In this approach, the model gradually evolves to adapt to changes in the incoming data. There are two ways to update the model: update by data instance and update by window. Incremental algorithms must learn from data faster than batch learning algorithms, so they must have very low time complexity. Existing algorithms can be modified to learn incrementally; examples are the incremental versions of support vector machines, ISVM [19] and LASVM [20]. This online method incrementally selects a set of examples and learns from them with support vector machines. Ensemble learning algorithms also incorporate incremental learning, which makes ensemble classification better suited to data streams that arrive in time order. Classic algorithms include Learn++ [45] and heterogeneous Bagging++ [18].
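The two update modes mentioned above, by instance and by window, can be sketched with a nearest-centroid classifier whose class means are updated in place. The class `IncrementalCentroid` is a hypothetical illustration, not one of the cited algorithms; it only shows that new data can be folded in without retraining from scratch.

```python
class IncrementalCentroid:
    """Nearest-centroid classifier whose class means are updated in place."""

    def __init__(self):
        self.count = {}   # label -> number of instances seen
        self.mean = {}    # label -> running mean vector

    def update(self, x, y):
        """Update by data instance: fold one (x, y) pair into the running mean."""
        n = self.count.get(y, 0) + 1
        mu = self.mean.get(y, [0.0] * len(x))
        self.mean[y] = [m + (xi - m) / n for m, xi in zip(mu, x)]
        self.count[y] = n

    def update_window(self, window):
        """Update by window: fold a small batch of (x, y) pairs in one call."""
        for x, y in window:
            self.update(x, y)

    def predict(self, x):
        def sq_dist(mu):
            return sum((xi - m) ** 2 for xi, m in zip(x, mu))
        return min(self.mean, key=lambda label: sq_dist(self.mean[label]))

clf = IncrementalCentroid()
clf.update([0.0, 0.0], "a")
clf.update([2.0, 0.0], "a")
clf.update_window([([10.0, 10.0], "b"), ([12.0, 10.0], "b")])
```

Each update costs time proportional to the number of features and never revisits earlier instances, which is the property the paragraph above asks of incremental algorithms.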
(2) Online Learning
The main difference between online learning and incremental learning is that instances are obtained continuously from the data stream and can be processed only once, without being stored and reprocessed. The time complexity requirements are stricter than for incremental learning. Ideally, an online classifier on a data stream must process each instance in constant time, O(1). The goal is to learn and predict at least as fast as the data stream arrives. Because data streams must be processed in a timely manner and with limited memory, online learning is widely used in data stream classification. For the online learning algorithms Online Bagging and Online Boosting [21], the authors compared the accuracy and processing time of the online technique with those of batch processing through experiments. The OAUE algorithm [2] combines a block-based mechanism with online techniques; it estimates classifier error only over a window of the most recent instances, using constant time and memory.
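The single-pass constraint can be sketched with an online perceptron that predicts for each instance, updates once, and then discards it. This is only an illustration under assumptions: the perceptron is a stand-in learner rather than any of the cited algorithms, and labels are assumed to be in {-1, +1}.

```python
def online_perceptron(stream, n_features, lr=1.0):
    """Process each (x, y) once; yields the prediction made before the update."""
    w = [0.0] * n_features
    b = 0.0
    for x, y in stream:                # the stream may be any iterator; nothing is stored
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        y_hat = 1 if score >= 0 else -1
        yield y_hat
        if y_hat != y:                 # mistake-driven update, cost independent of stream length
            w = [wi + lr * y * xi for wi, xi in zip(w, x)]
            b += lr * y

stream = iter([([1.0], 1), ([-1.0], -1)] * 5)
preds = list(online_perceptron(stream, n_features=1))
```

The per-instance cost depends only on the number of features, so it stays constant however long the stream runs.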
(3) Chunk Learning
Learning instances arrive continuously in the form of data blocks, $S = \{B_1 \cup B_2 \cup \ldots \cup B_n\}$. Blocks are usually of equal size, and the construction, evaluation or updating of classifiers is done when all examples from a new block are available. This distinction may be connected with supervised or semi-supervised frameworks. For instance, in some problems data items are more naturally accumulated for some time and labeled in blocks, while access to class labels in an online setup is more demanding. Block-based learning is also a common training method for data stream classification. The AWE algorithm [22] trains a classifier on each consecutive data block and replaces the worst classifier at each update. The Learn++.NSE algorithm [23], one of the extensions of Learn++, learns incrementally from blocks of fixed or varying size rather than by batch learning, and does not retain old training examples.
Data stream classification is a variant of the traditional supervised machine learning classification task. The main difference is that, in streaming scenarios, instances are not readily available as a large static data set. Instead, instances arrive quickly and sequentially over time as a continuous stream. A data stream classifier must therefore be prepared to handle a large number of instances, each of which can be inspected only once or stored only for a short period of time. This section gives a brief summary of classification algorithms, from single-model classification to ensemble models, with a focus on ensemble models for data streams in non-static settings. Fig. 4 compares traditional classification with data stream classification and lists common single classifiers and ensemble classifiers.
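Block-based training with replacement of the worst member can be sketched as follows. This is a simplified illustration and not the AWE algorithm itself: the majority-label base learner, the plain accuracy score and the names `blocks`, `train_majority` and `block_ensemble` are all assumptions made for the example.

```python
from itertools import islice

def blocks(stream, size):
    """Cut an iterator into consecutive blocks B_1, B_2, ... of at most `size` items."""
    it = iter(stream)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block

def train_majority(block):
    """Toy base learner: always predicts the majority label of its training block."""
    labels = [y for _, y in block]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

def accuracy(model, block):
    return sum(model(x) == y for x, y in block) / len(block)

def block_ensemble(stream, size, capacity):
    """Keep at most `capacity` models; when full, a new block's model replaces the worst one."""
    ensemble = []
    for block in blocks(stream, size):
        candidate = train_majority(block)
        if len(ensemble) < capacity:
            ensemble.append(candidate)
        else:
            # Score the current members on the newest block and drop the weakest.
            scores = [accuracy(m, block) for m in ensemble]
            ensemble[scores.index(min(scores))] = candidate
    return ensemble

data = [(i, 0) for i in range(6)] + [(i, 1) for i in range(6)]
ensemble = block_ensemble(data, size=3, capacity=2)
```

In the usage above the label changes from 0 to 1 halfway through, and after four blocks both surviving members predict the new label.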

Traditional algorithms and data streams algorithms.
(1) Single Classification Model
In the classification process, a single classifier recursively updates its structure with newly arrived data. A single classifier is also used as the base classifier of an ensemble model. Here, we discuss the commonly used single-model classifiers.
In single-model data stream classification, decision trees, Naïve Bayes and SVM are the classifiers most often used by researchers. The VFDT algorithm [24] is considered the reference decision tree classifier for mining large-scale data streams. In VFDT, the tree is constructed incrementally and does not retain any instances. Blanco et al. [25] proposed the online hybrid tree (OHyT), which uses two different splitting methods to construct a decision tree; both methods showed good performance in experiments. For Bayesian data stream classification, Krawczyk et al. [26] proposed the Bayes-based WNB-CD algorithm, which weights the samples of incoming data blocks, together with two methods for calculating the weights. Recently, Hemalatha et al. [27] proposed IFDB, a hybrid decision tree algorithm based on incremental flexible naive Bayes, which uses kernel density estimation to model the continuous attributes at leaf nodes to improve class prediction accuracy. Because support vector methods can handle very large data sets, Tsang et al. [28] proposed the Core Vector Machine (CVM), a new method that scales up kernel methods using the concept of the minimum enclosing ball. Rai et al. [29] proposed the StreamSVM algorithm, a streaming model of the classic SVM, but it is only used to process the data stream and cannot detect concept drift in it.
In data stream classification, KNN and neural networks are receiving increasing attention as a basis for new mining methods. KNN is a lazy learning classification algorithm based on a distance metric. Law et al. [30] proposed ANNCAD, an incremental classification algorithm that uses a multi-resolution data representation to find the adaptive nearest neighbors of a test point. ANNCAD updates quickly, and its exponential forgetting effectively adapts to concept drift, which makes it well suited to mining data streams. For complex data streams, Roseberry et al. [31] proposed ML-SAM-KNN, a multi-label classifier that uses self-adjusting memory for drifting data streams. Short-term memory is used to store recent examples, while long-term memory is used to remember historical and unusual data. For neural networks, Leite et al. [32] proposed the Evolutionary Granular Neural Network (eGNN) algorithm, which can deal with abrupt and gradual changes in typical non-stationary environments. eGNN uses fuzzy neurons to build interpretable multi-scale local models for information fusion.
(2) Ensemble Classification Model
A single model usually cannot classify a fast data stream reliably. Combining base classifiers makes the model more robust. An ensemble model uses various combination strategies to predict and classify instances, and it has advantages in processing various complex types of data streams. Current ensemble methods follow three main frameworks, Bagging, Boosting and Stacking, which are the most used structures in ensemble algorithms. Because of their excellent performance, they have received extensive attention from researchers. These three structures and related algorithms are introduced in detail below.
(a) Bagging Ensemble
Bagging is a simple and effective method for building an ensemble of independently trained models. Bootstrap sampling is used to obtain the training examples, and voting is usually used to predict. Researchers continue to explore bagging and have proposed many algorithms based on it. Hothorn et al. [33] proposed double-bagging. In this algorithm, out-of-bag examples are used to train two classifiers in each iteration. The authors recommend this method for dealing with variable problems and bias selection methods. Oza et al. [21] proposed the online version of bagging to illustrate that online processing is superior to batch processing in terms of accuracy and runtime. In the field of facial recognition, bagging is used to combine Extreme Learning Machines (ELM) [34] to improve recognition performance. Random Forest is also an ensemble model under an independent framework. It uses unpruned decision trees constructed by the CART algorithm as base classifiers and combines bagging with random feature selection.
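The two ingredients named above, bootstrap sampling and voting, can be sketched in a few lines. The helper names and the one-dimensional 1-nearest-neighbour base learner are hypothetical choices for illustration, not taken from any of the cited papers.

```python
import random
from collections import Counter

def bootstrap(data, rng):
    """Draw len(data) examples from `data` with replacement."""
    return [rng.choice(data) for _ in data]

def train_1nn(sample):
    """Toy base learner: 1-nearest-neighbour on one-dimensional inputs."""
    return lambda x: min(sample, key=lambda pair: abs(pair[0] - x))[1]

def bagging_fit(data, train, n_models, seed=0):
    """Train each model on its own bootstrap replicate of the data."""
    rng = random.Random(seed)
    return [train(bootstrap(data, rng)) for _ in range(n_models)]

def majority_vote(models, x):
    """Unweighted vote over the members' predictions."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]

data = [(0.0, "a"), (1.0, "a"), (2.0, "a"), (10.0, "b"), (11.0, "b"), (12.0, "b")]
models = bagging_fit(data, train_1nn, n_models=5)
label = majority_vote(models, 0.5)
```

Because every replicate is drawn independently, the members can be trained in parallel, which is one reason bagging is described as an independent framework.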
(b) Boosting Ensemble
Boosting is a method for improving the accuracy of any given learning algorithm, and AdaBoost is the most representative algorithm in the Boosting family. The main idea of AdaBoost is to focus on instances that were misclassified during previous training. The degree of attention is determined by the weight assigned to each instance in the training set. As research on AdaBoost has continued, a number of classic algorithms in the boosting family have been derived from it. In the field of image segmentation, Avidan et al. [35] proposed the SpatialBoost algorithm based on AdaBoost. Results on synthetic and real images show that it is superior to AdaBoost when spatial reasoning is involved. Cortes et al. [42] proposed DeepBoost, which can use deep decision trees as base classifiers and achieves high accuracy without overfitting the data. In [37, 38], the authors developed boosting-based General Regression Neural Network (GRNN) ensembles. Tkachenko et al. [36] developed a new ensemble of neural network tools to improve the accuracy of recovering and predicting missing IoT data. It consists of two successive general regression neural networks (GRNN) and a successive geometric transformation model (SGTM) neural-like structure. The SGTM neural-like structure supplements the two consecutively connected GRNNs to improve the accuracy of the prediction results. In [37], missing data management tasks in intelligent systems are considered. A recovery method for predicting partially or completely lost data, based on the extended-input SGTM neural-like structure, is proposed to improve the ensemble of two GRNNs. The accuracy of the weighted summation of the output displacement of the GRNN network is improved.
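How AdaBoost shifts attention to misclassified instances can be shown with one reweighting round. The sketch below uses the standard binary AdaBoost update for labels in {-1, +1}; the function name `adaboost_round` is hypothetical, and the weighted error is assumed to lie strictly between 0 and 1.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost reweighting step for labels in {-1, +1}.

    Returns (alpha, new_weights): the classifier's vote and the normalized instance weights.
    """
    err = sum(w for w, y, h in zip(weights, y_true, y_pred) if y != h)
    alpha = 0.5 * math.log((1.0 - err) / err)          # assumes 0 < err < 1
    # Correct predictions (y * h = +1) shrink, mistakes (y * h = -1) grow.
    raw = [w * math.exp(-alpha * y * h) for w, y, h in zip(weights, y_true, y_pred)]
    total = sum(raw)
    return alpha, [r / total for r in raw]

alpha, new_weights = adaboost_round([0.25] * 4, [1, 1, -1, -1], [1, 1, -1, 1])
```

In this usage one of four equally weighted instances is misclassified, so after the round it carries half of the total weight and the next base classifier is pushed to get it right.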
(c) Stacking Ensemble
Another well-known ensemble method is Stacking, which is based on meta-learning. The model uses the following steps: (a) Zero-level data. All base learners run on the original data set. (b) First-level data. The predictions made by the zero-level classifiers are treated as new data. (c) Final prediction. Another learning process uses the first-level data as its input to obtain the final prediction. Ortiz-Díaz et al. [38] proposed the Fast Adaptive Stacking of Ensembles-Active Learning (FASE-AL) algorithm, which uses active learning to generalize classification models with unlabeled instances. FASE is a stacking ensemble algorithm used to detect concept drift in the input data stream and adapt the model to it. Ding et al. [39] proposed a new stacking algorithm that defines cross entropy as the loss function of the classification problem. The training process uses a neural network with stochastic gradient descent. Izonin et al. [40] describe a prediction method using a new stacking-based GRNN ensemble model. Each member of the developed ensemble processes its own data set, in which the vectors of the original data set are randomly shifted relative to the current point. The authors chose the SGTM neural-like structure as the meta-algorithm that forms the result of the ensemble. Yang et al. [41] proposed a new method that combines multiple heterogeneous pre-trained deep convolutional neural network models (P-DCNN) with stacking and achieves high accuracy in the automatic recognition of space targets in inverse synthetic aperture radar (ISAR) images with small sample sets.
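Steps (a) to (c) can be sketched with already trained base models and a very simple meta-learner. Everything here is an assumption made for illustration: the base models are plain threshold functions, the meta-learner is a lookup table from prediction patterns to labels, and in practice the first-level data is usually built from held-out predictions.

```python
from collections import Counter, defaultdict

def level_one_data(base_models, data):
    """Step (b): turn each (x, y) into (tuple of base predictions, y)."""
    return [(tuple(m(x) for m in base_models), y) for x, y in data]

def train_meta(level_one):
    """Step (c): toy meta-learner mapping each prediction pattern to its most common label."""
    table = defaultdict(Counter)
    for pattern, y in level_one:
        table[pattern][y] += 1
    lookup = {p: c.most_common(1)[0][0] for p, c in table.items()}
    fallback = Counter(y for _, y in level_one).most_common(1)[0][0]
    return lambda pattern: lookup.get(pattern, fallback)

def stacked_predict(base_models, meta, x):
    return meta(tuple(m(x) for m in base_models))

# Step (a): two zero-level models, assumed to be trained beforehand.
base_models = [lambda x: int(x > 0), lambda x: int(x > 5)]
data = [(-1, 0), (-2, 0), (1, 1), (3, 1), (7, 0), (9, 0)]
meta = train_meta(level_one_data(base_models, data))
```

Neither base model alone separates the middle interval from the rest, but the meta-learner does, because it learns which combinations of base predictions go with which label.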
(d) Combination strategy
Combining the predictions of the ensemble members can improve overall performance. With an appropriate combination method, accurate predictions can be made even for instances that are difficult to classify. However, although much effort has gone into generating diverse sets of classifiers, less is known about how to combine their outputs. Here we mainly introduce the common weighting strategies and the more complex meta-learning methods.
The research methodology used in this review is PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) [44], which makes our review more rigorous and logical.
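A weighting strategy of the kind mentioned above can be sketched as a weighted vote, in which each member's prediction counts in proportion to its weight. The function name `weighted_vote` and the example weights are hypothetical; how the weights are obtained (for example, from recent accuracy) is left open.

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Return the label with the largest total weight among the members' predictions."""
    totals = defaultdict(float)
    for label, w in zip(predictions, weights):
        totals[label] += w
    return max(totals, key=totals.get)

strong_minority = weighted_vote(["a", "b", "b"], [0.7, 0.2, 0.2])
equal_weights = weighted_vote(["a", "b", "b"], [1.0, 1.0, 1.0])
```

With equal weights this reduces to a plain majority vote, while unequal weights let one reliable member outvote several weak ones.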
(1) The PRISMA Protocol
PRISMA is used to summarize the reported results of the existing literature in a specific research field. The protocol includes a four-stage flow chart (identification, screening, eligibility and inclusion), which helps to ensure consistent and accurate reporting of all types of systematic reviews and meta-analysis studies [44]. Using this protocol helps keep the conclusions and results free from review bias, although most reviews may still be affected by selective reporting of results. In addition, by defining relevant query criteria, sources can be searched with eligibility criteria that exclude irrelevant articles.
The four-stage flow chart first determines the sources of the articles and then filters duplicate and irrelevant articles out of the search results based on title and abstract. Full-text papers are then excluded according to the eligibility criteria, leaving the included studies for further analysis.
(2) Inclusion Criteria and Database
Eligibility criteria are indispensable in assessing the effectiveness, applicability and comprehensiveness of the review [44]. It is worth noting that the design of exclusion criteria is incremental. For example, if an article is excluded by exclusion criterion 1, the article will be automatically excluded without further verification of additional exclusion criteria.
“Ensemble learning” or equivalent expressions;
“Concept drift data streams” or equivalent expressions;
“Imbalanced data streams” or equivalent expressions;
“Multi-label data streams” or equivalent expressions;
“Multi-instance learning” or equivalent expressions;
The search strategy is to enter search strings based on IC-1, IC-2 and IC-3 in the following databases: IEEE Xplore and SpringerLink. These two databases are considered important and reliable sources of high-quality publications in computer science and engineering. The queries entered in the two databases returned a total of 36,645 potentially relevant articles. The five data stream types studied in this paper are steady, concept drift, imbalanced, multi-label and multi-instance data streams. 27,223 articles were removed by screening titles and abstracts, and the remaining 9,422 articles were evaluated in detail. Of these, 370 were duplicates and were removed. This left 9,052 studies, of which 8,902 were irrelevant according to the exclusion criteria (EC-1, EC-2, EC-3, EC-4). Finally, the remaining 150 articles were manually screened, and 31 ensemble classification algorithms for the corresponding data streams were selected. This review examines the performance of the algorithms in these 31 articles and summarizes the evaluation results. Figure 5 shows the PRISMA flow chart of the current systematic review, where N represents the total number of current papers and n represents the number of components.

PRISMA flowchart diagram for the current systematic review.
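The counts reported for the screening stages can be checked with simple arithmetic. The variable names below are hypothetical labels for the figures stated in the text.

```python
identified = 36645                 # articles returned by the database queries
removed_by_title_abstract = 27223  # removed when screening titles and abstracts
duplicates = 370                   # duplicate articles removed
excluded_by_criteria = 8902        # irrelevant according to the exclusion criteria

after_screening = identified - removed_by_title_abstract
after_dedup = after_screening - duplicates
manually_screened = after_dedup - excluded_by_criteria
print(after_screening, after_dedup, manually_screened)  # 9422 9052 150
```

The three differences match the 9,422, 9,052 and 150 articles stated for the successive stages.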
The ensemble learning method is well suited to data stream classification, because a single classifier cannot adapt well to a fast-arriving and changing data stream. By combining multiple classifiers, the model can adapt better to complex data streams. Among the problems of complex data streams, concept drift is currently the most studied. Ensemble learning uses different implementation strategies for different complex data streams, relying on key techniques such as online learning, block-based learning and incremental learning to make the model fit better. Table 6 summarizes the performance comparison of ensemble algorithms for processing complex data streams.
Steady data streams
Steady data streams are the most stable of all data streams. They do not involve concept drift, class imbalance, multi-labeling or other such issues. They are no longer a research hotspot, but they are covered here for completeness, and some classic algorithms from this early line of research remain worthy of study and discussion. Ensemble learning for steady-state data streams is divided into block-based learning and online learning. The specific algorithms are introduced below.
Method based on block learning
In a steady-state data stream, researchers study static data, and the data is trained in batches or blocks. Compared with online learning, block-based algorithms have been studied less. This section reviews the classic algorithms.
The early classic Learn++ [45] is an ensemble algorithm based on neural network classifiers and has served as a baseline in many experimental comparisons. The algorithm builds a new neural network model on each incoming data block as an incremental training set and then makes predictions by majority voting. Minku et al. [46] introduced Negative Correlation Learning (NCL), a successful technique for constructing neural network ensembles, into incremental learning. The authors proposed two models, fixed-size NCL and growing NCL (GNCL), to explore how much knowledge they forget. Ensemble classifiers are a main direction of incremental learning research, and there are many ensemble incremental learning methods. Among them, Learn++, which is derived from AdaBoost, is notable. Based on an analysis of the advantages and disadvantages of Learn++, the heterogeneous Bagging++ algorithm [47] is an incremental ensemble algorithm that uses bagging to build new models from each incoming data block. Experiments show that the algorithm is superior to Learn++ and to negative correlation learning algorithms. Kidera et al. [48] proposed an incremental ensemble model based on the AdaBoost.M1 algorithm. The model uses as its classifier the Resource Allocation Network with Long-Term Memory (RAN-LTM), which was developed to achieve stable incremental learning. The authors also propose a new weighted majority voting method to update the weights of the classifiers.
Method based on online learning
Online learning is a commonly used technique in ensemble learning and is widely studied for both static and non-static data. An online learning algorithm processes each arriving instance once, without storing it for reprocessing, and can therefore respond faster in data stream classification.
Online techniques are widely applied to steady data. Batch bagging requires that the entire training set be immediately available and, in some cases, that the data be randomly accessible. The OzaBag algorithm [21] solves this problem. It updates each base classifier with k copies of each newly arrived instance, where k follows a Poisson(1) distribution, and classifies by unweighted voting; it is equivalent to the batch version in terms of accuracy. The OzaBoost ensemble algorithm [21] maintains a fixed-size set of classifiers. These classifiers are trained on the received instances, and each new instance updates the classifiers in sequence. An instance that was misclassified by the previous classifier has its weight increased so that the next classifier pays more attention to it. When a classifier misclassifies an instance, the Poisson distribution parameter λ associated with the instance is multiplied by 1/(2ɛ); when the instance is correctly classified, λ is multiplied by 1/(2(1 − ɛ)), where ɛ is the classification error of the base model.
Random Forests (RFs) are used in many computer vision and machine learning applications. In most applications, however, RFs are used offline. The Online Random Forest (ORF) algorithm [49] combines OzaBag, random feature selection and a new tree growing method. It allows decision trees to be built online, and the authors also added a temporal weighting scheme. Bifet et al. [3] proposed LevBag, which randomizes the weights of input stream instances to improve the accuracy of the ensemble. It combines the simplicity of bagging with additional randomness in the input and output of the classifiers. Based on the sequential learning mechanism, Zhai et al. [50] proposed an efficient online extreme learning machine ensemble algorithm, the Ensemble of Online Extreme Learning Machine (E-OS-ELM), for large data set classification. The algorithm trains the base classifiers with the online sequential extreme learning machine (OS-ELM) algorithm and uses a simple voting strategy to combine the trained base classifiers.
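The Poisson-based updates described above can be sketched as follows. This is an illustrative sketch, not the authors' code: the `learn(x, y)` method on base models is a hypothetical interface, Poisson sampling uses Knuth's multiplication method, and the rescaling follows the usual formulation of online boosting, in which λ is divided by 2ɛ after a mistake and by 2(1 − ɛ) after a correct prediction.

```python
import math
import random

def poisson(lam, rng):
    """Sample from Poisson(lam) with Knuth's multiplication method."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def online_bagging_update(models, x, y, rng):
    """OzaBag-style step: each base model sees (x, y) k times, with k ~ Poisson(1)."""
    for model in models:
        for _ in range(poisson(1.0, rng)):
            model.learn(x, y)   # hypothetical incremental-learner interface

def ozaboost_lambda(lam, error, correct):
    """Rescale an instance's Poisson parameter before it reaches the next base model."""
    if correct:
        return lam / (2.0 * (1.0 - error))
    return lam / (2.0 * error)

after_mistake = ozaboost_lambda(1.0, 0.25, correct=False)
after_hit = ozaboost_lambda(1.0, 0.25, correct=True)
```

With an error of 0.25, a misclassified instance doubles its λ while a correctly classified one shrinks to two thirds, so later base models see the hard instance more often.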
Chapter summary
The above two subsections review the classic ensemble classification algorithms for steady data streams, which are based on block learning and online learning. Although steady data streams have gradually ceased to be a focus of research, the ideas behind these algorithms are still worth discussing. From the algorithms introduced above, we selected several classic ones for a quantitative analysis of performance. Table 1 (experimental data from the respective papers) shows that, on the same data set, the block-based Bagging++ algorithm is superior to GNCL in terms of accuracy and running time. Likewise, the online LevBag algorithm [3] randomizes the weights of the input stream, which improves the accuracy of the ensemble classifier, and it takes less time than the OzaBag algorithm [21].
Algorithms comparison
Another major challenge in dealing with data in a dynamic environment is concept drift. Drift occurs when the concept underlying the collected data changes after a period of stability. Research on concept drift has grown continuously over the past ten years. This section discusses concept drift data streams from the perspective of ensemble learning, covering block-based, online and other recent techniques for handling drift.
Method based on block learning
The idea of a block-based ensemble is to divide the incoming data stream into data blocks, use each block to update the base classifiers, and replace the weakest classifier with a new one. A combination strategy is then used to predict the result. The block-based approach is a common way to adapt to concept drift.
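The block-based scheme described above can be sketched as follows. This is a generic illustration under stated assumptions rather than any specific published algorithm: members are weighted by their accuracy on the newest block, the newly trained candidate receives weight 1.0, and the hypothetical `MajorityLearner` stands in for a real base classifier.

```python
from collections import Counter, defaultdict


class MajorityLearner:
    """Hypothetical base learner: predicts the most frequent label of its block."""

    def fit(self, block):
        self.label = Counter(y for _, y in block).most_common(1)[0][0]
        return self

    def predict(self, x):
        return self.label


class BlockEnsemble:
    def __init__(self, block_size=50, max_members=5, make_learner=MajorityLearner):
        self.block_size = block_size
        self.max_members = max_members
        self.make_learner = make_learner
        self.members = []  # list of [learner, weight] pairs
        self.buffer = []   # instances of the block currently being filled

    def learn_one(self, x, y):
        self.buffer.append((x, y))
        if len(self.buffer) == self.block_size:
            self._process_block(self.buffer)
            self.buffer = []

    def _process_block(self, block):
        # 1. Re-weight existing members by their accuracy on the newest block.
        for member in self.members:
            hits = sum(member[0].predict(x) == y for x, y in block)
            member[1] = hits / len(block)
        # 2. Train a candidate on the newest block (assumed weight: 1.0).
        candidate = [self.make_learner().fit(block), 1.0]
        # 3. Add the candidate, replacing the weakest member if the ensemble is full.
        if len(self.members) < self.max_members:
            self.members.append(candidate)
        else:
            weakest = min(range(len(self.members)), key=lambda i: self.members[i][1])
            self.members[weakest] = candidate

    def predict_one(self, x):
        # Weighted vote over the current members.
        votes = defaultdict(float)
        for learner, weight in self.members:
            votes[learner.predict(x)] += weight
        return max(votes, key=votes.get) if votes else None
```

Because members trained on an old concept score poorly on blocks drawn from a new one, their votes fade after a drift and they are the first to be replaced.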
Wang et al. [51] proposed the Accuracy Weighted Ensemble (AWE) algorithm for classifier ensembles. It can adapt to potential concept drift but has problems with gradual drift. To improve on it, Deckert et al. [52] proposed the Batch Weighted Ensemble (BWE), a concept drift processing algorithm for sudden and gradual drift in labeled data with decision attributes. The learning examples are processed in blocks of the same size, and a drift detector is introduced into the ensemble framework, which yields better accuracy than the standard Accuracy Weighted Ensemble classifier. Brzezinski et al. [53] proposed the block-based Accuracy Updated Ensemble (AUE2) algorithm, which aims to react to different types of concept drift. Its main novelty is to combine the ensemble weighting mechanism with incremental training of the component classifiers. Unlike the linear function used in AWE, the weighting function in AUE2 is nonlinear and is given by Equations (2)–(4).
Where ωij is the weight of the base classifier and the function
Some current methods update the ensemble in inappropriate ways while processing a continuous stream, for example by letting the ensemble grow without bound or by deleting all of its base learners when trying to overcome concept drift. Bertini Junior et al. [54] proposed the Iterative Boosting Streaming Ensemble (IBS) algorithm, which applies boosting to data blocks and maintains the ensemble by adding a certain number of base learners. Elwell et al. [23] proposed the Learn++ for Non-Stationary Environments (Learn++.NSE) algorithm, inspired by boosting. It learns from consecutive blocks of data without making any assumptions about the nature or rate of drift. It can learn in environments with constant or variable drift rates, with the addition or deletion of concept classes, and with periodic drift. Like the other members of the Learn++ family, the method learns incrementally and does not need access to previously seen data.
Describing concept drift in traditional data stream mining requires labeled samples, but attaching labels to streaming transactions is not feasible in terms of time and resource utilization. Abdualrhman et al. [55] proposed a deterministic concept drift detection (DCDD) method based on ensemble classifiers, in which the ensemble is built with AdaBoost. This drift detection method can describe concept drift regardless of the labels assigned to the samples. It achieves a high drift detection rate with low false alarm and miss rates, and it detects concept drift in a scalable manner.
Online ensemble algorithms are designed for environments in which the class label becomes available after each instance, rather than for learning in blocks. Because class labels arrive online, an online algorithm can respond to concept drift more quickly than a block-based one. The model is revised continuously through sequential learning.
Weighting has always been a key research topic, and weighting methods have been continuously refined to adapt to concept drift. Kolter et al. [5] proposed the Dynamic Weighted Majority (DWM) algorithm, which maintains a weighted pool of base classifiers. It adds or removes base classifiers according to the global performance of the algorithm. Each base classifier has a weight; if a base classifier makes a wrong prediction, its weight is reduced by a multiplicative factor β. Experiments show that the ensemble keeps learning well in environments with concept drift. OAUE [2] is a classic algorithm that combines the block-based weighting mechanism with online learning. It converts the block-based ensemble into an online learner that trains on each incoming instance and weights the classifiers, and its innovation is an efficient weighting function for the base classifiers. More recently, Aancy et al. [58] proposed a dynamic streaming data mining method based on adaptive online learning rules (SOAR). SOAR integrates the basic characteristics of block-based and online ensembles, updates the weight of each classifier based on its quality, and uses adaptive windows to deal with gradual and sudden concept drift.
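The DWM weight update described above can be sketched as follows. This is a minimal illustration, not the reference implementation of [5]: the removal threshold `theta`, the update `period` and the hypothetical `MajorityExpert` base learner are assumptions made for the example.

```python
from collections import Counter, defaultdict


class MajorityExpert:
    """Hypothetical incremental expert: predicts the label it has seen most often."""

    def __init__(self):
        self.counts = Counter()

    def learn_one(self, x, y):
        self.counts[y] += 1

    def predict_one(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None


class DynamicWeightedMajority:
    def __init__(self, beta=0.5, theta=0.01, period=1, make_expert=MajorityExpert):
        self.beta = beta          # multiplicative penalty for a wrong expert
        self.theta = theta        # experts whose weight falls below this are removed
        self.period = period      # weights and pool change every `period` instances
        self.make_expert = make_expert
        self.experts = [make_expert()]
        self.weights = [1.0]
        self.seen = 0

    def predict_one(self, x):
        votes = defaultdict(float)
        for expert, weight in zip(self.experts, self.weights):
            votes[expert.predict_one(x)] += weight
        return max(votes, key=votes.get)

    def learn_one(self, x, y):
        self.seen += 1
        update = self.seen % self.period == 0
        votes = defaultdict(float)
        for i, expert in enumerate(self.experts):
            guess = expert.predict_one(x)
            if update and guess != y:
                self.weights[i] *= self.beta       # penalise the wrong expert
            votes[guess] += self.weights[i]
        global_guess = max(votes, key=votes.get)
        if update:
            top = max(self.weights)                # rescale so the best weight is 1
            self.weights = [w / top for w in self.weights]
            keep = [i for i, w in enumerate(self.weights) if w >= self.theta]
            self.experts = [self.experts[i] for i in keep]
            self.weights = [self.weights[i] for i in keep]
            if global_guess != y:                  # the ensemble as a whole was wrong
                self.experts.append(self.make_expert())
                self.weights.append(1.0)
        for expert in self.experts:
            expert.learn_one(x, y)
```

After a sudden drift, the experts trained on the old concept are penalised on every instance until they drop below `theta`, while freshly added experts take over the vote.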
The Diversity for Dealing with Drift (DDD) algorithm was proposed by Minku et al. [56], who observed that, depending on the type of drift, ensembles with different levels of diversity achieve the best prequential accuracy. DDD maintains ensembles with different levels of diversity and exploits this diversity to deal with drift. It also uses information from old concepts to help learn new concepts. Diversified Online Ensembles Detection (DOED) was proposed by Sidhu et al. [57]. This method maintains two ensembles of weighted classifiers, one with low diversity and one with high diversity. The classifiers are updated according to their classification accuracy on new data instances. In recent years, some researchers have combined dynamic weighting and diversity to improve accuracy further. Sidhu et al. [59] proposed an online ensemble method, the Diversity Dynamic Weighted Majority (DDWM) algorithm, for data streams with different concept distributions. This method maintains two sets of weighted ensembles with different levels of diversity. An ensemble member is updated or deleted based on its classification accuracy, and a new classifier is added based on the final global prediction of the algorithm and the global predictions of the ensembles for the instance.
Chapter summary
Concept drift in data streams is currently the most widely studied research direction, and ensemble learning is a powerful tool for dealing with it. The subsections above summarize ensemble classification algorithms for concept drift from block-based and online perspectives. We analyze the robustness of the following classic algorithms through their experimental indicators. Because AUE2 [53] combines the ensemble weighting mechanism of AWE [51] with incremental training of the component classifiers, Table 2 shows that AUE2 improves classification accuracy and reduces running time compared with AWE. The online OAUE algorithm [2] introduces a new weighting function that improves accuracy compared with DWM [5], but at the cost of increased running time and space complexity.
Algorithms comparison
In addition to concept drift, non-stationary data is also affected by the complexity of the data itself. A frequently encountered case is class imbalance, whose main feature is that the number of samples in one class is significantly larger than in the other classes. Class imbalance is an obstacle even for learning from static data, because the classifier is biased towards the majority class and tends to misclassify instances of the minority class.
Many methods have been proposed in the literature for learning from imbalanced data, such as data sampling methods, cost-sensitive algorithms, one-class classifiers and ensemble classification methods. This section focuses on handling imbalanced data with ensemble learning methods; data preprocessing and cost-sensitive algorithms are discussed in conjunction with ensemble learning.
Data preprocessing method
For imbalanced data, researchers aim to reduce the degree of imbalance between the minority and majority classes at the data level, and data resampling is a representative way to obtain a more uniform class distribution. Resampling methods mainly comprise under-sampling, over-sampling and hybrid sampling, which combines the two. Under-sampling removes some instances from the majority class, and over-sampling adds instances to the minority class.
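The two basic resampling operations described above can be sketched as follows. This is a minimal illustration that balances every class to the size of the smallest (under-sampling) or largest (over-sampling) class; the function names are chosen here for the example and do not come from any particular library.

```python
import random
from collections import defaultdict


def split_by_class(data):
    """Group (x, y) pairs by their label y."""
    groups = defaultdict(list)
    for x, y in data:
        groups[y].append((x, y))
    return groups


def random_undersample(data, rng):
    """Keep a random subset of each class, sized like the smallest class."""
    groups = split_by_class(data)
    target = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))
    return balanced


def random_oversample(data, rng):
    """Duplicate random instances of each class up to the size of the largest class."""
    groups = split_by_class(data)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced
```

Under-sampling discards majority-class information, while over-sampling repeats minority instances and can encourage overfitting, which is one motivation for synthetic approaches.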
The Synthetic Minority Over-sampling Technique (SMOTE) algorithm [60] uses linear interpolation to create synthetic minority samples. In recent years, ensemble learning has been combined with SMOTE to improve classification accuracy. The SMOTEBoost algorithm proposed by Chawla et al. [6] exploits the advantages of both Boosting and SMOTE. Boosting improves the accuracy of the classifier by focusing on hard-to-learn examples, whereas SMOTE focuses only on minority-class instances. Embedding SMOTE into Boosting therefore makes the ensemble pay more attention to the minority class, while SMOTE also increases the number of minority-class instances.
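The linear interpolation used by SMOTE can be illustrated with a short sketch. It is a simplified version under stated assumptions: neighbours are found by brute-force Euclidean distance, the minority set must contain at least two points, and the function name `smote` and its parameters are chosen here for illustration.

```python
import random


def smote(minority, n_synthetic, k=3, rng=None):
    """Create synthetic minority points by interpolating towards near neighbours."""
    rng = rng or random.Random(0)

    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    synthetic = []
    for _ in range(n_synthetic):
        x = rng.choice(minority)
        # The k nearest minority neighbours of x (excluding x itself).
        others = [p for p in minority if p is not x]
        neighbours = sorted(others, key=lambda p: sq_dist(x, p))[:k]
        nn = rng.choice(neighbours)
        gap = rng.random()  # position on the segment between x and nn
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nn)])
    return synthetic
```

Each synthetic point lies on the line segment between a minority instance and one of its minority neighbours, so the new samples stay inside the region already occupied by the minority class.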
Wang et al. [61] improved the oversampling-based online bagging (OOB) and under-sampling-based online bagging (UOB) algorithms. The authors use resampling and a time-decayed metric to improve OOB and UOB so that they overcome the problem of online class imbalance. Finally, combining their respective advantages, they propose the weighted ensemble methods WEOB1 and WEOB2, which are both more accurate than OOB and more robust than UOB. Recently, Zyblewski et al. [63] proposed a framework that combines dynamic ensemble selection with preprocessing techniques to deal with highly imbalanced data streams. They also proposed a sampling method that uses a stratified bagging classifier, in which the minority and majority classes are sampled with replacement. Experiments show that combining dynamic ensemble selection with data preprocessing outperforms the latest online and block-based methods.
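One way to combine a time-decayed class-size estimate with online bagging is sketched below for the two-class case. This is an assumption-laden illustration and not the exact procedure of [61]: the decay factor, the names `ClassSizeTracker`, `oob_rate` and `uob_rate`, and the specific rate formulas (the ratio of the two class sizes) are choices made here and may differ from the published rules.

```python
class ClassSizeTracker:
    """Time-decayed estimate of each class's share of the stream."""

    def __init__(self, classes, decay=0.9):
        self.decay = decay
        self.size = {c: 0.0 for c in classes}

    def update(self, y):
        # Older observations are discounted by `decay` at every step.
        for c in self.size:
            hit = 1.0 if c == y else 0.0
            self.size[c] = self.decay * self.size[c] + (1.0 - self.decay) * hit


def oob_rate(size, y):
    """Oversampling variant: a Poisson rate above 1 for the current minority class."""
    other = sum(v for c, v in size.items() if c != y)
    if 0.0 < size[y] < other:
        return other / size[y]
    return 1.0


def uob_rate(size, y):
    """Under-sampling variant: a Poisson rate below 1 for the current majority class."""
    other = sum(v for c, v in size.items() if c != y)
    if size[y] > other:
        return other / size[y]
    return 1.0
```

The returned rate would replace the fixed parameter 1 of the Poisson distribution in online bagging, so minority instances are replicated more often (OOB) or majority instances less often (UOB) as the estimated imbalance changes over time.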
Concept drift also arises in imbalanced data streams, which is an emerging research topic. Ancy et al. [62] proposed the Handling Imbalanced Data with Concept Drift (HIDC) method, which focuses on data preprocessing and on the classification of stream data. Over-sampling and under-sampling techniques are used to deal with class imbalance. When drift occurs, an implicit weighting scheme is applied to the base classifiers to identify and replace the worst one. Zhang et al. [64] proposed the Resample-based Ensemble Framework for Drifting Imbalanced Stream (RE-DI). The framework consists of a long-term static classifier that handles gradual changes and multiple dynamic classifiers that handle sudden concept drift. A resampling buffer stores instances of the minority class to counter the imbalanced distribution.
Cost-sensitive learning method
Besides changing the data distribution to reduce the impact of class imbalance on classification, the problem can also be addressed at the algorithm level, mainly through cost-sensitive learning and ensemble learning. Misclassification costs for the different classes are introduced into learning, and a higher misclassification cost is assigned to the minority class to improve generalization on it. Cost-sensitive classification takes into account that different types of misclassification have different costs. The cost matrix encodes the penalty for classifying samples of one class as another. Let C(i, j) denote the cost of predicting an instance of class i as class j. A cost matrix for binary classification is defined in Table 3. In class-imbalanced problems, positive examples are considered more important than negative ones. Therefore, the cost of misclassifying a positive instance is greater than the cost of misclassifying a negative instance, namely C(+, -) > C(-, +), and a correct classification usually incurs no penalty. The cost-sensitive learning process then seeks to minimize the number of high-cost errors and the total misclassification cost.
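The role of the cost matrix can be illustrated with a minimum-expected-cost decision rule. The sketch below is an example under stated assumptions: the cost values 5 and 1 are hypothetical and chosen only to satisfy C(+, -) > C(-, +), and the posterior probabilities are assumed to come from some underlying classifier.

```python
# cost[i][j] is the cost of predicting class j for an instance whose true class is i.
# Hypothetical values: missing a positive is five times as costly as a false alarm.
COST = {
    "+": {"+": 0.0, "-": 5.0},
    "-": {"+": 1.0, "-": 0.0},
}


def expected_costs(posterior, cost):
    """Expected cost of each possible prediction, given class probabilities."""
    labels = list(posterior)
    return {j: sum(posterior[i] * cost[i][j] for i in labels) for j in labels}


def min_cost_label(posterior, cost):
    """Predict the label with the lowest expected misclassification cost."""
    costs = expected_costs(posterior, cost)
    return min(costs, key=costs.get)
```

With these costs, an instance with only a 0.3 probability of being positive is still predicted positive, because the expected cost of predicting "-" (0.3 × 5 = 1.5) exceeds that of predicting "+" (0.7 × 1 = 0.7).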
Confusion matrix
Sun et al. [65] combined cost-sensitive learning with the Boosting algorithm. The paper comprehensively analyzed the advantages and disadvantages of the AdaBoost algorithm in solving class imbalance problems and then explored three cost-sensitive boosting algorithms, namely AdaC1, AdaC2 and AdaC3, in which a cost term is introduced into the learning framework of AdaBoost. Further analysis shows that the proposed algorithms conform to stagewise additive modeling in statistics, minimizing the cost exponential loss. Tao et al. [66] proposed a cost-sensitive ensemble algorithm for support vector machines based on cost weights. To keep the optimization goal of the weak learner consistent with that of the boosting scheme, the authors not only use a cost-sensitive support vector machine as the weak learner but also modify the standard boosting scheme to be cost-sensitive. To ensure that successive classifiers receive more training examples, especially minority-class examples near the boundary, an adaptive sequential method for determining the misclassification cost weights is proposed.
Ensembles of neural networks for imbalanced data are also a hot research topic. Wong et al. [67] proposed two new cost-sensitive methods, namely the Cost-Sensitive Deep Neural Network (CSDNN) and the Cost-Sensitive Deep Neural Network Ensemble (CSDE), to solve the problem of class imbalance. CSDE is the ensemble learning version of CSDNN. To improve the generalization performance of CSDNN, CSDE uses random under-sampling and hierarchical feature extraction from the hidden layers of the deep neural network. Recently, Loezer et al. [68] proposed the cost-sensitive Adaptive Random Forest (CSARF) algorithm and compared it with the Adaptive Random Forest (ARF) and the Adaptive Random Forest with Resampling (ARFRE) on 6 real-world and 6 synthetic data sets. The results show that CSARF is better than ARF in average recall and average F1.
This chapter summarizes common methods for handling imbalanced data streams, which use data preprocessing and cost-sensitive algorithms to deal with the imbalanced classification problems that are common in practice. Table 4 includes the RE-DI algorithm, which uses an ensemble composed of a static classifier and multiple dynamic classifiers over sliding windows and relies on a new resampling method to process imbalanced data. The table shows that, on the same data sets, RE-DI outperforms the other three algorithms in terms of AUC and accuracy.
Algorithms comparison
In recent years, non-standard data has attracted increasing attention from machine learning researchers, who are focusing more on learning with multiple targets. A data source may carry multiple class labels, or each object may consist of multiple instances, so learning methods are needed that can adapt effectively to such examples. Although most current research focuses on static, non-streaming settings, some work addresses more complex data streams. This section summarizes recent progress on multi-label and multi-instance classification of non-standard data streams, together with the corresponding ensemble learning algorithms.
Multi-label data streams
Formally, the multi-label stream classification problem is to train a model that assigns a label subset Y ⊆ L to each instance in a high-speed data stream, where L is the label set. Although multi-label classification has been studied in traditional database mining scenarios, multi-label data stream classification is a relatively new problem that has not been fully solved. Multi-label classification methods are usually divided into two categories: (1) problem transformation and (2) algorithm adaptation. Problem transformation methods convert a multi-label problem into one or more single-label problems.
Read et al. [69] proposed the Ensemble of Classifier Chains (ECC) to deal with multi-label classification. This method builds on the binary relevance method and has lower computational complexity than the more complex methods in use. By passing label correlation information along the classifier chain, the authors overcome the shortcomings of the binary relevance method and obtain better prediction performance while keeping computational complexity low. Wang et al. [70] proposed the Streaming Weighting ML-KNN based Ensemble Classifier (SWMEC) method, an ensemble model for multi-label data stream classification based on ML-KNN. The weighted ensemble model is updated efficiently as the multi-label data stream arrives.
In recent years, applications such as text classification and web and social media mining have increasingly demanded multi-label classification algorithms. Nguyen et al. [71] introduced a scalable multi-label ensemble classification method (MVI) based on online variational inference, in which random projections are used to create the ensemble system. As a second-order generative method, the classifier can effectively exploit the underlying structure of the data during learning. Multi-label stream classification is also challenged by concept drift. Sun et al. [72] proposed the Multi-Label ensemble with Adaptive Windowing (MLAW) algorithm as an effective ensemble paradigm for multi-label data stream classification. The algorithm uses a novel change detector based on the Jensen-Shannon divergence to identify different types of concept drift in the data stream. In [73], a sliding window mechanism is also used to build an ensemble model based on the COINS algorithm, applying semi-supervised techniques to multi-label data stream classification. In addition, a new label emergence detection algorithm is proposed and applied to data stream classification to refine the model and improve classification accuracy.
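The Jensen-Shannon divergence mentioned above can be used to compare two windows of a stream. The sketch below is an illustration under stated assumptions and does not reproduce the detector of [72]: comparing label-frequency distributions, the helper names and the threshold of 0.1 are all choices made here for the example.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded in [0, 1] with log base 2."""
    m = [(a + b) / 2.0 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def label_distribution(window, n_labels):
    """Relative frequency of each label index over a window of label sets."""
    counts = [0] * n_labels
    for labels in window:
        for label in labels:
            counts[label] += 1
    total = sum(counts)
    if total == 0:
        return [1.0 / n_labels] * n_labels
    return [c / total for c in counts]


def drift_detected(reference, current, n_labels, threshold=0.1):
    """Flag a change when the two windows' label distributions diverge enough."""
    p = label_distribution(reference, n_labels)
    q = label_distribution(current, n_labels)
    return js_divergence(p, q) > threshold
```

Unlike the Kullback-Leibler divergence, the Jensen-Shannon divergence stays finite when a label appears in only one of the two windows, which makes it convenient for sparse multi-label data.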
Multi-instance data streams
In traditional supervised learning, each object is a single instance that is identified by one label, as shown in Fig. 6(a). In multi-instance learning, however, the data structure is more complicated. Each object or target variable is described by a variable number of feature vectors, as shown in Fig. 6(b). Multi-instance learning introduces the concept of a bag: a bag is a group of instances, and the associated class label belongs to the bag. Using ensemble learning methods to process multi-instance data is also a research direction, and such methods perform well on the evaluation indicators of classification tasks.

Single instance learning and Multi instance learning.
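The difference between the two settings can be made concrete with a small data-structure sketch. It relies on an assumption that the text above does not state: the standard multi-instance assumption, under which a bag is positive if at least one of its instances is positive. The names `Bag` and `predict_bag` and the threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Bag:
    """A multi-instance example: many feature vectors, one label for the whole bag."""
    instances: List[Sequence[float]]
    label: int


def predict_bag(instances: List[Sequence[float]],
                instance_score: Callable[[Sequence[float]], float],
                threshold: float = 0.5) -> int:
    """Label a bag positive if its highest-scoring instance reaches the threshold.

    This encodes the standard multi-instance assumption (an assumption here):
    one positive instance is enough to make the bag positive.
    """
    return int(max(instance_score(x) for x in instances) >= threshold)
```

In single-instance learning every feature vector would carry its own label; here only `Bag.label` is observed, and instance-level scores must be inferred by the learner.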
Bjerring et al. [74] proposed a randomized ensemble method, Multi-Instance Rule Induction (MIRI), as an improvement on the MITI algorithm [69]. MITI is a single decision tree learner designed specifically for multi-instance classification, and its splitting criterion has a significant impact on accuracy. Through modification, the algorithm is turned into a rule learner. Finally, the authors propose a randomization method that yields an ensemble model. Neural network-based algorithms have also been proposed in recent years, namely the multi-instance classification algorithm BP-MIP and the multi-instance regression algorithm BP-MIR. Zhang et al. [76] introduced neural network ensemble techniques to solve the multi-instance learning problem, constructing a BP-MIP ensemble and a BP-MIR ensemble. The multi-instance neural network ensembles outperform single multi-instance neural networks on multi-instance problems.
In image object detection, Babenko et al. [76] proposed an improved online boosting scheme, Online MILBoost (OMB). They assume that once a bag is labeled positive, all instances in it are also positive and can therefore be used for training. A streaming procedure similar to standard online boosting can then be used to update the ensemble with new models. For the same object detection task, Wang et al. [77] proposed a semi-supervised online weighted multi-instance learning ensemble method, Semi-Weighted Multi Instance Learning (Semi-WMIL). To make full use of the target and its surrounding background information, a new single-target tracker is proposed that combines semi-supervised learning, multi-instance learning and Bayes' theorem. In the parameter update stage of each frame, the tracker uses an inconsistency function of the block-based labeled and unlabeled training samples when selecting the optimal base classifier.
This section summarizes an emerging research direction, namely multi-label and multi-instance algorithms. The application fields involved are mainly image and video analysis. Because of the complex characteristics of the data sources, the evaluation indicators are also diverse. As Table 5 shows, the MLAW algorithm [72] is superior to the first three algorithms on the two indicators Hamming loss and subset accuracy, owing to its periodic change detection mechanism, which can reuse previous concepts.
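The two indicators named above, Hamming loss and subset accuracy, can be computed as in the following sketch. It assumes that true and predicted labels are given as sets of label indices; the function names are chosen here for illustration.

```python
def hamming_loss(y_true, y_pred, n_labels):
    """Fraction of individual label decisions that are wrong (lower is better)."""
    wrong = sum(len(set(t) ^ set(p)) for t, p in zip(y_true, y_pred))
    return wrong / (len(y_true) * n_labels)


def subset_accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label set matches exactly (higher is better)."""
    exact = sum(set(t) == set(p) for t, p in zip(y_true, y_pred))
    return exact / len(y_true)
```

For two instances with true label sets {0, 1} and {2}, predictions {0} and {2}, and three possible labels, one of six label decisions is wrong, so the Hamming loss is 1/6, while the subset accuracy is 0.5 because only the second prediction matches exactly. Subset accuracy is the stricter of the two measures.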
Algorithms comparison
Comparison of ensemble classification algorithms for complex data streams
Researchers have extended data stream classification to many fields in order to mine stream data and extract useful information for applications. As the Internet reaches all walks of life, the demand for classification tasks keeps growing. Stream mining tasks now include text classification, graph data classification and big data classification, which are advanced problems in data stream classification that researchers are actively studying. Because the representation, structure and scale of these data differ significantly, more complex algorithms are required for their analysis. This section reviews recent results for these types of data streams.
Text data streams
Text classification has always been a hot research direction in Natural Language Processing (NLP). Popular social platforms such as Twitter and Weibo analyze news content, social interactions and advertisements so that they can recommend more personalized services to users. Text stream classification, however, is still an emerging direction within text classification. Researchers need to take into account the continuous arrival of the data and the one-pass processing constraint. Current text stream classification methods are relatively diverse and include clustering methods, distributed processing of large-scale text streams and ensemble methods.
Trofimov et al. [78] analyzed the limitations, challenges and solutions of distributed text stream classification. Text stream classification is an important problem that is difficult to solve at large scale. Batch processing systems are widely used for text classification tasks, but they cannot provide low latency. Distributed stream processing systems can provide low latency, but they do not offer the same level of fault tolerance and determinism as batch processing systems. In this work, the authors demonstrate how the features of distributed stream processing affect the results of a typical text classification data stream. The analysis shows a trade-off between fault tolerance and reproducibility on the one hand and performance on the other.
Mining text streams with concept drift is a very challenging research topic, and detecting concept drift is a computationally complex problem. Song et al. [79] proposed a new ensemble model, Dynamic Cluster Forest (DCF), for text stream classification with concept drift. The model is based on multiple Clustering Trees (CTs) and has two novel strategies: (1) an adaptive ensemble strategy that dynamically selects discriminative CTs according to the inherent characteristics of the data stream, and (2) a dual voting strategy that simultaneously considers the reliability and accuracy of the classifiers. Yang et al. [80] proposed classifying text streams by keywords to reduce the burden of manual labeling. They use keywords and unlabeled documents to build a basic text classifier for the text stream and use an ensemble of classifiers to deal with concept drift in the text data stream. Most research on stream data classification assumes that the data can be fully labeled. In real applications, however, manually labeling the entire stream for training is impractical and time-consuming. In a data stream environment, it is common to have only a small portion of positive data and a large amount of unlabeled data. In this case, applying a traditional stream algorithm directly to the positive and unlabeled stream may perform poorly. Pan et al. [81] proposed a dynamic classifier ensemble method for the classification of positive and unlabeled text streams (DCEPU). In the classification stage, the problem is solved by constructing a suitable evaluation set and designing a new dynamic weighting scheme.
Traditional machine learning methods use features such as word frequency to represent documents in a vector space, which limits their ability to capture the order and semantics of words. Upadhyay et al. [82] proposed a weighted classifier ensemble algorithm to solve the text classification problem. When combining classifiers, this algorithm assigns a different weight to the prediction of each classifier instead of using majority voting. A particle swarm optimization algorithm minimizes the loss function on the training data to obtain the optimal weights. Samami et al. [83] proposed an optimal text classification method based on nature-inspired algorithms and ensemble classifiers. In this model, a biogeography-based optimization algorithm and an ensemble classifier are used for feature selection. Compared with a single classifier, using an ensemble classifier provides better text classification performance.
Graph data streams
The problem of graph classification arises in many classic fields, such as chemical and biological data, the web and communication networks. In recent years, with the emergence of Internet applications, interest in dynamic graph streaming applications has grown. Such network applications are defined over a very large number of nodes. Examples include: (1) the communication patterns of users in a social network within an appropriate time window can be decomposed into a set of disconnected graphs defined over a massive domain of user nodes; (2) a user's browsing pattern is typically a subgraph of the web graph. The most challenging situation in the classification of graph data streams is when the data is available only in the form of a graph stream. It is therefore necessary to detect changes in the data stream in order to classify graph data streams in real time.
A dynamic change in a label may indicate an important event or activity pattern. Aggarwal et al. [84] studied the differential classification problem in graph streams, in which the important classification events to be predicted are changes in the classification labels of nodes. Unlike static ensemble classification problems, this method focuses on detecting changes in node classification dynamically and in real time, instead of actually classifying nodes. The differential classification problem can also be considered a general form of the unsupervised event detection problem, in which node labels are used to supervise the detection process. The experimental results show the effectiveness of this method.
In many practical applications, graphs are defined over a very large set of nodes. The massive domain size of the underlying graph makes it difficult to learn summary structural information for the classification problem, especially in the case of data streams. To meet these challenges, Aggarwal [85] proposed a probabilistic method for constructing an “in-memory” summary of the underlying structural data. The summary uses a two-dimensional hashing scheme to determine the discriminative patterns in the underlying graph. The author provided probabilistic bounds on the quality of the patterns determined by the process and demonstrated the quality experimentally on several real data sets.
Class imbalance also arises in the classification of graph data streams. Pan et al. [86] proposed graph ensemble boosting, a classification model for imbalanced graph data with noise. It uses an ensemble-based framework that divides the graph stream into blocks, each containing a number of noisy graphs with an imbalanced class distribution. For each individual block, a boosting algorithm is proposed that combines discriminative subgraph pattern selection and model learning in a unified graph classification framework. To handle concept drift in the graph data stream, an instance-level weighting mechanism dynamically adjusts the instance weights, so that the boosting framework can emphasize difficult graph samples. The classifiers built from the different graph data blocks form an ensemble for the classification of graph data streams.
Fault classification is a common task in the chemical industry. Liu et al. [87] proposed an ensemble FDA model based on manifold-preserving sparse graphs. First, some mislabeled samples are filtered out using the manifold-preserving sparse graph while the basic manifold structure of the training samples is maintained. Second, to improve the robustness of the model to the remaining mislabeled samples, an FDA-based Bagging algorithm is used to construct multiple sub-classifiers, which are combined to form a robust classifier. Experiments demonstrated the robustness of the model.
Su et al. [88] proposed a new multi-label classification method that relies on ensemble learning over a set of random output graphs imposed on the multiple labels, with a kernel-based structured output learner as the base classifier. For ensemble learning, the differences between the output graphs provide the required diversity among the base classifiers, and performance improves as the ensemble size increases. The authors studied different methods of forming ensemble predictions, including majority voting and two methods that perform inference over the graph structure before or after merging the base models into the ensemble. The behavior of the multi-label ensemble is explained theoretically in terms of the diversity and consistency of the microlabel predictions, which generalizes previous research on single-target ensembles.
Big data streams
In recent years, big data has become a key field representing the modern information and digital world, and large amounts of data are generated at every moment. More recently, IBM has used the “5Vs” model to describe big data. Some classic mining algorithms need to pass over the entire data set multiple times. Although distributed data mining and data stream mining have each been studied extensively, there are relatively few studies on distributed data stream mining methods.
Big data streams not only deliver large amounts of data continuously but may also contain different types of features. In addition, concepts and features tend to evolve over time. Haque et al. [9] designed a multi-layer ensemble method called HSMiner to address these challenges when labeling instances in an ever-changing big data stream. Because HSMiner may face scalability issues when classifying big data streams, the authors propose three MapReduce-based parallelization approaches for building its large number of AdaBoost ensembles.
The Vertical Hoeffding Tree (VHT) algorithm [8] is the first distributed streaming algorithm for learning decision trees, and it can perform classification on large data streams arriving at high speed. VHT is a novel method for the vertical distribution of decision trees. The algorithm is implemented on Apache SAMOA, so it can run on real-world clusters. Detailed experiments and a comparison with the centralized sequential tree model show that VHT can handle dense and sparse instances with thousands of attributes with only a very small loss of accuracy. For large-scale imbalanced data, Lin et al. [89] proposed an ensemble random forest algorithm based on Apache Spark, which can be used for large-scale imbalanced classification of insurance business data. The experimental results show that the ensemble random forest algorithm is better suited to insurance product recommendation and potential customer analysis than traditional strong classifiers such as SVM and logistic regression. The proposed bootstrap under-sampling algorithm, combined with KNN, can serve as a preprocessing step for imbalanced classification algorithms. The ensemble learning algorithm combined with bootstrap sampling preprocessing can further reduce learning.
Also for large-scale imbalanced data, Mwangi et al. [90] proposed an ensemble model that combines RUSBoost with a cost-sensitive CNN. Combining RUS with boosting overcomes the limitations of the RUS algorithm on imbalanced data at large scale, while adding a cost layer to the convolutional neural network lets the classifier learn the costs automatically during training, thereby improving the performance of the model.
Real-time mining of ever-changing data streams faces new challenges: ever-increasing data volume, velocity and volatility require real-time processing and the ability to react quickly and adapt to change. Marron et al. [91] proposed a real-time data stream classification method based on a random forest ensemble of random Hoeffding trees, which uses the vector SIMD and multi-core capabilities of modern processors to provide the required throughput and accuracy.
The rise of the Internet of Things and other sensor networks has created many vertically distributed, high-speed data streams, which require specialized algorithms to achieve truly distributed data mining. Denham et al. [92] proposed the Hierarchical Distributed Stream Miner (HDSM), which can learn the relationships between the features of different data streams while transmitting a minimal amount of data to a central location. Experimental evaluation shows that it significantly improves classification accuracy compared with previously proposed distributed stream mining methods.
Complex data streams evaluation methods
The correct evaluation of classifiers and regression models is a key problem in machine learning; the evaluation process is shown in Fig. 7. Choosing correct and appropriate evaluation measures is an important part of any experiment, because only accurate measurements can demonstrate the advantages of a proposed algorithm. Performance evaluation is the last step of classification and an indispensable one. The quality of the data to be classified also affects the measured quality of the algorithm. The following summarizes the validation techniques that researchers commonly use for data stream classification, the performance evaluation methods specific to complex data streams, and the software platforms used.

Data evaluation process framework.
In static, batch learning, the most commonly used evaluation scheme is cross-validation. Because traditional data sets are limited in size, cross-validation is used to make fuller use of the data; its biggest disadvantage is that it is time-consuming. It is therefore not directly applicable in a streaming environment with strict computational requirements and concept drift. Although the number of stream learning algorithms continues to grow, the indicators and experimental designs for evaluating the quality of learned models are still open issues. A validation technique determines which examples are used to train the learning model and which are used to test it. This section summarizes the validation techniques widely used in data stream experiments in the literature.
where $E(|\mathrm{trueerror}_k - \mathrm{estimatederror}_k|^q)$ is no larger for the prequential k-fold strategy than for a prequential evaluation strategy.
Following the above discussion, the error of the prequential k-fold strategy is no larger than that of the plain prequential method. This evaluation method was used in the experimental validation of the adaptive random forest (ARF) algorithm proposed by Gomes et al. [95].
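The test-then-train principle behind prequential evaluation can be sketched as follows. This is a minimal illustration under stated assumptions: the `MajorityClassLearner` class, the `prequential_accuracy` function and the toy stream are hypothetical, and the k-fold variant (which would run several such learners in parallel on different parts of the stream) is not shown.

```python
from collections import Counter

class MajorityClassLearner:
    """Toy incremental learner (hypothetical): predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        # No prediction is possible before the first labeled example arrives.
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn(self, x, y):
        self.counts[y] += 1

def prequential_accuracy(stream, learner):
    """Each instance is used first to test the model and only then to train it."""
    correct = total = 0
    for x, y in stream:
        correct += int(learner.predict(x) == y)  # test first
        learner.learn(x, y)                      # then train
        total += 1
    return correct / total if total else 0.0

# Toy stream of (features, label) pairs; the features are ignored by this learner.
stream = [(i, label) for i, label in enumerate("aababaab")]
acc = prequential_accuracy(stream, MajorityClassLearner())  # 0.5 on this stream
```

Because every example is tested before it is learned, no separate hold-out set is needed and the estimate is updated as the stream evolves.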
As data stream mining attracts more attention, algorithms for complex data streams have been proposed one after another. However, a good data stream classification algorithm needs corresponding evaluation indicators to validate it. This section discusses the widely studied evaluation indicators for complex data streams, which are summarized in Table 8. Different types of data streams have their own specific indicators, and using them to measure performance gives researchers a clear understanding of the algorithm itself.
Confusion matrix
Evaluation indicators for complex data streams algorithms
(1) Traditional Classification Evaluation Indicators
In the traditional classification of steady-state data streams, researchers focus mainly on accuracy, error rate, memory usage and running time; these four indicators are the top priority for both regression and classification tasks. Changes in these indicators reflect the quality of the proposed algorithm. Running time, often used as a practical proxy for time complexity, is usually measured in seconds. To measure the performance of an algorithm more comprehensively, the more refined indicators Precision and Recall are also used. Precision is the proportion of samples identified as positive that are actually positive. The higher the precision, the more reliable the classifier's positive predictions. Recall is the proportion of all actual positive samples that are identified as positive. The higher the recall, the fewer positive samples the classifier misses.
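The two definitions can be made concrete with a short computation from true and predicted labels. The sketch below is illustrative only; the function name `precision_recall`, the choice of `1` as the positive class and the example labels are assumptions.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for one positive class, computed from paired labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Three actual positives; the classifier flags four samples, two of them correctly.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred)  # precision = 2/4, recall = 2/3
```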
(2) Data Streams Classification Evaluation Indicators
In data stream classification, various complications arise, such as changes in the underlying concept over time and a number of labels in the data stream that is not fixed. These situations must be evaluated by category. The following summarizes the evaluation indicators that researchers in the literature use for classifying non-stationary data streams. In a non-stationary environment, the traditional classification indicators can still be used as general measures, and there are new indicators for time and resource cost. Bifet et al. [97] proposed RAM-Hours, a single measure of the computing resources used by a streaming algorithm, based on the rental cost of cloud computing services.
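As a rough illustration of RAM-Hours, the sketch below assumes the usual convention that one RAM-Hour corresponds to 1 GB of RAM deployed for one hour. The function name `ram_hours` and the example figures are hypothetical.

```python
def ram_hours(memory_gb, runtime_seconds):
    """Resource cost in RAM-Hours, assuming 1 RAM-Hour = 1 GB of RAM held for 1 hour."""
    return memory_gb * runtime_seconds / 3600.0

# A model that keeps 0.5 GB in memory for two hours costs one RAM-Hour.
cost = ram_hours(memory_gb=0.5, runtime_seconds=7200)
```

Combining memory and time in one number makes it possible to compare a fast, memory-hungry ensemble with a slower, lighter one on the same scale.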
It can be seen that the F-measure depends on both recall and precision: it is large only when both the recall and the precision are large. The F-measure can therefore properly evaluate the classifier's performance on both majority-class and minority-class samples.
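This behavior can be checked numerically. The sketch below assumes the standard definition $F_\beta = (1+\beta^2)PR/(\beta^2 P + R)$, with $\beta = 1$ giving the harmonic mean of precision $P$ and recall $R$; the function name `f_measure` and the example values are hypothetical.

```python
def f_measure(precision, recall, beta=1.0):
    """F-beta score (assumed standard definition); beta=1 is the harmonic mean."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom

balanced = f_measure(0.5, 2 / 3)  # both moderate: F1 = 4/7
lopsided = f_measure(0.9, 0.1)    # high precision cannot offset low recall
```

The second call yields a much lower score than the first even though its precision is higher, which is why the F-measure is preferred to accuracy when the minority class matters.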
The Kappa statistic is frequently used for imbalanced data. The standard Kappa K is a commonly used evaluation indicator, and Bifet et al. [97] proposed the generalized Kappa M, which is better suited to imbalanced data than the standard K. For imbalanced data, the Receiver Operating Characteristic (ROC) curve and the Area Under the ROC Curve (AUC) are also commonly used to evaluate the performance of the classifier.
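The two Kappa variants can be sketched as follows, assuming the common definitions $K = (p_0 - p_c)/(1 - p_c)$ and $K_M = (p_0 - p_m)/(1 - p_m)$, where $p_0$ is the classifier's accuracy, $p_c$ the accuracy of a chance classifier with the same prediction distribution, and $p_m$ the accuracy of a majority-class classifier. The function name and example labels are hypothetical, and in a streaming setting these quantities would normally be computed over a sliding window rather than over the whole sequence.

```python
from collections import Counter

def kappa_statistics(y_true, y_pred):
    """Return (kappa, kappa_m) under the assumed definitions given above."""
    n = len(y_true)
    p0 = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_freq, pred_freq = Counter(y_true), Counter(y_pred)
    # Agreement expected by chance from the two marginal label distributions.
    pc = sum(true_freq[c] * pred_freq[c] for c in true_freq) / (n * n)
    # Accuracy of always predicting the most frequent true class.
    pm = max(true_freq.values()) / n
    kappa = (p0 - pc) / (1 - pc) if pc < 1 else 0.0
    kappa_m = (p0 - pm) / (1 - pm) if pm < 1 else 0.0
    return kappa, kappa_m

# Imbalanced labels: nine negatives and one positive.
y_true = [0] * 9 + [1]
always_majority = [0] * 10
kappa, kappa_m = kappa_statistics(y_true, always_majority)
```

A classifier that always predicts the majority class reaches 90% accuracy here, yet both statistics are zero, which exposes the lack of real predictive skill on the minority class.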
In addition to these commonly used indicators, more elaborate methods have been proposed to evaluate other characteristics of an algorithm. Shaker et al. [100] proposed recovery analysis, which assesses a learner's ability to discover concept drift quickly and take appropriate measures to maintain the quality and generalization performance of the model. It considers not only how well the model reduces errors in the new decision space but also the time required to do so.
Research directions and prospects
Although solutions exist for many types of data streams, many others still need to be explored at the algorithm level. Examples include mining streams with delayed labels, handling multiple types of concept drift, processing multiple parallel streams, and applying ensemble learning algorithms to uncertain data streams. These issues are analyzed below.
(1) The challenge of uncertain data streams ensemble classification
Classification of uncertain data streams is a research direction that has received little attention. Currently available data stream classification algorithms are designed for precise data, but uncertain data is very common in real applications. For example, during collection and transmission in sensor networks, information such as humidity, temperature and weather contains considerable uncertainty. Although uncertain data has been studied extensively in recent years, its classification, and ensemble classification in particular, remains a sparsely studied field. The author intends to use a detection technique to identify uncertainty in the attribute values of the data stream and to use decision trees as base classifiers in an ensemble to classify uncertain data streams.
(2) The challenge of multi-type streams ensemble classification
Most current stream ensemble algorithms handle only a single type of stream. However, in some applications, such as event history analysis involving data review, there may be multiple parallel streams. In such streams, the same event may appear at different times in each stream and may be described differently. This raises several new challenges: how to aggregate the information about the same event that is available in different streams, how to predict when an event will occur in one stream given knowledge of the other streams, and whether to develop new ensembles dedicated to handling multiple streams. The author is considering sliding window techniques to learn incrementally from the data sources and, to improve accuracy, ensembles of heterogeneous base classifiers to adapt to the different sources.
(3) The challenge of delayed data streams ensemble classification
Most current frameworks assume that the true target value is available immediately or with little delay. In real life, however, many predictions can only be verified later. Weather forecasting is an example: a forecast can be evaluated only in the future. This is the label delay problem: even if the true label can be obtained, it is not available immediately after the new instance arrives. When it is not possible to label all examples quickly, it may still be possible to obtain the true target value for a limited number of examples at a reasonable cost. The author suggests combining active learning with ensemble methods, using active learning to select the examples to label and the ensemble to train on them and improve accuracy.
(4) The challenge of multi-type concept drift ensemble classification
Data generated in production and daily life is continuous, and new concepts of various kinds are mixed into it. Most data stream classification algorithms still focus on detecting a single type of concept drift, which does not match real-world conditions. More efficient algorithms should be developed to handle multiple types of concept drift and keep improving classification accuracy. As a next step, the author is considering online learning, whose incremental nature allows previously learned knowledge to be recalled, together with the diversity of ensemble classifiers, to improve classification accuracy under drift.
Conclusion
Predicting data streams that arrive at high speed is inherently difficult. Ensemble learning methods can help the model fit the data better and improve performance. This article reviews existing ensemble classification algorithms for complex data streams. We first outline the basic background and methodology of data streams. We then discuss in detail the ensemble methods for complex data streams, data streams in application domains, and their evaluation techniques. Finally, we present the challenges facing ensemble classification of complex data streams and propose directions for future research. Ensemble learning has been shown to greatly improve predictive performance, but the comparison of algorithms in this review indicates that as ensemble structures become more complex, running time increases. Because an algorithm is affected by factors such as ensemble size, parameters and framework, researchers need to optimize ensemble models, for example with pruning strategies.
This review also has some shortcomings. Its coverage of domain data streams is not comprehensive enough; applications such as intrusion detection and sensor data streams need further treatment.
