Literature review and analysis on big data stream classification techniques

Abstract

Rapid growth in technology and information lead the human to witness the improved growth in velocity, volume of data, and variety. The data in the business organizations demonstrate the development of big data applications. Because of the improving demand of applications, analysis of sophisticated streaming big data tends to become a significant area in data mining. One of the significant aspects of the research is employing deep learning approaches for effective extraction of complex data representations. Accordingly, this survey provides the detailed review of big data classification methodologies, like deep learning based techniques, Convolutional Neural Network (CNN) based techniques, K-Nearest Neighbor (KNN) based techniques, Neural Network (NN) based techniques, fuzzy based techniques, and Support vector based techniques, and so on. Moreover, a detailed study is made by concerning the parameters, like evaluation metrics, implementation tool, employed framework, datasets utilized, adopted classification methods, and accuracy range obtained by various techniques. Eventually, the research gaps and issues of various big data classification schemes are presented.

Keywords

Big data streaming classification CNN accuracy Map Reduce framework

1. Introduction

Rapid improvement of World Wide Web (WWB), internet data rises and becomes big data. Based on the various properties of big data, big data classification is utilized in network bandwidth, security Filtering network data management, category management, network reputation management, green internet, etc. Big Data is described using velocity, volume, and variety, [57]. If the variety, velocity, and volume of the data are enhanced, the recent technologies are used to manage processing as well as storage of the data. The term Big Data Analytics is nothing, but the process of understanding and analysing the characteristics of vast datasets to extract statistical and geometric patterns [56]. Most of the data generated is originally streaming data. This is because of the data representing actions, measurements, and interactions, which come from the internet. Data is produced from an interval of time. In the streaming framework, high speed data, and algorithms must process very strict constraints of time and space. Streaming algorithms utilize data structures to give speed, and optimal answers [44].

Figure 1.

Classification of big data techniques.

One of the beneficial tasks in the applications of marketing, biomedicine and social media is the big data classification. The commonly employed model for resolving the challenges of the big data is the single traditional classification model [58]. The classifiers used for big data classification are Naive Bayes (NB), KNN, SVM, and so on. NB classifiers are widely used in information. Data fusion machine analytics is included for big data classification [11], NB is used for classification and target tracking in cloud computing [12] as well as robotics control [13]. Image and text are required for big data analysis, and for cyber analysis it consists of elastic learning and scalable methods [14, 10]. By its nature, KNN classifier is a slow classifier and does not have a small fixed-size training model for testing [9]. SVM is employed as a binary classifier, but nowadays, Multiclass SVMs (MSVMs) are used to classify vectors into different sets with the usage of trained oracles, are also being widely studied [5]. In SVM, the training activity is executed by introducing an optimal hyper plane [7].

The primary intention of this study is to provide a detailed survey of the various big data classification approaches. This review deliberates the existing methods employed in the distinct research works. The survey is done by analysing the employed framework, implementation tool, evaluation metrics, and so on, provided in the research works. Moreover, the adopted framework and the accuracy range are taken for performance evaluation of the suggested big data classification methods. The existing approaches have been categorized into distinct classification and then, the further survey is performed for the exploitation of research gaps and issues. Thus, it acts as the motivation for the future extension of effective big data management.

This paper is organized as: Section 1 gives the brief introduction about the paper, Section 2 provides the literature review of the existing classification approaches and Section 3 discusses the research gaps and issues. The analysis and discussion of the survey is deliberated in Section 4, and Section 5 concludes the paper.

2. Literature survey on different big data classification methodologies

This section extensively discusses the survey of different big data classification techniques. The categorization of distinct big data classification systems is pictorized in Fig. 1. The techniques of big data classification are categorized into eight, namely CNN based techniques, NN based techniques, Deep learning based techniques, KNN based techniques, Fuzzy based techniques, SVM based techniques, Optimization based techniques, and Decision Tree based techniques. The review of big data classification techniques provide a clear visualization of the suggested methods along with their significances and drawbacks.

2.1 Classification of big data classification techniques

The distinct research works adopting various big data classification techniques are elaborated in this section.

2.1.1 CNN based classification techniques

The research papers employing CNN based techniques in the big data classification systems are dis-cussed below.

Plathottam et al. [16] presented CNNs for huge datasets. Here, CNN is employed to control complex power grids. The CNN is trained by power data for performing multi-class multi-label classification from Midcontinent ISO (MISO). Tensor Flow is utilized for constructing CNN and to train the network.

Qin et al. [29] developed CNN to control features from raw data and mitigate the application of external resources as well as tool kits. Entity Tag Feature (ETF) is utilized for indicating the location of target nominal, which is easy, but more effective than Position Feature (PF). Dropout strategy is used for alleviating the over-fitting issue.

Cui et al. [32] developed Multi-scale Convolutional Neural Network (MCNN) in a single framework for extraction of features and classification. Based on learnable convolutional layers and multi-branch layer, MCNN automatically extracts features at various frequencies and scales for feature representation. MCNN is computationally efficient, as it naturally leverages Graphics Processing Unit (GPU) computing. However, the method did not use time series classification by other side information from various sources, like image, text, and speech. Deepak Puthal et al. [43] developed Dynamic Key-Length-Based Security Framework (DLSeF) from synchronized prime numbers. Here, the key is updated dynamically at short intervals for ensuring end-to-end security. The training is performed using CNN model for predicting lights at the night time from daytime imagery, at the same time learning features is used to predict poverty.

Dong [46] presented CNN based multiclass classification approach from Electrical Medical Records (EMRs). This framework comprises of two phases. In the first phase, pre-processing is performed with word embedding to represent the samples. In the second phase, training data is segmented to various subsets and training is done based on CNN on every subset.

2.1.2 NN based classification technique

The distinct research works employing the NN based techniques for the big data classification systems are presented below, Shan et al. [21] addressed stochastic optimization approach and applied Shuffled Frog-Leaping Algorithm (SFLA) in NN based classifier based parameter optimization. Initially, the NN classifier is developed and comparison is performed with SVM. NN is essential for huge database and it is difficult for extracting high level data. Then, the data set is established, covering speech and cancer data. Both databases have a large number of sample data with complex low level variance. After that, the NN parameter is optimized based on modified version of SFLA.

Sharma and Mangat [31] developed Relevance Vector Machine (RVM) based on data mining approach for big data classification using epidemic outbreak of Ebola virus and compared with other epidemic diseases and generalizing error and intra class separation is performed based on RVM classifier. Dai et al. [50] developed parallel randomized NN for classification. In this framework, optimization of the RNN parallelization is introduced for reducing the communication overhead in the Shuffle phase, like cache-based optimization, partitioning, and thus, enhances the performance and scalability.

2.1.3 Deep learning based classification approaches

Alsheikh et al. [2] presented scalable spark based approach for deep learning in big data. Deep learning uses a tool for adding value from raw mobile big data. Particularly, distributed deep learning is implemented based on Map Reduce on several spark workers. Every worker is capable of learning a partial deep model on a partition of the overall mobile, and a master deep framework is established for all partial models.

Li et al. [6] developed deep convolutional computation model using tensor representation for big data learning features from the vector space to tensor space. Tensor based approach is utilized for representing every heterogeneous object for revealing the hidden correlations. Tensor CNN uses current CNN to share the parameters. Additionally, a tensor full-connected layer, and a tensor pooling layer are devised for learning high-order features. Back-propagation approach is established for training the parameters of deep convolutional computation model.

Double-projection Deep Computation Model (DPDCM) is developed by Zhang et al. [8] for feature learning. DPDCM projects nonlinear subspaces for learning interacted features to replace each layer based on double-projection layer. Then, mapping is performed using weight sensor for modelling the interactions. After that, back propagation approach is introduced for training the parameters of DPDCM. At last, Privacy Preserving Double Deep Computation Model (PPDPDCM) is established for preventing private information.

Deng et al. [13] developed Fuzzy Deep Neural Network (FDNN) for extracting data from neural and fuzzy representations. The knowledge gained from these two views is merged to form final representation. Here, fuzzy representation mitigates the uncertainties and neural representation eliminates the noises in the raw data. Then, the developed approach used two best representations for final classification. Hence, FDNN is essential for applying data noise, and ambiguity.

Ravi et al. [14] developed deep learning approach for data classification. Initially, pipeline gathers the original data from the inertial sensors. Then, the input data is extracted into segments, in which the features from shallow and deep learning method are estimated in parallel. At last, these two feature sets are combined together and classification is performed based on soft-max and fully connected layers.

Long et al. [18] designed Deep Adaptation Network (DAN) for enhancing the feature transferability. The hidden representations are embedded for producing kernel Hilbert space, in which the embedding’s of various domain distributions are grouped. The domain discrepancy based on optimal multi-kernel selection approach for mean embedding matching is then minimized.

Xiong et al. [24] examined deep learning approach for geosciences data. This approach is utilized for learning meaningful patterns from huge amount of input data to map minerals in the south western Fujian metal organic zone of China. This framework is relevant for discovering difficult features required to detect and classify from the original data.

Koyamada et al. [25] developed DNN-based subject-transfer decoder. Large-scale functional Magnetic Resonance Imaging (fMRI) database is applied for attaining maximum decoding accuracy than other baseline methods. For addressing the knowledge obtained based on decoder, Principal Sensitivity Analysis (PSA) is applied to visualize the discriminative features for all subjects in the dataset.

Zou et al. [26] modelled Tencent deep learning approach for training, simplifying, and support large model in Tencent. Mariana includes three frameworks. They are multi-GPU model parallelism and data parallelism framework, a CPU cluster framework for large scale DNNs.

Kuang and He [30] presented DBN for feature extraction and classification. This framework is showed to be effective in discriminating Attention Deficit Hyperactivity Disorder (ADHD) from subtypes and control. The accuracy is enhanced with the results published in ADHD-200 Competition.

Shafiq and Torunski [49] presented formal model to structure and organize logs. Then, Bayesian technique is used for predicting and detecting any possible faults. Also, Map Reduce based distributed, single-pass, incremental approach is introduced to build, train and implement the developed framework.

2.1.4 KNN based classification techniques

The research papers utilizing the KNN based approaches for the big data classification systems are elaborated as follows.

Maillo et al. [3] developed KNN classifier for big data classification. Here, the map phase determines k-NN in various splits of the data. Then, the reduce stage is used for computing the definite neighbours from the list attained in the map phase. This approach employs KNN for scaling just by adding more computing nodes. This execution involves the better classification rate based on the original KNN model.

Ramírez-Gallego et al. [5] employed KNN classifier using apache spark for processing massive and high speed data streams. This method is designed for high speed, large scale, and streaming issues. This approach organizes the examples based on distributed metric tree, which contains top level tree for query routing and a set of distributed sub trees for performing the parallel searches. Then, the instance selection approach is introduced for updating and eliminating outdated instances from the case-base.

Hassanat [9] developed KNN classifier for inserting training instances to binary search tree to speed up the searching time for text samples. Here, two approaches are utilized for sorting the training examples. The primary one is the Norm-based Binary Tree (NBT) for computing maximum or minimum scaled norm. Instances with 1-norms are arranged in the right child, while with 0-norms are arranged in the left child of the node. This process is repeated till the similar norm in a leaf node is obtained. Secondary one is the improved NBT. Here, each example is inserted to search tree depending on the maximum, and the minimum Euclidean norms. At last, sequential KNN classifier is used for classifying feature vectors.

Popescu and Keller [54] presented a fusion strategy for Random Projections Fuzzy KNN (RPFKNN) for big data classification. This method is introduced using the values of class membership, is achieved in each projection using classification accuracy and FKNN.

2.1.5 Fuzzy based classification schemes

The research works adopting the fuzzy based techniques for the big data classification techniques are discussed below.

Fernandez et al. [17] designed evolutionary Fuzzy Rule Based Classification Systems (FR-BCSs) for classification. It consists of Knowledge Base (KB) construct by mans of the Chi-FRBCS-Bi Data approach, along with genetic tuning of the database based on 2-tuples representation. After that, the KB from every map are merged together to produce an ensemble classifier.

Iniguez et al. [19] presented Chi Big Data Support Filtering for classifications. Initially, the classification accuracy and geometric mean are considered. Then, the interpretability of the system is analyzed using rule base size. At last, the scalability is performed based on elapsed training times.

El Bakry et al. [20] developed fuzzy KNN classifier based on Map Reduce for classification. This approach comprises of mapper and reducer. The mapper is utilized for dividing the datasets to chunks and generates intermediate records. These records are produced using map function in the “key, dat” set form. Mapper executes the computing process and passes the results to the reducer. Then, the outputs obtained from the reducer to achieve the final output.

Akil Kumar et al. [22] presented FRBCS for solving the problems in big data, termed Chi-FRBCS-BigData. This approach uses interpretable model for managing vast data and provides accuracy with optimal performance time with Hadoop framework.

Segatori et al. [28] developed distributed Fuzzy Decision Tree (FDT) based on Map Reduce framework to generate multi path and binary way FDTs from big data. This approach uses a distributed fuzzy discretized that produces fuzzy partition for every repeated attribute using fuzzy information. The fuzzy partitions utilized FDT learning for choosing attributes at the decision nodes.

Azar and Hassanien [36] developed Linguistic Hedges Neuro-Fuzzy Classifier with Selected Features (LHNFCSF) for dimensionality reduction, selection of features and classification. In this framework, connection weights, propagation, and activation functions vary from NN.

2.1.6 SVM based classification approaches

The research works practicing the SVM based techniques for the big data classification are discussed below.

Rebentrost et al. [1] used optimized binary classifier for classification. A least-square formulation of the SVM permits the usage of phase computation and the quantum matrix inversion approach. The speed of the quantum approach is high, when the kernel matrix is dominated using less number of principal components. The quantum technique does not require representation of all the features of every training example, but it generates the appropriate data structure. After generating the kernel matrix, the individual features of the training data are fully hidden from the user.

Bishwas et al. [7] employed all-pair quantum multiclass-SVM for classification of big data. Here, k(k-1)/2 classifier is established for classifying the given hidden quantum query state. Classical multiclass-SVM can be executed with polynomial run tim; this framework uses quantum version for speed up the system. This framework uses various classification approaches, and speed up gain is attained with respect to their classical counterparts.

Bishwas et al. [12] developed multiclass SVM and Quantum One-Against-All Approach for handling the classification of big data. At first, k quantum binary classifiers are introduced. Then, these classifiers are utilized for classifying unknown quantum state and the prediction is done based on the maximum confidence score.

Zou [38] developed Max-Relevance-Max-Distance (MRMD) feature ranking technique that manages stability and accuracy of feature prediction as well as ranking task. This framework avoids the difficult calculation for ensuring the stability of feature selection.

2.1.7 Optimization based classification techniques

The research works practicing the optimization based techniques for the big data classification are discussed below.

Meera and RosilineJeetha [23] developed Acceleration Artificial Bee Colony-Artificial Neural Network (AABC-ANN) to solve high dimensionality and feature selection issues. This framework comprises of pre-processing, selection of features and classification. In the pre-processing step, the KNN approach is utilized for removing irrelevant, and redundancy attributes, for managing the unknown values. After that, the feature selection is done based on AABC for generating the optimal values. The selected features are fed to training and testing. Based on ANN classifier, the features are classified more precisely.

Hegde and Mundada [47] developed Cognitive Particle Swarm Optimization (PSO) in integration with Deep Learning for Big Data classification. This approach is utilized for extracting the features from the given dataset and classification is carried out based on Deep learning. The developed method can adjust its learning rate and the features are extracted efficiently from the obtained data.

Manoj et al. [52] developed Ant Colony Optimization (ACO) and Artificial Neural Network (ANN) based hybrid algorithm for feature selection to text categorization. This hybrid algorithm is found optimal and effective. Here, ACO has the ability of congregating promptly since it has search ability in the state space problem and could find minimal feature subset efficiently.

Lin et al. [55] developed improved Cat Swarm Optimization (CSO) for big data classification. Here, Improved CSO (ICSO) is applied for feature selection, and finally, the classification is performed using SVM. The method did not consider other methods to improve the traditional mode of CSO.

2.1.8 Decision tree based classification approaches

The research works adopting the decision tree classifiers for the big data classification are discussed below.

Triguero et al. [4] developed big data approach based on apache spark to tackle lack of density issues. Initially, the whole training dataset is divided to chunks, and the extraction is performed using positive samples. After that, the positive sets are broadcasted, so that all the nodes have distinct in-memory copy of the positive samples. For every negative chunk data, a subset of data is obtained based on the positive sample set. After that, Evolutionary under Sampling (EUS) is applied for mitigating the size of both classes and improve the classification performance. At last, various models are merged for predicting the classes of the test set.

Bhagat and Patil [11] developed Synthetic Minority Oversampling Technique (SMOTE) for classification. Initially, Binarization techniques are introduced for decomposing raw dataset to binary class subsets. Then, SMOTE approach is given for every subset of imbalanced binary class for data balancing. At last, Random Forest (RF) classifier is utilized for classification.

Bifet et al. [39] introduced evaluation techniques for streaming big data. This framework analyses unbalanced data streams, in which modification attained on various time scales, and the data is from testing and training from different models. Leszek et al. [40] presented decision trees for data streaming. A new measure for dividing the tree nodes is introduced using impurity measure termed error misclassification. This approach is utilized for deciding the better attribute determined from the whole data stream.

2.1.9 Other classification methods

The other classification methodologies adopted for big data classification are elaborated below.

Liu et al. [10] developed easy and entire system for classification on huge datasets based on NB classifier. Here, additional modules are required for automating the experiment. Yan et al. [15] Developed scalable classifier ensemble technique for combining the results from various classifiers to classify big data. Initially, various“judger” are generated from features and classifiers using training and validation outputs. Then, the judgers are ranked and joined as a boosted classifier. At last, spark-based classification framework is introduced for big data classification.

Peixoto et al. [27] developed Hierarchical Multi-Label Classification (HMC), termed Semantic Hierarchical Multi-Label Classification, for big data classification. Hierarchization, Vectorization, Indexation, Resolution, and Realization, are the steps involved in Semantic HMC. Here, the primary three phases performed label hierarchy from data sources. The remaining two phases are utilized for classifying new items based on the hierarchy labels.

Fong et al. [33] presented lightweight feature selection, termed Swarm Search with Accelerated Particle Swarm Optimization for classification. Here, comparative performance analysis is performed over two groups of classification algorithms, namely batch learning and incremental or data-stream learning, in the light of achieving top accuracy at the shortest possible pre-processing times.

Triguero et al. [34] presented Map Reduce solution for Prototype Reduction (MRPR) for classification. This approach is used to tackle issue that did not comprise of several thousands of examples, because of memory and runtime restrictions. The Map Reduce has transparent simple, and effective environment for prototype reduction computation. Three various reduce types, like Join, Fusion, and Filtering are also analysed for providing more accurate pre-processed sets.

Wu et al. [35] developed HACE for Big Data revolution features, and introduced Big Data processing model from the perspective of data mining. This approach uses mining, demand-driven aggregation of information sources, security as well as privacy considerations, and modelling of user interest.

Chen [37] presented Hierarchical Extreme Learning Machine (H-ELM) using Flink is one of the Graphics Processing Units (GPUs), and in-memory cluster computing platforms. Various optimization techniques are employed for improving performance, like reasonable partitioning strategy, cache-based approach, and memory mapping for mapping specific Java virtual machine objects for buffering. This framework is used for accelerating several variants of ELM and machine learning approaches.

Puthal et al. [41] developed Dynamic Prime Number based Security Verification (DPBSV) for big data streaming. This approach uses common shared key for updating dynamically to generate synchronized pairs of prime numbers. Li et al. [42] developed Privacy-Preserving Outsourced Classification in Cloud Computing (POCC) approach for processing and storing. Proxy fully holomorphic encryption mechanism Using Gentr’s is utilized for preserving the privacy data. Based on POCC, the Crypto Service Provider (CSP) and S evaluator are used to train a classification model from encrypted data with various public keys. This model is then stored in encrypted format in the evaluator that is used for providing service for the clients. However, in between CSP and S, there are some connections. The method failed to mitigate the computation cost and communication.

Vu et al. [45] developed first distributed streaming algorithm for learning decision rules. This framework utilizes horizontal and vertical parallelism for distributing Adaptive Model Rules (AMRules) on a cluster. The decision rules introduced by AM are comprehensible models, in which the antecedent rule is a conjunction on the attribute values, and the consequent is nothing, but the combination of the attributes.

Twardowski and Ryzko [48] presented multi-agent architecture for Big Data processing. This model uses lambda system and it shows autonomous agents are utilized for enhancing architecture, providing robust processing of data in real-time.

Zhai et al. [51] presented fuzzy based ELM classifier for classification. Initially, for every positive case, its nearest neighbor enemy is identified using Map Reduce, and p positive are produced in a random manner in its enemy nearest neighbor based on uniform distribution. After that, balanced Datasets are established and l classifier is trained using constructed data subsets with non-iterative learning. At last, the trained classifier is combined using fuzzy integral for classifying unseen cases.

Bifet et al. [53] developed Scalable Advanced Massive Online Analysis (SAMOA) tool for big data stream mining. This framework is used to gather distributed streaming approaches for machine learning and data mining tasks, like classification, clustering, regression, and programming abstractions. Also, this method measures a pluggable design to run on various processing engines in a distributed manner, namely Samza as well as Storm S4.

3. Research gaps identified

This section deals with the various research gaps and challenges existing in the literature works Elaborated, as follows.

The limitations of deep learning based approaches are as follows: In [2], deep learning classifier is utilized for speeds up the deep learning model consists of several hidden layers and millions of parameters on a computing cluster. The method in [13] did not consider linear regression model for price value prediction. In [30], DBN does not consider personal characteristic data and Functional Magnetic Resonance Imaging (fMRI)-based information for analysis.

The challenges faced by CNN based classification techniques were as follows: The method in [29] did not consider target nominal for classification, and also, failed to add some linguistic prior knowledge into system. The major challenges in the NN based data classification approaches are: NN [21] failed to consider more data set and compare Shuffled Frog-Leaping Approach (SFLA) with stochastic search algorithms. The challenges faced by the KNN based classification of data are, KNN classifier [3] did not use recent techniques, like spark for faster computation.

The major challenges in the fuzzy based classification approaches are deliberated as follows: The method did not consider other programming model, like Spark for improving the efficiency and scalability [17]. In [20], fuzzy KNN classifier does not consider fuzzy mechanism in the reducer. The technique in [22] requires further improvement in the performance using techniques, like FCBF and Bagging algorithm. The method in [36] did not estimate neural fuzzy in various medical diagnosis issues, such as micro array gene selection, and other data mining problems.

The challenging issues of the optimization based data classification techniques are, ACO [52] did not introduce other classifiers, and failed to use other kinds of data, such as audio, video, images, and so on.

The challenges faced by the Decision tree based data classification techniques are as follows: The method in [4] did not consider efficient big data strategies that can deploy pre-processing procedures, like under sampling as well as oversampling approaches in the big data context. Synthetic Minority Oversampling Technique (SMOTE) [11] did not address the problem caused by oversampling rate as it varies with respect to the dataset. The method failed to consider regression, multi-label, and multi-target learning [39].

The challenges faced by other approaches are: The method in [10] failed to consider information fusion over text and imagery, cyber analysis using cloud computing, and distributed robotics applications. Semantic HMC failed to add integrate between the Reduce and the Map phases to minimize computation time and size of the data to the reducers [27]. In [45], the method did not consider new approaches to deal with both high dimensional and large-scale datasets. In [34], the method did not use structured and multi-target learning.

4. Analysis and discussion

The analysis and discussion of the distinct research works for the big data classification in accordance to the employed framework, utilized datasets, accuracy range, evaluation metrics, implementation tool, and adopted classification methods are elaborated in this section.

4.1 Based on the methodologies employed

The analysis performed on the basis of employed big data classification schemes is elaborated in this subsection. The distinct classification methodologies adopted for the effective classification of big data are depicted in Fig. 2. From the Fig. 2, it is conveyed that among 55 research papers, about 11 research papers, i.e. 27%, have used deep learning classifier, 11% of the research works have utilized Fuzzy classifier. The CNN classifier is adopted in 12% of the researches, whereas, 7% of the researches, i.e. 3 research papers, employ NN classifier and the remaining works employed other techniques, such as KNN, SVM, and Optimization, and decision tree classifier.

4.2 Based on implementation tools

This subsection deliberates the analysis carried out considering the simulation tools employed in the research papers. Table 1 represents the distinct implementation tools employed for performing the effective big data classification. The commonly practiced implementation tools are Java, Cloudera, MATLAB, Hadoop, and Weka. From Table 1, it is clearly shown that the MATLAB tool is the frequently adopted implementation tool for effective big data classification.

4.3 Based on evaluation metrics

This subsection demonstrates the analysis based on the different evaluation metrics used in the re-search papers.

Table 1
Analysis based on the implementation tools used for big data classificatio

Implementation tools	Research paper
Java	[2]
Cloudera	[3, 4, 5]
Matlab	[23, 36, 41, 44, 46, 51]
Weka	[38]
Others	[6, 8, 14, 16, 26, 29, 21, 29, 53]

Figure 2.

Analysis based on the classification schemes.

Figure 3.

Analysis based on evaluation metrics utilized in classification.

Figure 4.

Analysis based on data set utilized in the classification approaches.

The analysis chart showing the research papers for big data classification system based on evaluation metrics, like time, speedup, accuracy, F-measure, run time, and precision is depicted in Fig. 3. From the Fig. 3, it is conveyed that among 55 research papers, about 23 research papers, i.e. 46%, have used accuracy metric and 20% of research works have utilized time. Speedup metric is adopted in 12% of the researches, whereas, 8% of the researches, i.e. 4 research papers, employ other metrics. Nearly, 6% of the research works are employed with the precision metric, 4% of the papers used F-measure and the run time is employed in 2 research works. Thus, it is seen that most of the research papers have used the accuracy metric for the big data classification.

4.4 Based on accuracy range

The analysis carried out in the aspect of the accuracy range is elaborated in this subsection. Table 2 represents the analysis based on the classification accuracy. From the table, it is elucidated that the accuracy value within the range of 60%–70% is achieved by 5 research papers and 70%–80% is attained by 8 research works. Then, the accuracy between 80%–90% is achieved by 2 research works and 5 research papers attained the accuracy in the range of 90%–99.9%.

4.5 Based on datasets employed

Table 2
Analysis based on classification accurac

Accuracy range	Research paper
60%–70%	[19, 25, 30, 40, 50]
70%–80%	[8, 9, 13, 17, 20, 22, 33, 39]
80%–90%	[10, 18]
90%–99.9%	[5, 6, 16, 23, 47]

The analysis carried out regarding the datasets employed by the distinct research works is elaborated in this subsection. Figure 4 depicts the different datasets used for the big data classification. The commonly utilized datasets in the distinct research works are real-world, poker hand, UCI dataset repository, cover type, electricity, ADHD, and flink datasets. From Fig. 4, it can be shown that the most frequently utilized datasets are UCI dataset repository, and real-world dataset.

4.6 Based on the frameworks employed

This subsection deliberates the analysis carried out by considering the adopted frameworks for the big data classification using Table 3. Through this analysis, it is clearly shown that among the 55 research papers, about 12 research papers have used MR framework and 6 research works are adopting Hadoop framework. Then, 4 research papers have used Apache Hadoop framework and 7 research works have utilized Apache Spark framework. Thus, it is clear that the commonly employed framework to handle big data is the MR framework.

Table 3
Analysis based on frameworks used for big data classificatio

Frameworks	Research paper
Map Reduce framework	[3, 10, 11, 17, 19, 20, 27, 37, 48, 49, 50, 51]
Apache Hadoop framewor	[17, 21, 34, 35]
Apache spark framework	[2, 4, 5, 11, 15, 28, 39]
Hadoop framework	[10, 22, 44, 45, 48, 54]

5. Conclusion

A survey on different data stream mining schemes employed for the effective classification of big data is elucidated in this study. The primitive goal of this paper is to review, categorize and learn the distinct classification utilized for the big data classification through the analysis of various research papers from IEEE, Google Scholar, Elsevier and Science Direct etc. The analysis and discussion were made concerning the evaluation metrics, adopted classification methods, utilized datasets, implementation tool, employed framework, and accuracy range.

Also, this survey suggests the major future scope for the effective big data classification by considering the research gaps and issues. In accordance with the review and analysis, it can be concluded that the accuracy is widely concerned evaluation metric in most of the research works for exploring the effectiveness of big data classification. The vastly adopted framework is MR framewor; and the frequently employed classification methodologies are the deep learning classifier and the fuzzy classifier.

References

Mohseni

Rebentrost

and Lloyd

, Quantum support vector machine for big data classification, Physical Review Letters 113(13) (2014), 130503.

Niyato

Tan

H.P.

Alsheikh

M.A.

Lin

and Han

, Mobile big data analytics using deep learning and apache spark, IEEE Network 30(3) (2016), 22–29.

Herrera

Triguero

and Maillo

, A map reduce-based k-nearest neighbour approach for big data classification, In Trustcom/BigDataSE/ISPA 2 (August 2015), 167–172.

Merino

Herrera

Bustince

Triguero

Maillo

and Galar

, Evolutionary under sampling for extremely imbalanced big data classification under apache spark, In Evolutionary Computation (CEC), July 2016, 640–647.

Krawczyk

Herrera

Benitez

J.M.

Wozniak

Garcia

and Ramírez-Gallego

, Nearest neighbour classification for high-speed big data streams using spark, IEEE Transactions on Systems, Man, and Cybernetics: Systems 47(10) (2017), 2727–2739.

Yang

L.T.

Deen

M.J.

Zhang

and Chen

, Deep convolutional computation model for feature learning on big data in internet of things, IEEE Transactions on Industrial Informatics 14(2) (2018), 790–798.

Yang

L.T.

Deen

M.J.

Zhang

and Chen

, Privacy-preserving double-projection deep computation model with crowdsourcing on cloud for big data feature learning, IEEE Internet of Things Journal 5(4) (2018), 2896–2903.

Bishwas

A.K.

Mani

and Palade

, An all-pair quantum SVM approach for big data multiclass classification, Quantum Information Processing 17(10) (2018), 282.

Hassanat

, Norm-based binary search trees for speeding up KNN big data classification, Computers 7(4) (2018), 54.

10.

Liu

Shen

Blasch

Chen

and Chen

, Scalable sentiment classification for big data analysis using naive bayes classifier, in: Proceedings of International Conference on Big Data, IEEE, October. 2013, pp. 99–104.

11.

Bhagat

R.C.

and Patil

S.S.

, Enhanced SMOTE algorithm for classification of imbalanced big-data using random forest, in: Proceedings of International Advance Computing Conference (IACC), IEEE, June 2015, pp. 403–408.

12.

Bishwas

A.K.

Mani

and Palade

, Big data classification with quantum multiclass SVM and quantum one-against-all approach, in: Proceedings of 2nd International Conference on Contemporary Computing and Informatics, December 2016, pp. 875–880.

13.

Bao

Dai

Kong

Deng

and Ren

, A hierarchical fused fuzzy deep neural network for data classification, IEEE Transactions on Fuzzy Systems 25(4) (2017), 1006–1012.

14.

Wong

Ravi

and Yang

G.Z.

, A deep learning approach to on-node sensor data analytics for mobile or wearable devices, IEEE Journal of Biomedical and Health Informatics 21(1) (2017), 56–64.

15.

Shyu

M.L.

Zhu

Chen

S.C.

and Yan

, A classifier ensemble framework for multimedia big data classification, in: Proceedings of 17th International Conference on Information Reuse and Integration (IRI), IEEE, July 2016, pp. 615–622.

16.

Salehfar

Ranganathan

and Plathottam

S.J.

, Convolutional Neural Networks (CNNs) for power system big data analysis, in: Power Symposium (NAPS), IEEE, September 2017, North American pp. 1–6.

17.

Fernández

Herrera

and del Río

, A first approach in evolutionary fuzzy systems based on the lateral tuning of the linguistic labels for big data classification, in: Proceedings of International Conference on Fuzzy Systems, July 2016, pp. 1437–1444.

18.

Wang

Jordan

M.I.

Long

and Cao

, Learning transferable features with deep adaptation networks, 2015.

19.

Fernandez

Íñiguez

and Galar

, Improving Fuzzy Rule Based Classification Systems in Big Data via Support-based Filtering, in: Proceedings of International Conference on Fuzzy Systems, July 2018, pp. 1–8.

20.

El Bakry

Hegazy

and Safwat

, Big data classification using fuzzy K-nearest neighbor, International Journal of Computers and Applications 132(10) (December 2015), 8–13.

21.

Nie

S.P.

and Shan

, Shuffled frog-leaping algorithm based neural network and its using in big data set, in: Proceedings of International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery, July 2017, pp. 707–711.

22.

Akil Kumar

Mithunkumar

Anitha

Kiruthiga

and Priya

, Improved Fuzzy Rule Based Classification System Using Feature Selection and Bagging for Large Datasets, International Journal of Science and Research, 2015.

23.

Jeetha

B.R.

and Meera

, Acceleration artificial bee colony optimization-artificial neural network for optimal feature selection over big data, in: Proceedings of International Conference on Power, Control, Signals and Instrumentation Engineering, September 2017, pp. 1698–1706.

24.

Carranza

E.J.M.

Zuo

and Xiong

, Mapping mineral prospectivity through big data analytics and a deep learning algorithm, Ore Geology Reviews 102 (2018), 811–817.

25.

Nakae

Koyama

Ishii

Koyamada

and Shikauchi

, Deep learning of MRI big data: a novel approach to subject-transfer decoding, 2015.

26.

Xiao

Wang

Jin

Zou

and Guo

, Mariana: Tencent deep learning platform and its applications, In Proceedings of the VLDB Endowment 7(13) (2014), 1772–1777.

27.

Bertaux

Cruz

Silva

Peixoto

and Hassan

, Semantic HMC: A predictive model using multi-label classification for big data, In Trustcom/BigDataSE/ISPA, IEEE 2 (August 2015), 173–179.

28.

Segatori

Marcelloni

and Pedrycz

, On distributed fuzzy decision trees for big data, IEEE Transactions on Fuzzy Systems 26(1) (2018), 174–192.

29.

Guo

Qin

and Xu

, An empirical convolutional neural network approach for semantic relation classification, Neurocomputing, 2016, 1–9.

30.

Kuang

and He

, Classification on ADHD with deep learning, in: Proceedings of International Conference on Cloud Computing and Big Data (CCBD), November 2014, pp. 27–32.

31.

Sharma

and Mangat

, Relevance vector machine classification for big data on Ebola outbreak, in: Proceedings of 1st International Conference on Next Generation Computing Technologies (NGCT), September. 2015, pp. 639–643.

32.

Chen

and Cui

, Multi-scale convolutional neural networks for time series classification, 2016.

33.

Vasilakos

Wong

and Fong

, Accelerated PSO swarm search feature selection for data stream mining big data, IEEE Transactions on Services Computing 1 (2016), 1–1.

34.

Peralta

Herrera

Triguero

Bacardit

and García

, MRPR: A MapReduce solution for prototype reduction in big data classification, Neurocomputing, 2015, 331–345.

35.

G.Q.

Ding

and Zhu

, Data mining with big data, IEEE Transactions on Knowledge and Data Engineering 26(1) (2014), 97–107.

36.

Hassanien

A.E.

and Azar

A.T.

, Dimensionality reduction of medical big data using neural-fuzzy classifier, Soft Computing 19(4) (2015), 1115–1127.

37.

Ouyang

Chen

and Tang

, GPU-accelerated parallel hierarchical extreme learning machine on flink for big data, IEEE Transactions on Systems, Man, and Cybernetics: Systems 47(10) (2017), 2740–2753.

38.

Zeng

Cao

Zou

and Ji

, A novel features ranking metric with application to scalable visual and bioinformatics data classification, Neurocomputing 173 (2016), 346–354.

39.

Bifet

Pfathringer

de Francisci Morales

Holmes

and Read

, Efficient online evaluation of big data stream classifiers, in: Proceedings of21th International Conference on Knowledge Discovery and Data Mining, August 2015, pp. 59–68.

40.

Rutkowski

Pietruczuk

Jaworski

and Duda

, A new method for data stream mining based on the misclassification error, IEEE Transactions on Neural Networks and Learning Systems 26(5) (2015), 1048–1059.

41.

Puthal

Chen

Ranjan

and Nepal

, DLSeF: A dynamic key-length-based efficient real-time security verification model for big data stream, ACM Transactions on Embedded Computing Systems (TECS) 16(2) (2017), 51.

42.

Gao

C.Z.

Chen

Huang

and Chen

W.B.

, Privacy-preserving outsourced classification in cloud computing, Cluster Computing, 2017, 1-10.

43.

Lobell

Burke

Xie

Jean

and Ermon

, Transfer learning from deep features for remote sensing and poverty mapping, in: Proceedings of13th AAAI Conference on Artificial Intelligence, March 2016.

44.

De Francisci Morales

, SAMOA: A platform for mining big data stream, in: Proceedings of 22nd International Conference on World Wide Web, May 2013, pp. 777–778.

45.

Bifet

A.T.

Morales

G.D.F.

and Gama

, Distributed adaptive model rules for mining big data streams, in: Proceedings of International Conference on Big Data, October 2014, pp. 345–353.

46.

Yang

Huang

Qian

Dong

and Guan

, A multiclass classification method based on deep learning for named entity recognition in electronic medical records, in: Scientific Data Summit (NYSDS), 2016 August, pp. 1–10.

47.

Mundada

M.R.

and Hegde

, A Hybrid Approach of Deep Learning with Cognitive Particle Swarm Optimization for the Big Data Analytics, in: Proceedings of 9th International Conference on Computing, Communication and Networking Technologies (ICCCNT), July 2018, pp. 1–5.

48.

Twardowski

and Ryzko

, Multi-agent architecture for real-time big data processing, in: Proceedings of International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies, Vol. 3, August 2014, pp. 333–337.

49.

Torunski

and Shafiq

M.O.

, Towards Map Reduce based Bayesian deep learning network for monitoring big data applications, in: Proceedings of International Conference on Big Data (Big Data), December 2017, pp. 2112–2121.

50.

Chen

and Dai

, A parallel randomized neural network on in-memory cluster computing for big data, in: Proceedings of 13th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD), July 2017, pp. 1769–1778.

51.

Zhai

Zhang

and Liu

, Fuzzy integral-based ELM ensemble for imbalanced big data classification, Soft Computing 22(11) (2018), 3519–3531.

52.

Vijayakumar

Praveena

M.A.

and Manoj

R.J.

, An ACO-ANN based feature selection algorithm for big data, Cluster Computing, March 2018, 1–8.

53.

Bifet

and Morales

G.D.F.

, Bigdata stream learning with Samoa, in: Proceedings of International Conference on Data Mining Workshop, December 2014, pp. 1199–1202.

54.

Keller

J.M.

and Popescu

, Random projections fuzzy k-nearest neighbour (RPFKNN) for big data classification, in: Proceedings of International Conference on Fuzzy Systems, July 2016, pp. 1813–1817.

55.

Hung

J.C.

Lin

K.C.

Zhang

K.Y.

Yen

and Huang

Y.H.

, Feature selection based on an improved cat swarm optimization algorithm for big data classification, The Journal of Supercomputing 72(8) (2016), 3210–3221.

56.

Suthaharan

, Big data classification: Problems and challenges in network intrusion prediction with machine learning, ACM Sigmetrics Performance Evaluation Review 41(4) (2014), 70–73.

57.

Chris Eaton, Dirk deRoos, Thomas Deutsch, George Lapis, Paul C. Zikopoulos, Understanding Big Data, October 2011.

58.

Gandhi

B.S.

and Deshpande

L.A.

, The survey on approaches to efficient clustering and classification analysis of big data, in: Proceedings of International Conference on Computing Communication Control and Automation (ICCUBEA), August 2016, pp. 1–4.

Literature review and analysis on big data stream classification techniques

Abstract

Keywords

1. Introduction

2.1 Classification of big data classification techniques

2.1.1 CNN based classification techniques

2.1.2 NN based classification technique

2.1.3 Deep learning based classification approaches

2.1.4 KNN based classification techniques

2.1.5 Fuzzy based classification schemes

2.1.6 SVM based classification approaches

2.1.7 Optimization based classification techniques

2.1.8 Decision tree based classification approaches

2.1.9 Other classification methods

3. Research gaps identified

4. Analysis and discussion

4.1 Based on the methodologies employed

4.2 Based on implementation tools

4.3 Based on evaluation metrics

Table 1 Analysis based on the implementation tools used for big data classificatio

4.5 Based on datasets employed

Table 2 Analysis based on classification accurac

Table 3 Analysis based on frameworks used for big data classificatio

References

Table 1
Analysis based on the implementation tools used for big data classificatio

Table 2
Analysis based on classification accurac

Table 3
Analysis based on frameworks used for big data classificatio