PGNBC: Pearson Gaussian Naïve Bayes classifier for data stream classification with recurring concept drift

Abstract

In data stream classification, selecting a classifier for a dynamic feature space while accounting for concept drift is a challenging task. This paper addresses the major challenges in data stream classification with recurring concept drift. We develop a novel classification method, the Pearson Gaussian Naïve Bayes classifier (PGNBC). The proposed PGNBC extends the existing Gaussian Naïve Bayes classifier (GNBC) by additionally incorporating the correlation among the attributes. For data stream classification, the PGNBC model is updated based on the detected concept drift. The method is evaluated against the existing methods RGNBC and MReC-DFS, using sensitivity, specificity and accuracy as performance metrics. On the skin data, PGNBC improves on RGNBC by 4%, 1% and 1% in sensitivity, specificity and accuracy, respectively. On the localization data, the improvements over RGNBC in specificity and accuracy are 6% and 2%, respectively.

Keywords

Data stream, recurring concept drift, Naïve Bayes, rough set theory, classification

1. Introduction

Data stream mining extracts patterns and informational structure from continuous data streams, and a lot of work has been done in this area. In static environments, the complete data set is available to the learning algorithm, the main interest lies in the data given to the machine learning algorithm, and the target concept to be learned does not change. Many types of classifiers have been developed as solutions to this static classification problem. In contrast, some learning algorithms are used in dynamic environments. Sensor networks, telecommunication, traffic management, web log analysis and monitoring [5] are examples of applications of learning algorithms in dynamic environments. In the dynamic environment of data stream mining, the classification of data is quite challenging.

Data mining is an important tool for classification and prediction. Classification identifies, from a set of collected data, a model that explains the characteristics of the data and helps to predict unknown variables; constructing this model is the central step. The Naïve Bayes classifier is one of the most widely used classifiers. It takes a probabilistic approach and assumes that the attributes are independent, an assumption that is often not satisfied. To overcome this limitation and improve performance, a Bayes classification approach is developed in [19]. Several extensions of the Naïve Bayes classifier have been proposed for better performance and efficiency: the structure of the classifier is extended to represent the dependencies among the attributes, different weights are assigned to the attributes when building the classifier, and irrelevant attributes are discarded using a feature selection approach.

The problem of concept drift [1, 2, 3, 4, 9, 10, 11, 12, 13] is addressed in many works. Beyond the drift problem, attention should also be given to the integration of context information, recurring changes in concept and feature space evolution. Under feature space evolution, the feature set and the importance of each feature to the target concept may change [16]. Learning models can be used to solve these problems, but they are time consuming [14, 15, 17]. Relearning must be done if the concept is new and non-recurring [6]. The ideas developed in [18, 19, 20, 21] show that recurrent concept change can be resolved without a relearning mechanism. These methods, however, depend on user-defined parameters, which makes it difficult to decide whether a new concept matches an existing one [21].

The key challenges in data stream classification include the concept drift problem and classifier selection. The main contribution of this work is the development of the novel PGNBC for data stream classification with recurring concept drift. The contributions of this paper include: addressing recurring concept drift in data stream classification; proposing a Pearson Gaussian Naïve Bayes Classifier (PGNBC); using PGNBC for data stream classification with recurring concept drift; and experimentally demonstrating the developed model and comparing its performance with the existing methods RGNBC and MReC-DFS.

This paper is organized as follows. Section 2 reviews work related to data stream classification, and Section 3 gives the main motivation and the key challenges of the present work. Section 4 explains the construction of the proposed model, including the modelling and classification approaches. Section 5 details the updating of the PGNBC model with additional features, Section 6 discusses the results and Section 7 concludes the paper.

2. Literature review

Research works on data stream classification are reviewed and presented in Table 1. The drift problem is one of the most commonly noted problems in data stream classification [18, 19, 20], and researchers have considered recurring concept drift when performing data stream classification [6, 21]. Many solutions have been derived for the drift problem, but all of these methods have their own drawbacks, such as weight updating without the use of historical data, outlier data points and sample estimation in the arriving data. These problems are quite challenging in addition to the recurring concept drift. Based on the literature review, the recent research works can be grouped into four major categories. In the first category, the Naïve Bayes classifier is used to perform data stream classification. In the second category, a weighting mechanism is included to update the classifier model based on the new data. In the third category, dynamic feature space-based data stream classification is performed. In the fourth category, feature selection is used to perform data stream classification.

Table 1
Literature review

Authors	Contribution	Advantages	Disadvantages
Zhou et al.	Hierarchy Restricted Naive Bayes classifier	Avoid the unwanted computation of field comparison, High performance	Produce extra false non matching result
Li et al.	Differential evolution-Naïve Bayes classifier	Optimal process for classification	Time consuming
Lee	Value weighting method	Minimizing the value of error function	Computational cost
Karabatak	Weighted Naïve Bayes classifier	Overcomes the problem of equal distribution	Computationally expensive and initialization of weight vector is application dependent
Brzezinski and Stefanowski	Accuracy-based weighting mechanisms	Consider the periodic weighting mechanism	Adapting weight for different data space seems tough
Gomes et al.	Dynamic feature space-based model learning	No holdout set is needed for testing, making use of all the available training data	Distribution of data is required to do classification
Masud et al.	Concept-drift and concept-evolution-based ensemble classifier	Addresses four major challenges, namely, infinite length, concept-drift, concept-evolution, and feature-evolution	It finds difficult to distinguish from the actual arrival of a novel class
Abdulsalam et al.	Combines the ideas of streaming decision trees and Random Forests	It quickly records the new expected classification accuracy after the changes are presented in the stream	Handling multiple classes with this hybrid model is difficult
Wankhade et al.	Hybrid feature selection-based classification	It correctly identified the best features using a genetic algorithm	The genetic algorithm can get trapped in a local minimum and miss the global minimum
Lutu	Sliding window and feature selection-based classification	Delay due to updating of model is effectively handled with sliding window	The dynamic recurring feature space was not handled.

Naive bayes classifier-based data stream classification: Zhou et al. [18] developed the Hierarchy Restricted Naïve Bayes classifier to reduce the unwanted computation of field comparison. The developed method has high performance but it generates extra false non-matching results due to misalignment. For obtaining higher classification accuracy in Naïve Bayes, the differential evolution Naïve Bayes classifier is proposed by Li et al. [19].

Weighting mechanism based data stream classification: The value weighting method is added to the Naïve Bayes method by Lee [20] to minimize the value of the error function. This method reduces the error function but is computationally expensive. To overcome the problem of equal distribution, a weighted Naïve Bayes classifier is proposed by Karabatak [21]. However, it is computationally expensive and the initialization of its weight vector is application dependent. Brzezinski and Stefanowski [5] proposed accuracy-based weighting mechanisms that consider periodic weighting, with the drawback that adapting the weights to different data spaces is difficult.

Dynamic feature space model for data stream classification: A dynamic feature space-based model learning method is developed by Gomes et al. [6], which makes use of all the available training data without requiring a holdout set. A novel concept-drift and concept-evolution-based ensemble classifier is developed by Masud et al. [7] to address the problems of infinite length, concept-drift, concept-evolution and feature-evolution; however, it has difficulty distinguishing the actual arrival of a novel class. Abdulsalam et al. [8] combined the ideas of streaming decision trees and Random Forests to record the new classification accuracy quickly after changes appear in the stream, but handling multiple classes with this hybrid model is difficult.

Feature selection model-based data stream classification: Wankhade et al. [26] proposed a hybrid feature selection (HFS) method that adopts both filter and wrapper models of feature subset selection. The feature selection algorithm uses a genetic algorithm to evaluate the contribution of features to the classification task in a feature subset. Lutu [27] has conducted experiments to identify efficient computational methods for selecting relevant features for NB classification based on the sliding window method of stream mining. To overcome the problems identified above, we propose a novel method called the Pearson Gaussian Naïve Bayes Classifier for data stream classification with recurring concept drift.

3. Problem definition

The task is data stream classification in the presence of recurring concept drift. For this problem, the input data stream is denoted $D$ and is represented as,

$\displaystyle D=\left\{{d_{t};1\leqslant t\leqslant N}\right\}$ (1)

$\displaystyle d_{t}=\left\{{d_{t}^{jk};1\leqslant j\leqslant n_{t};1\leqslant k\leqslant m_{t}}\right\}$ (2)

Where, $d_{t}$ is the data arriving at time $t$, $n_{t}$ is the number of data objects and $m_{t}$ is the number of features. The variation in the feature dimensions at each time causes the problem of recurring concept drift. So, $d_{t}=\left[{a_{t}^{1},a_{t}^{2},\ldots,a_{t}^{m_{t}}}\right]$.

When the data volume is large, classification is hard, so the classification model is updated as the recurring data stream arrives. Data stream classification still faces several challenges, which are described below:

• The data stream changes with respect to time intervals and depending on the change of time, the learning model must be built dynamically with the adapting classifier.

• During the updating process of the developed model, it is very important to consider the multiple scanning over the original database. The multiple scanning should not be used frequently, since the storage of historic data is very hard.

• Since the boundary of the feature space varies, it is very necessary to consider the concept drift in the classifier model.

• In the classification of data stream, selecting the classifier for the dynamic feature space is a challenge which is not addressed effectively.

• For data stream classification, it is significant to focus on the recurring concept and context changes that occur during the dynamic change of feature space.

• Preservation and selection of the dynamic features in the recurring concept drift is an added challenge to be considered.

The Naïve Bayes classification algorithm with a weighting mechanism is used in [6] to solve the problem of recurring concept drift during data stream classification. The embedded weighting mechanism estimates the error and accuracy values. Because the data is dynamic, the data and classes under consideration are not constant over time, and giving importance only to error and accuracy produces poor performance. So, specificity and sensitivity should also be added to the weighting mechanism.

4. Constructing Pearson Gaussian Naïve Bayes Classifier (PGNBC)

The data stream classification with the Gaussian Naïve Bayes Classifier [24] includes two stages: model construction and the classification process. In the model construction stage, the method estimates the parameters of a probability distribution, assuming predictors are conditionally independent given the class. During the construction of the model, an information table is developed from the mean and variance of all attributes. More importantly, we additionally add the correlation among the attributes. In the subsequent classification step, the proposed method computes the posterior probability of a sample belonging to each class and then classifies the test data according to the largest posterior probability. Figure 1 shows the algorithmic description of the proposed PGNBC classifier.

Figure 1.

Description of the proposed PGNBC algorithm.

Model: The model construction is the initial stage and in this stage the information table is developed with the given input data represented as $d_{0}$ . The information table $IT_{t}$ is given by,

$\displaystyle IT_{t}=\left\{{IT_{t}^{\textit{mean}},IT_{t}^{\text{var}},IT_{t}^{\textit{cor}}}\right\}$ (3)

Where, $IT_{t}^{\textit{mean}}$ is the information table with the mean values of the attributes, $IT_{t}^{\text{var}}$ is the information table with the variance values of the attributes, and $IT_{t}^{\textit{cor}}$ is the information table with the correlation values of the attributes.

The information table with mean values of the attributes is given by,

$\displaystyle IT_{t}^{\textit{mean}}=\left\{{IT_{ak}^{\textit{mean}};1\leqslant a\leqslant c;1\leqslant k\leqslant m_{t}}\right\}$ (4)

where, size of the table is $c\times m_{t}$ and $m_{t}$ is the number of attributes at time period $t$ and $c$ is number of classes, $a$ represents the class with attribute $k$ .

$\displaystyle IT_{ak}^{\textit{mean}}=\frac{1}{n_{t}^{a}}\sum\limits_{j=1}^{n_{t}^{a}}{d_{t}^{jk}}$ (5)

where, $n_{t}^{a}$ is the number of data samples in the $a^{\text{th}}$ class at time period $t$ and $d_{t}^{jk}$ represents the value of the $k^{\text{th}}$ attribute of the $j^{\text{th}}$ data sample at time period $t$. The variance value is computed based on the following equation.

$\displaystyle IT_{t}^{\text{var}}=\left\{{IT_{ak}^{\text{var}};1\leqslant a\leqslant c;1\leqslant k\leqslant m_{t}}\right\}$ (6)

where, $IT_{ak}^{\text{var}}$ represents the information table entry with the variance of the attribute, and it is computed as follows,

$\displaystyle IT_{ak}^{\text{var}}=\frac{1}{n_{t}^{a}}\sum\limits_{j=1}^{n_{t}^{a}}{\left[{d_{t}^{jk}-IT_{ak}^{\textit{mean}}}\right]^{2}}$ (7)
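The mean and variance tables can be illustrated with a small sketch. It is not the authors' implementation: the function name `build_mean_var_tables`, the dictionary layout and the toy data are hypothetical, and the variance is taken to be the mean squared deviation from the class mean (dividing by the class sample count), which is an assumption of this sketch.

```python
from collections import defaultdict


def build_mean_var_tables(samples, labels):
    """Build per-class mean and variance tables for one chunk.

    samples: list of rows, each a list of m numeric attribute values.
    labels:  class label of each row.
    Returns two dicts keyed by class label; each value holds one entry
    per attribute, so together they form the c x m tables.
    """
    rows_by_class = defaultdict(list)
    for row, label in zip(samples, labels):
        rows_by_class[label].append(row)

    mean_table, var_table = {}, {}
    for label, rows in rows_by_class.items():
        count = len(rows)
        width = len(rows[0])
        means = [sum(r[k] for r in rows) / count for k in range(width)]
        # Assumption: population variance (divide by the class sample count).
        variances = [
            sum((r[k] - means[k]) ** 2 for r in rows) / count
            for k in range(width)
        ]
        mean_table[label] = means
        var_table[label] = variances
    return mean_table, var_table


# Toy chunk with two attributes and two classes (made-up values).
chunk = [[1.0, 10.0], [3.0, 14.0], [2.0, 20.0], [4.0, 24.0]]
classes = ["skin", "skin", "non-skin", "non-skin"]
mean_table, var_table = build_mean_var_tables(chunk, classes)
print(mean_table["skin"], var_table["skin"])  # [2.0, 12.0] [1.0, 4.0]
```

Keeping only these per-class summaries, rather than the raw rows, is what later allows the model to be updated without rescanning historic data.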

The information table for the correlation values of the attributes has size $c\times 1$ . For all the attributes, the virtual correlation factor is given as:

$\displaystyle IT_{a}^{\textit{cor}}=f\left({a_{1},a_{2},\ldots,a_{m_{t}}}\right)$ (8)

where, $f\left({a_{1},a_{2},\ldots,a_{m_{t}}}\right)$ represents the function of attributes in the given data and it is computed as follows:

$\displaystyle f\left({a_{1},a_{2},\ldots,a_{m_{t}}}\right)=\frac{1}{1+2+\cdots+\left({m_{t}-1}\right)}\sum\limits_{k=1}^{m_{t}}\sum\limits_{f=k+1}^{m_{t}}{r\left({a_{f},a_{k}}\right)}$ (9)

Since $1+2+\cdots+\left({m_{t}-1}\right)=\frac{m_{t}\left({m_{t}-1}\right)}{2}$ by the triangular number series, the above equation becomes,

$\displaystyle f\left({a_{1},a_{2},\ldots,a_{m_{t}}}\right)=\frac{1}{\left({\frac{m_{t}\left({m_{t}-1}\right)}{2}}\right)}\sum\limits_{k=1}^{m_{t}}\sum\limits_{f=k+1}^{m_{t}}{r\left({a_{f},a_{k}}\right)}$ (10)

$\displaystyle f\left({a_{1},a_{2},\ldots,a_{m_{t}}}\right)=\frac{2}{m_{t}\left({m_{t}-1}\right)}\sum\limits_{k=1}^{m_{t}}\sum\limits_{f=k+1}^{m_{t}}{r\left({a_{f},a_{k}}\right)}$ (11)

where, $a_{f},a_{k}$ represent the sets of data.

The relationship between the sets of data can be determined with the correlation. The correlation function for independent data sets for performance improvement is given by,

$\displaystyle r\left({a_{f},a_{k}}\right)=\left[{\frac{\textit{correlative}\left({a_{f},a_{k}}\right)+1}{2}}\right]$ (12)

The ratio of the covariance of two variables to the product of their standard deviations is known as the Pearson correlation coefficient. The linear correlation between two data sets can be estimated using $\textit{correlative}\left({a_{f},a_{k}}\right)$ , which is the Pearson correlation coefficient:

$\displaystyle\textit{correlative}\left({a_{f},a_{k}}\right)=\frac{\sum\limits_{i=1}^{n}\left({d_{ik}-\bar{d}_{k}}\right)\left({d_{if}-\bar{d}_{f}}\right)}{\sqrt{\sum\limits_{i=1}^{n}\left({d_{ik}-\bar{d}_{k}}\right)^{2}}\sqrt{\sum\limits_{i=1}^{n}\left({d_{if}-\bar{d}_{f}}\right)^{2}}}$ (13)
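The correlation factor of one class can be sketched as follows: the Pearson coefficient of every attribute pair is rescaled from $[-1,1]$ to $[0,1]$ and the rescaled values are averaged over all pairs. The function names and the toy rows are hypothetical, and returning 0 for a constant attribute (where the coefficient is undefined) is an assumption of this sketch rather than something stated in the paper.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0 or dev_y == 0:
        return 0.0  # Assumption: a constant attribute counts as uncorrelated.
    return cov / (dev_x * dev_y)


def correlation_factor(rows):
    """Average rescaled pairwise correlation over all attribute pairs.

    rows: samples of ONE class, each a list of m >= 2 attribute values.
    """
    width = len(rows[0])
    columns = [[r[k] for r in rows] for k in range(width)]
    pairs = list(combinations(range(width), 2))  # m(m-1)/2 pairs
    total = sum((pearson(columns[f], columns[k]) + 1) / 2 for k, f in pairs)
    return total / len(pairs)


# Attribute 2 rises with attribute 1; attribute 3 falls with both.
rows = [[1, 2, 3], [2, 4, 2], [3, 6, 1]]
print(round(correlation_factor(rows), 3))  # 0.333
```

In the toy data the three pairwise coefficients are 1, -1 and -1, which rescale to 1, 0 and 0, so the factor is 1/3.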

The class-conditional independence assumption greatly simplifies the training step since we can estimate the one-dimensional class-conditional density for each predictor individually. While the class-conditional independence between predictors is not true in general, research shows that this optimistic assumption works well in practice. This assumption of class-conditional independence of the predictors allows the classifier to estimate the parameters required for accurate classification while using less training data than many other classifiers. This makes it particularly effective for data sets containing many predictors.

Classification: The constructed model is then used in the classification process, which identifies the class with the highest posterior probability for the input data. The algorithm leverages Bayes theorem, and (naively) assumes that the predictors are conditionally independent, given the class. Though the assumption is usually violated in practice, naive Bayes classifiers tend to yield posterior distributions that are robust to biased class density estimates. The posterior probability is calculated with the given data $d_{t}^{x}$ and the corresponding class $C_{a}$ using the following equations.

$\displaystyle C\left({d_{t}^{x}}\right)=\mathop{\textit{Max}}\limits_{a=1}^{C}\left({\textit{posterior}\left({C_{a}\left|{d_{t}^{x}}\right.}\right)\ast IT_{a}^{\textit{cor}}}\right)$ (14)

$\displaystyle\textit{posterior}\left({C_{a}\left|{d_{t}^{x}}\right.}\right)=\frac{P\left({C_{a}}\right)*\mathop{\Pi}\limits_{k=1}^{m_{t}}P\left({A_{k}^{t}\left|{C_{a}}\right.}\right)}{\textit{Evidence}}$ (15)

where, $P\left({C_{a}}\right)$ is the probability of occurrence of class $C_{a}$ and $P\left({A_{k}^{t}\left|{C_{a}}\right.}\right)$ is the conditional probability of the attribute $A_{k}^{t}$ given the class $C_{a}$ . The evidence is the normalizing term, obtained by summing the numerator of Eq. (15) over all classes, so that the posterior probabilities of the classes sum to one:

$\displaystyle\textit{Evidence}=\sum\limits_{a=1}^{C}{P\left({C_{a}}\right)*\mathop{\Pi}\limits_{k=1}^{m_{t}}P\left({A_{k}^{t}\left|{C_{a}}\right.}\right)}$ (16)

On the whole, the conditional probability can be calculated using the below equation.

$\displaystyle P\left({A_{k}^{t}\left|{C_{a}}\right.}\right)=\frac{1}{\sqrt{2\pi\ast IT_{ak}^{\text{var}}}}*\exp\left({\frac{-\left({d_{t}^{x}-IT_{ak}^{\textit{mean}}}\right)^{2}}{2\ast IT_{ak}^{\text{var}}}}\right)$ (17)

where, $IT_{ak}^{\text{var}}$ represents the variance of $k^{\text{th}}$ attribute of $a^{\text{th}}$ class, $IT_{ak}^{\textit{mean}}$ represents the mean of $k^{\text{th}}$ attribute of $a^{\text{th}}$ class, $d_{t}^{x}$ represents the input data.
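The classification rule can be sketched in a few lines: a Gaussian likelihood per attribute, a normalized posterior per class, and a final score that multiplies the posterior by the class correlation factor. All names and numbers below are hypothetical. The sketch assumes that the evidence is the sum of the class-wise products of prior and likelihoods (the usual normalizer), that each attribute value of the sample is compared with the mean of the same attribute, and that every variance is strictly positive.

```python
import math


def gaussian_likelihood(x, mean, var):
    """Gaussian density of value x for a given mean and variance (var > 0)."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def classify(sample, priors, mean_table, var_table, cor_table):
    """Return (best_label, scores) where score = posterior * correlation factor."""
    joint = {}
    for label, prior in priors.items():
        value = prior
        for k, x in enumerate(sample):
            value *= gaussian_likelihood(x, mean_table[label][k], var_table[label][k])
        joint[label] = value  # prior times product of attribute likelihoods
    # Assumption: evidence is the sum of the joint terms over all classes.
    # For many attributes, log-probabilities would avoid numeric underflow.
    evidence = sum(joint.values())
    scores = {
        label: joint[label] / evidence * cor_table[label] for label in joint
    }
    best = max(scores, key=scores.get)
    return best, scores


# Made-up tables for two classes and two attributes.
priors = {"skin": 0.5, "non-skin": 0.5}
mean_table = {"skin": [2.0, 12.0], "non-skin": [3.0, 22.0]}
var_table = {"skin": [1.0, 4.0], "non-skin": [1.0, 4.0]}
cor_table = {"skin": 0.5, "non-skin": 0.5}

label, scores = classify([2.0, 13.0], priors, mean_table, var_table, cor_table)
print(label)  # skin
```

When the correlation factors of all classes are equal, the decision reduces to the ordinary Gaussian Naïve Bayes rule; the factor changes the outcome only when the posteriors are close and the factors differ.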

5. Adapting Pearson Gaussian Naïve Bayes Classifier for data stream classification with recurring concept drift

The next process is the updating of the developed PGNBC model with the input data. The updating process consists of detecting the change of concept drift (COD) by rough set theory [25], updating the Gaussian Naïve Bayes classifier (GNBC) model and updating the important features. Figure 2 illustrates the constructed PGNBC model. At different time intervals, the dynamic feature model is built with different features. The information table is created with the input data. Rough set theory is exploited for the concept drift problem. The important features are updated and the classification is performed. Figure 3 shows the algorithmic description of the data stream classification.

Figure 2.

Block diagram of the proposed Pearson Gaussian Naïve Bayes Classifier for data stream classification with recurring concept drift.

Figure 3.

Algorithmic description of the data stream classification.

5.1 Detecting concept drift by rough set theory

Once a new data stream arrives for classification at time interval $t$ , the data is classified based on the current model $IT_{1:t}$ , and the model is updated to $IT_{1:t+1}$ once the class information is known. Here, the assumption is that at the time interval $t+1$ , the class label of the previous data stream at time interval $t$ is known. Detecting concept drift is important because the classifier should update its model based on the characteristics of the new data.

At a particular time interval $t$ , the obtained data is subjected to class function depending upon the model developed at $IT_{1:t}$ . The next updating will be performed with the new data which have the concept drift, so that the boundary of the data is altered inside or outside. It is important to find the exact time stamp for updating the classifier when a concept drift is detected. Here, rough set theory is the right choice to find the concept drift as it is based upon the approximation of sets by a pair of sets known as the lower approximation and upper approximation of the set. Here, the lower and upper approximation operators are based on equivalence relation. However, the requirement of equivalence relation is a restrictive condition to have the idea of approximation of concept drift.

The lower approximation, which lies within the boundary, and the upper approximation, which extends outside the boundary, are represented as $\underline{P}Y$ and $\overline{P}Y$ respectively. The upper approximation contains every equivalence class that has a non-empty intersection with the target set. The lower approximation is indicated as $\underline{P}Y=\left\{{y\left|{\left[y\right]_{a}\subseteq Y}\right.}\right\}$ and the upper approximation is indicated as $\overline{P}Y=\left\{{y\left|{\left[y\right]_{a}\cap Y\neq\emptyset}\right.}\right\}$ .

The ratio of the cardinality of the lower approximation $\underline{P}Y$ to the cardinality of the upper approximation $\overline{P}Y$ gives the accuracy of the approximation, as shown below.

$\displaystyle\textit{COD}\left(Y\right)=\frac{\left|{\underline{P}Y}\right|}{\left|{\overline{P}Y}\right|}$ (18)

The approximation accuracy is calculated and compared with the predefined threshold ( $T_{\textit{COD}}$ ). If $\textit{COD}\left(Y\right)<T_{\textit{COD}}$ , then the PGNBC model undergoes the updating process. Fixing the right threshold for detecting the concept drift is a significant problem; here it is identified through experimentation.
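The drift test can be sketched with plain sets. Objects with identical attribute tuples form one equivalence class; the lower approximation collects the classes that lie entirely inside the target set, the upper approximation collects the classes that touch it, and the ratio of their sizes is compared with the threshold. How the paper discretizes the attributes and chooses the target set is not specified here, so the inputs below (already discretized tuples, a target set of object ids, the names `approximation_accuracy` and `needs_update`) are assumptions made only for illustration.

```python
from collections import defaultdict


def approximation_accuracy(objects, target):
    """Size of the lower approximation divided by size of the upper one.

    objects: dict mapping object id -> tuple of discretized attribute values.
    target:  set of object ids forming the target set Y.
    """
    blocks = defaultdict(set)  # equivalence classes of indiscernible objects
    for obj_id, attrs in objects.items():
        blocks[attrs].add(obj_id)

    lower, upper = set(), set()
    for block in blocks.values():
        if block <= target:   # block entirely inside Y
            lower |= block
        if block & target:    # block intersects Y
            upper |= block
    # Assumption: an empty upper approximation is treated as "no drift".
    return len(lower) / len(upper) if upper else 1.0


def needs_update(objects, target, threshold):
    """True when the accuracy falls below the drift threshold."""
    return approximation_accuracy(objects, target) < threshold


objects = {1: ("low",), 2: ("low",), 3: ("high",), 4: ("high",), 5: ("mid",)}
target = {1, 2, 3}
print(approximation_accuracy(objects, target))  # 0.5
print(needs_update(objects, target, threshold=0.7))  # True
```

In the example, the block {1, 2} is fully inside the target set while {3, 4} only overlaps it, so the lower approximation has 2 objects and the upper has 4.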

5.2 Updating PGNBC model

The PGNBC model is updated with the new information table built from the data $d_{t}$ at time interval $t+1$ . The historic raw data are not needed for the update. The information table is updated by updating its mean, variance and correlation components. The information table is given by,

$\displaystyle IT_{1:t+1}=\left\{{IT_{1:t+1}^{\textit{mean}},IT_{1:t+1}^{\text{var}},IT_{1:t+1}^{\textit{cor}}}\right\}$ (19)

where, $IT_{1:t+1}^{\textit{mean}},IT_{1:t+1}^{\text{var}}$ and $IT_{1:t+1}^{\textit{cor}}$ represent the information tables for mean, variance and correlation respectively. The first element of the information table is the mean, which is updated using the mean of the new data. The following equation gives the mean stored in the updated information table. It combines the mean of the historic data, which is already present in the information table, and the mean of the current data, each weighted by its number of data instances.

$\displaystyle IT_{1:t+1}^{\textit{mean}}=\frac{IT_{1:t}^{\textit{mean}}*n_{1:t}+IT_{t+1}^{\textit{mean}}*n_{t+1}}{n_{1:t}+n_{t+1}}$ (20)

where, $IT_{1:t}^{\textit{mean}}$ , $n_{1:t}$ , $IT_{t+1}^{\textit{mean}}$ , and $n_{t+1}$ indicate the information table for mean at time interval $t$ , the count of the data from time interval 1 to $t$ , the information table constructed only from the new data at time interval $t+1$ , and the count of the data at time interval $t+1$ respectively. For the information table belonging to variance, the following equation is used. This equation considers the variance of the historic data and the variance of the new data along with the number of data instances. The variance weighted by the number of data instances provides the variance measure that is stored in the updated information table.

$\displaystyle IT_{1:t+1}^{\text{var}}=\frac{IT_{1:t}^{\text{var}}*n_{1:t}+IT_{t+1}^{\text{var}}*n_{t+1}}{n_{1:t}+n_{t+1}}$ (21)

where, $IT_{1:t}^{\text{var}}$ is the information table belonging to variance at time interval $t$ , and $IT_{t+1}^{\text{var}}$ is the information table constructed only from the new data at time interval $t+1$ . For updating the correlation, the following equation is used. Here, the average of the correlation of the historic data and the correlation of the current data is taken.

$\displaystyle IT_{1:t+1}^{\textit{cor}}=\frac{IT_{1:t}^{\textit{cor}}+IT_{t+1}^{\textit{cor}}}{2}$ (22)

where, $IT_{1:t}^{\textit{cor}}$ is the information table belonging to correlation at time interval $t$ , and $IT_{t+1}^{\textit{cor}}$ is the correlation in the information table constructed based on only the new data at time interval $t+1$ .
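The three update rules need only the stored summaries and the two sample counts, which the following sketch makes concrete for a single class. The dictionary layout with keys `mean`, `var` and `cor`, the function name and the numbers are hypothetical.

```python
def update_class_tables(old, n_old, new, n_new):
    """Merge the stored summary of one class with the summary of a new chunk.

    old, new: dicts with keys "mean" (list), "var" (list) and "cor" (float).
    n_old, n_new: number of samples behind each summary.
    Returns the merged summary and the new total count.
    """
    total = n_old + n_new
    # Count-weighted average of the means, attribute by attribute.
    mean = [
        (m_old * n_old + m_new * n_new) / total
        for m_old, m_new in zip(old["mean"], new["mean"])
    ]
    # Count-weighted average of the variances, attribute by attribute.
    var = [
        (v_old * n_old + v_new * n_new) / total
        for v_old, v_new in zip(old["var"], new["var"])
    ]
    # Plain average of the old and new correlation factors.
    cor = (old["cor"] + new["cor"]) / 2
    return {"mean": mean, "var": var, "cor": cor}, total


history = {"mean": [2.0], "var": [1.0], "cor": 0.6}
incoming = {"mean": [4.0], "var": [3.0], "cor": 0.8}
merged, count = update_class_tables(history, 30, incoming, 10)
print(merged["mean"], merged["var"], round(merged["cor"], 2), count)
# [2.5] [1.5] 0.7 40
```

Two properties of these rules are worth keeping in mind. The count-weighted average of the variances does not add a term for the shift between the old and new means, so it equals the variance of the pooled data only when the two means coincide. The correlation update weights history and the newest chunk equally, regardless of how many samples each represents.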

5.3 Updating important features

The dimension of the data can be minimized with the feature selection process. The objective of feature selection is to reduce the size of the feature vector without sacrificing the performance of the categorization. Feature selection methods are divided into two main approaches: filtering [28], and wrapper [29]. In the wrapper approach, a classifier evaluates many different subsets and selects one with the highest accuracy rate. On the other hand, in filtering approaches, the construction of the final subset is not classifier dependent. Both approaches select a subset of features using an evaluation function. A common measure of relevance in many feature selection algorithms is based on the entropy, considered to be a good indicator of relations between the input feature and the target variable.

The entropy [23], a feature evaluation function computed for every attribute of the data, can be used to minimize the dimension of the data. In particular, features are selected based on their degree of significance: features with a high degree of significance are selected into the final subset. The features with low entropy values are considered for the final sequence of processes.

$\displaystyle F(a_{t}^{k})=-\sum\limits_{i=1}^{u(a_{t}^{k})}{P_{i}\log(P_{i})}$ (23)

where, $a_{t}^{k}$ is the attribute vector and $u(a_{t}^{k})$ is the number of unique values in the attribute vector.
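A short sketch of entropy-based feature ranking is given here. It assumes that $P_{i}$ is the fraction of samples taking the $i^{\text{th}}$ unique value of the attribute, that the natural logarithm is used (the base is not stated in the text), and that continuous attributes have already been discretized. The function names and the toy columns are hypothetical.

```python
import math
from collections import Counter


def attribute_entropy(values):
    """Shannon entropy of one attribute vector (natural logarithm)."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def select_features(columns, keep):
    """Return the names of the `keep` attributes with the lowest entropy.

    columns: dict mapping attribute name -> list of (discretized) values.
    """
    ranked = sorted(columns, key=lambda name: attribute_entropy(columns[name]))
    return ranked[:keep]


columns = {
    "x": [1, 1, 1, 2],  # mostly one value: low entropy
    "y": [1, 2, 3, 4],  # all values distinct: highest entropy
    "z": [1, 1, 2, 2],  # evenly split between two values
}
print(select_features(columns, keep=2))  # ['x', 'z']
```

The entropy of an attribute with $u$ equally frequent values is $\log u$, so attributes with many distinct values rank last under this criterion.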

5.4 Classification using updated PGNBC model considering recurring concept drift

The classification is done with the updated PGNBC model by giving importance to the recurring concept drift. The newly obtained data is represented as $d_{t+1}$ . Since the feature space is recurring, the attributes of the new data may be new or may have appeared before. The addition of new attributes changes the mean and variance values. The new information table is given as:

$\displaystyle IT_{1:t+1}\Rightarrow IT_{1:t+1}^{\text{Rec}}$ (24)

So, the information table obtained is in the minimized form and can be used for the prediction of the class label of the new data with the posterior probability function.

$\displaystyle C\left({d_{t+1}^{x}}\right)=\mathop{\textit{Max}}\limits_{a=1}^{C}\left({\textit{posterior}\left({C_{a}\left|{d_{t+1}^{x}}\right.}\right)*O_{t}\ast IT_{1:t+1}^{\textit{cor}}}\right)$ (25)

where, $O_{t}$ is the objective function which can be formulated as,

$\displaystyle O_{t}=\frac{1}{3}(E_{t}+\textit{Sen}_{t}+\textit{Spec}_{t})$ (26)

The objective function considers the accuracy, sensitivity and specificity at time interval $t$ , represented as $E_{t}$ , $\textit{Sen}_{t}$ and $\textit{Spec}_{t}$ respectively. The accuracy $E_{t}$ is calculated using the following equation:

$\displaystyle E_{t}=\frac{1}{t}\sum\limits_{i=1}^{t}{PE_{i}}$ (27)

where,

$\displaystyle PE_{t}=\frac{n_{t}^{c}}{n_{t}}=\frac{\text{number of data samples correctly classified at time }t}{\text{number of data samples at time }t}$ (28)
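As a worked illustration of the objective function, the sketch below averages the per-chunk accuracy ratios seen so far and then takes the plain mean of accuracy, sensitivity and specificity. The function names and the counts are hypothetical, and the sensitivity and specificity values are simply passed in as numbers.

```python
def running_accuracy(correct_counts, chunk_sizes):
    """Mean of the per-chunk accuracy ratios for chunks 1..t."""
    ratios = [c / n for c, n in zip(correct_counts, chunk_sizes)]
    return sum(ratios) / len(ratios)


def objective(accuracy, sensitivity, specificity):
    """Equal-weight average of the three performance measures."""
    return (accuracy + sensitivity + specificity) / 3


# Two chunks of 10 samples with 8 and 9 correct predictions (made-up counts).
accuracy = running_accuracy([8, 9], [10, 10])
print(round(accuracy, 2))                        # 0.85
print(round(objective(accuracy, 0.8, 0.9), 2))   # 0.85
```

Because the objective multiplies the score of every class by the same value at a given time interval, it scales the scores without changing which class attains the maximum within that interval.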

6. Results and discussion

The experimental results of the proposed PGNBC model are discussed, and a comparative analysis with the existing models RGNBC and MReC-DFS is carried out using sensitivity, specificity and accuracy as performance measures.

6.1 Dataset description

The Skin Segmentation data set and the Localization data set, collected from the UC Irvine Machine Learning Repository [22], are used for the experimental analysis.

Skin Segmentation data set (database 1): The skin dataset is accumulated by using the random sampling of B, G, R values from face images of different age groups, race groups, and genders obtained from FERET database and PAL database. The total number of instances is 245057; among them 50859 are the skin samples and 194198 are non-skin samples.

Localization data for person activity data set (database 2): This database contains data recorded from people wearing four tags (ankle left, ankle right, belt and chest). These tags can be detected by anyone attributes. The total number of instances is 164860.

6.2 Evaluation metrics

The performance evaluation is done using sensitivity, specificity and accuracy metrics. Sensitivity refers to the proportion of actual positives that are correctly identified by a diagnostic test. Sensitivity is also called True Detection Percentage (TDP).

$\displaystyle\textit{Sensitivity}=TP/\left({TP+FN}\right)$ (29)

Specificity is the proportion of the true negatives correctly identified by a diagnostic test to predict how good the test is for identifying the normal (negative) condition.

$\displaystyle\textit{Specificity}=TN/\left({TN+FP}\right)$ (30)

Accuracy shows the proportion of true results, which may be either true positive or true negative in a population, thereby measuring the degree of veracity of a diagnostic test on a particular condition.

$\displaystyle\textit{Accuracy}=\left({TN+TP}\right)/\left({TN+TP+FN+FP}\right)$ (31)

False detection percentage (FDP) is the ratio of false positives to the total number of samples classified as positive.

$\displaystyle\textit{FDP}=FP/\left({TP+FP}\right)$ (32)

Update delay is the time (in sec) between the arrival of new data and the updating of the model.

True positive (TP) denotes positive samples that are correctly identified, false positive (FP) denotes negative samples that are incorrectly identified as positive, true negative (TN) denotes negative samples that are correctly rejected, and false negative (FN) denotes positive samples that are incorrectly rejected.
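The metrics in Eqs (29)–(32) can be illustrated with a short Python sketch. It is only a minimal example for a binary problem and not the Java implementation used in the experiments. The function names, the toy labels and the convention of returning 0.0 for an empty denominator are assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN and FN for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1   # correctly identified
            else:
                fp += 1   # incorrectly identified
        else:
            if truth == positive:
                fn += 1   # incorrectly rejected
            else:
                tn += 1   # correctly rejected
    return tp, fp, tn, fn


def evaluation_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, accuracy and FDP as in Eqs (29)-(32)."""
    def ratio(num, den):
        # Assumption: an empty denominator yields 0.0 instead of an error.
        return num / den if den else 0.0

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tn + tp, tn + tp + fn + fp),
        "fdp": ratio(fp, tp + fp),
    }


# Toy labels (hypothetical, not taken from the skin or localization data).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
metrics = evaluation_metrics(tp, fp, tn, fn)
print(tp, fp, tn, fn)
print(metrics)
```

For the toy labels, TP = 3, FP = 1, TN = 3 and FN = 1, so sensitivity, specificity and accuracy are all 0.75 and the FDP is 0.25.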

6.3 Experimental set up

The proposed PGNBC model is implemented in Java 1.7 using the NetBeans IDE 7.3. The experiments are run on a Windows 8.1 system with a 2.2 GHz i5 processor, 4 GB of RAM and a 64-bit operating system. The input data are divided into $t$ chunks, where $t$ is supplied by the user, and the samples for every chunk are selected from the input data at random. Throughout this section, "chunk size" refers to this number of chunks $t$: when it is large, each chunk contains fewer samples, and when it is small, each chunk contains more samples. Sensitivity, specificity and accuracy are used to compare the performance of the proposed method with the existing RGNBC and MReC-DFS [6] systems. The best value used in the comparative analysis is determined with respect to the COD threshold parameter ( $T_{\textit{COD}}$ ).

RGNBC: This scheme uses the Gaussian Naïve Bayes classifier, and the model is updated based on rough set theory. It is almost identical to the proposed PGNBC method, except that the classification model being updated is the existing Gaussian Naïve Bayes classifier.
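The chunking step can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the random selection is modelled as a shuffle followed by a partition without replacement, which the text does not specify, and `split_into_chunks` and `seed` are hypothetical names that do not come from the Java implementation.

```python
import random


def split_into_chunks(samples, t, seed=None):
    """Randomly partition `samples` into `t` chunks of near-equal size."""
    if t < 1 or t > len(samples):
        raise ValueError("t must be between 1 and the number of samples")
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    # Chunk i takes every t-th shuffled sample starting at offset i,
    # so chunk lengths differ by at most one.
    return [shuffled[i::t] for i in range(t)]


# Toy stream of 100 sample indices (hypothetical data).
data = list(range(100))
for t in (5, 10, 25):
    chunks = split_into_chunks(data, t, seed=0)
    print(t, len(chunks[0]))
```

With 100 samples, $t = 5$ gives 20 samples per chunk and $t = 25$ gives 4 samples per chunk, which illustrates why a larger number of chunks means fewer samples in each one.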

MReC-DFS: This is a data stream classification system to address the challenges of learning recurring concepts in a dynamic feature space while simultaneously reducing the memory cost associated with storing past models. MReC-DFS is able to detect and adapt to the concept changes using the performance of the learning process and contextual information. To handle recurring concepts, stored models are combined in a dynamically weighted ensemble.

6.4 Performance evaluation of the COD threshold

Skin data: Using the skin segmentation data, the sensitivity, specificity and accuracy are evaluated at different COD thresholds (0.2, 0.3, 0.4 and 0.5), as shown in Fig. 4. In Fig. 4a, the sensitivity is high, about 66.6%, when the threshold for change of concept drift (TCOD) is 0.3. All the curves overlap when the chunk size is between 5 and 10. At the higher threshold values of 0.4 and 0.5, the sensitivity is poorer, at about 65%. The curves for TCOD 0.2 and 0.3 have the same sensitivity. When the chunk size is 25, the sensitivity improves. The difference between the 0.3 and 0.4 threshold curves is about 2%. The TCOD 0.3 and 0.2 curves dip at chunk size 20, which indicates poor performance there. For specificity, the higher-threshold curves start at 68% for small chunk sizes and rise to 85% as the chunk size increases. The lower-threshold curves, in contrast, show high specificity at small chunk sizes and decline slightly as the chunk size increases. At chunk size 25, the specificity is 70% when TCOD is fixed to 0.3 or 0.2. The accuracy is greatly improved, reaching 74%, when the TCOD is 0.4 or 0.5. Between chunk sizes 5 and 10, the 0.4 and 0.5 curves are approximately linear; they then rise steeply to an accuracy of 74%. For the lower threshold values, the accuracy is only about 67%. At chunk size 10, the lower-threshold curve shows a peak accuracy of 74%, but it then bends down and the accuracy drops considerably. This clearly shows that the accuracy is poorer for the lower threshold values.

Figure 4.

Performance evaluation of skin data at varying COD threshold with metrics: (a) sensitivity, (b) specificity and (c) accuracy.

Localization data: The sensitivity, specificity and accuracy at different COD threshold values (0.2, 0.3, 0.4 and 0.5) are estimated with the localization data, as shown in Fig. 5. About 80% sensitivity is achieved when the higher threshold values are used. At chunk size 20, the higher-threshold curves drop to a low sensitivity of 76%, which almost matches that of the lower-threshold curves. A deviation of about 20% is seen between the higher- and lower-threshold curves. At chunk size 5, the lower-threshold curves have a high sensitivity of about 79%, which then drops suddenly to 76%, marking poor performance. The specificity is better for the higher threshold values. When the threshold is low, the curves perform better at small chunk sizes, but the specificity gradually declines as the chunk size increases. For the higher-threshold curves, the specificity is 66% at chunk size 5, and there is only a small improvement at chunk size 25. Comparing the lower and higher TCODs, the higher TCODs perform better. The difference between TCOD $=$ 0.3 and TCOD $=$ 0.4 is 0.3% at a chunk size of 25, but the difference in specificity between the curves is larger at chunk size 15. The accuracy is better for the curves with lower threshold values. At chunk size 25, the TCOD 0.3 curve reaches 76% accuracy, whereas the TCOD 0.5 curve reaches only 74%, which shows the effectiveness of the lower threshold values at this chunk size. At smaller chunk sizes, the 0.5 threshold curve performs better than the 0.2 threshold curve up to chunk size 20 and then decreases. The accuracy difference between the TCOD $=$ 0.4 and TCOD $=$ 0.3 curves is about 23% at chunk size 5.

Figure 5.

Performance evaluation of localization data at varying COD threshold with metrics: (a) sensitivity, (b) specificity and (c) accuracy.

6.5 Comparative analysis

Sensitivity: Figure 6 shows the comparative analysis of the sensitivity measure for the proposed PGNBC method and the conventional RGNBC and MReC-DFS methods. With the skin data at chunk size 5, the sensitivity of the proposed method is 4% better than that of RGNBC and 5% better than that of MReC-DFS. The PGNBC and RGNBC curves run parallel, keeping the same difference in sensitivity at all chunk sizes. For MReC-DFS, the sensitivity is close to 60% at chunk size 5 and close to 61% at chunk size 25. The RGNBC and MReC-DFS curves are very close to each other, with very little deviation in sensitivity. With the skin data, the proposed PGNBC method has the highest sensitivity and is clearly separated from the other two existing methods.

When the localization data are used, the sensitivity is very high for both the proposed PGNBC and the existing RGNBC models; it is about 80% for the proposed method. Compared with the MReC-DFS curve, the curve of the proposed method differs in sensitivity by 20%. The existing MReC-DFS curve shows a slight decrease in sensitivity as the chunk size increases. The curve of the existing MReC-DFS method runs parallel to the curve of the proposed method. For the proposed method, the sensitivity increases as the chunk size increases. The sensitivity of the proposed method is higher with the localization data than with the skin data.

Figure 6.

Comparative analysis of sensitivity measure for the proposed PGNBC, conventional RGNBC and MReC-DFS using (a) skin data and (b) localization data.

Specificity: The specificity of the proposed method is improved for both the skin data and the localization data, as shown in Fig. 7. With the skin data, the specificity of the RGNBC method is moderate compared with that of the proposed method. The specificity of the MReC-DFS method is very low, at 60% for chunk size 5 and 58% for chunk size 25, which shows that it decreases as the chunk size increases. For the proposed method, the curve rises steeply at chunk size 10, reaching 85% specificity. The difference between PGNBC and RGNBC is small, about 2%, so the RGNBC curve stays close to that of the proposed method.

When the localization data are used, the specificity of the MReC-DFS method is better than it is with the skin data. The specificity of the proposed method is about 66%, whereas RGNBC achieves only 62% and the existing MReC-DFS method about 60%. The specificity of MReC-DFS decreases considerably as the chunk size increases, and that of RGNBC also decreases as the number of chunks increases. The deviation between RGNBC and MReC-DFS is high, while the deviation between the proposed method and RGNBC is about 3%. Overall, the specificity of the proposed method is higher when the skin segmentation data are used.

Figure 7.

Comparative analysis of specificity measure for the proposed PGNBC, conventional RGNBC and MReC-DFS using (a) skin data and (b) localization data.

Figure 8.

Comparative analysis of accuracy measure for the proposed PGNBC, conventional RGNBC and MReC-DFS using (a) skin data and (b) localization data.

Accuracy: Figure 8 shows the comparative analysis of the accuracy measure for the proposed PGNBC method and the conventional RGNBC and MReC-DFS methods. With the skin data, the accuracy of the proposed method is 3% better than that of RGNBC and 5% better than that of MReC-DFS. The PGNBC and RGNBC curves run parallel to each other. As the chunk size increases, PGNBC reaches its highest accuracy of 72%. At chunk size 10, the RGNBC and MReC-DFS curves overlap. For PGNBC, the accuracy increases between chunk sizes 10 and 15 and then continues along a straight line. The difference in accuracy between the three curves is small. However, the MReC-DFS curve shows a decrease in accuracy as the number of chunks increases.

The comparative study with the localization data also shows the increased accuracy of the proposed model. The accuracy increases as the chunk size increases up to 20 and then decreases significantly at chunk size 25. Nevertheless, the proposed method remains better than the other two existing methods. For MReC-DFS, the curve decreases as the chunk size increases. The accuracy is about 74% for PGNBC and 71% for RGNBC. A large deviation is found between the RGNBC and MReC-DFS curves, which reflects the lower accuracy of the existing MReC-DFS method and, in turn, the effectiveness of the proposed method. The difference in accuracy is about 6% between the existing MReC-DFS and the proposed method at lower chunk sizes, and about 1% between RGNBC and the proposed PGNBC method. On the whole, the accuracy of the proposed method is better when the localization data are used.

FDP: Figure 9a shows the comparative analysis of the methods in terms of FDP using the skin data. The proposed PGNBC method performs better than the RGNBC and MReC-DFS methods: it obtains an FDP of 20.45%, compared with 24.45% for the RGNBC method. As the chunk size increases, the performance degrades and the FDP reaches its maximum. The corresponding analysis for the localization data is given in Fig. 9b. Here, the proposed PGNBC method obtains a minimum FDP of 22.275%, compared with 25.20% for the existing RGNBC method. Overall, the proposed PGNBC outperforms the existing RGNBC and MReC-DFS methods by achieving the lowest FDP.

Figure 9.

Comparative analysis of FDP measure for the proposed PGNBC, conventional RGNBC and MReC-DFS using (a) skin data and (b) localization data.

Update delay: Figure 10a shows the update delay of the proposed PGNBC method and the existing RGNBC and MReC-DFS methods using the skin data. The proposed method shows the minimum delay compared with the existing RGNBC and MReC-DFS methods. For example, when the chunk size is fixed at 5, the proposed system shows a maximum delay of 6 sec, whereas the existing MReC-DFS method requires 12 sec. For a chunk size of 25, the proposed PGNBC method and the existing RGNBC method both show a delay of 3 sec. The update delay for the localization data is plotted in Fig. 10b. For a chunk size of 5, the proposed PGNBC method and the existing RGNBC method have a delay of 5 sec, whereas the existing MReC-DFS method has a delay of 10 sec. From this analysis, we conclude that the update delay decreases when the chunk size is large, that is, when fewer data samples are updated each time.

Figure 10.

Comparative analysis of update delay measure for the proposed PGNBC, conventional RGNBC and MReC-DFS using (a) skin data and (b) localization data.
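How an update delay of this kind could be measured is sketched below in Python. This is only an assumed measurement procedure, not the one used to produce Fig. 10: the delay is taken as the wall-clock time between handing a chunk to the update routine and the routine returning. `timed_update` and `running_mean_update` are hypothetical names, and the latter is a trivial stand-in for a real model update.

```python
import time


def timed_update(update_fn, chunk):
    """Run `update_fn(chunk)` and return (result, delay in seconds)."""
    start = time.perf_counter()            # new data arrives
    result = update_fn(chunk)              # model is updated
    delay = time.perf_counter() - start    # elapsed time until the update is done
    return result, delay


def running_mean_update(chunk):
    """Hypothetical stand-in for a model update: the mean of the chunk."""
    return sum(chunk) / len(chunk)


model, delay = timed_update(running_mean_update, [1.0, 2.0, 3.0, 4.0])
print(model, delay >= 0.0)
```

Because `time.perf_counter` is a monotonic clock, the measured delay is never negative; absolute values depend on the hardware and are not comparable with the figures reported above.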

7. Conclusion

Handling concept drift with a suitable classifier is a major challenge in data stream classification. This work focused on recurring concept drift and on updating the model accordingly. The Pearson Gaussian Naïve Bayes classification (PGNBC) model was proposed with new dynamic features. In the PGNBC method, correlation and an objective measure were additionally included, and the classification model was updated based on rough set theory. The skin and localization databases were used for the experimentation, and the performance was evaluated with the sensitivity, specificity and accuracy metrics. At higher threshold values, the accuracy, sensitivity and specificity were better for the localization data than for the skin data. With the skin data, PGNBC improved on RGNBC by 4%, 1% and 1% in sensitivity, specificity and accuracy, respectively. With the localization data, the proposed method improved on RGNBC by 6% and 2% in specificity and accuracy, respectively.

References

1. Mena-Torres and J.S. Aguilar-Ruiz, A similarity-based approach for data stream classification, Expert Systems with Applications 41 (2014), 4224–4234.

2. Alippi, Liu, Zhao and Bu, Detecting and Reacting to Changes in Sensing Units: The Active Classifier Case, IEEE Transactions on Systems, Man, and Cybernetics: Systems 44(3) (2013), 353–362.

3. Zhang, Zhou, Wang, B.J. Gao, Zhu and Guo, E-Tree: An Efficient Indexing Structure for Ensemble Models on Data Streams, IEEE Transactions on Knowledge and Data Engineering 27(2) (February 2015), 461–474.

4. Rutkowski, Jaworski, Pietruczuk and Duda, Decision Trees for Mining Data Streams Based on the Gaussian Approximation, IEEE Transactions on Knowledge and Data Engineering 26(1) (January 2014), 108–119.

5. Brzezinski and Stefanowski, Reacting to Different Types of Concept Drift: The Accuracy Updated Ensemble Algorithm, IEEE Transactions on Neural Networks and Learning Systems 25(1) (January 2014), 81–94.

6. J.B. Gomes, M.M. Gaber, P.A.C. Sousa and Menasalvas, Mining Recurring Concepts in a Dynamic Feature Space, IEEE Transactions on Neural Networks and Learning Systems 25(1) (January 2014), 95–110.

7. M.M. Masud, Chen, Khan, C.C. Aggarwal, Gao, Han, Srivastava and N.C. Oza, Classification and Adaptive Novel Class Detection of Feature-Evolving Data Streams, IEEE Transactions on Knowledge and Data Engineering 25(7) (July 2013), 1484–1497.

8. Abdulsalam, D.B. Skillicorn and Martin, Classification Using Streaming Random Forests, IEEE Transactions on Knowledge and Data Engineering 23(1) (January 2011), 22–36.

9. Fan, Systematic Data Selection to Mine Concept-Drifting Data Streams, in: Proc. ACM SIGKDD 10th Int’l Conf. Knowledge Discovery and Data Mining, 2004, pp. 128–137.

10. Gao, Fan and Han, On Appropriate Assumptions to Mine Data Streams, in: Proc. IEEE Seventh Int’l Conf. Data Mining (ICDM), 2007, pp. 143–152.

11. Hulten, Spencer and Domingos, Mining Time-Changing Data Streams, in: Proc. ACM SIGKDD Seventh Int’l Conf. Knowledge Discovery and Data Mining, 2001, pp. 97–106.

12. Kolter and Maloof, Using Additive Expert Ensembles to Cope with Concept Drift, in: Proc. 22nd Int’l Conf. Machine Learning (ICML), 2005, pp. 449–456.

13. Wang, Fan, P.S. Yu and Han, Mining Concept-Drifting Data Streams Using Ensemble Classifiers, in: Proc. ACM SIGKDD Ninth Int’l Conf. Knowledge Discovery and Data Mining, 2003, pp. 226–235.

14. J.B. Gomes, Menasalvas and Sousa, Tracking recurrent concepts using context, in: Proc. 7th Int. Conf. RSCTC, 2010, pp. 168–177.

15. Gama and Kosina, Tracking recurring concepts with metalearners, in: Proc. 14th Portuguese Conf. Artif. Intell., Oct. 2009, pp. 423.

16. Katakis, Tsoumakas and Vlahavas, On the utility of incremental feature selection for the classification of textual data streams, in: Advances in Informatics, Springer-Verlag, New York, NY, USA, 2005, pp. 338–348.

17. Yang and Zhu, Mining in anticipation for concept change: Proactive-reactive prediction in data streams, Data Mining Knowl. Discovery 13(3) (2006), 261–289.

18. Zhou, Howroyd, Danicic and J.M. Bishop, Extending Naive Bayes Classifier with Hierarchy Feature Level Information for Record Linkage, Advanced Methodologies for Bayesian Networks 9505 (2015), 93–104.

19. Fang and Wang, A Novel Naive Bayes Classifier Model Based on Differential Evolution, Intelligent Computing Theories and Methodologies 9225 (August 2015), 558–566.

20. C.H. Lee, A gradient approach for value weighted classification learning in naive Bayes, Knowledge-Based Systems 85 (September 2015), 71–79.

21. Karabatak, A new classifier for breast cancer detection based on Naïve Bayesian, Measurement 72 (August 2015), 32–36.

22. UC Irvine Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets.html.

23. R.H.W. Pinheiro, G.D.C. Cavalcanti and IngRen, Data-driven global-ranking local feature selection methods for text categorization, Expert Systems with Applications 42(4) (March 2015), 1941–1949.

24. Rish, An empirical study of the naive Bayes classifier, in: Proceedings of IJCAI Workshop on Empirical Methods in AI, 2001.

25. Pawlak, Rough sets, International Journal of Parallel Programming 11(5) (1982), 341–356.

26. Wankhade, Rane and Thool, A new feature selection algorithm for stream Data Classification, in: Proceedings of International Conference on Advances in Computing, Communications and Informatics (ICACCI), August 2013.

27. P.E.N. Lutu, Fast Feature Selection for Naive Bayes Classification in Data Stream Mining, in: Proceedings of the World Congress on Engineering 2013, Vol. III, WCE 2013, July 3–5, 2013.

28. Yu and Liu, Feature selection for high-dimensional data: A fast correlation-based filter solution, in: Proceedings of the International Conference on Machine Learning, 2003, pp. 856–863.

29. Bermejo, Gámez and Puerta, Speeding up incremental wrapper feature subset selection with Naive Bayes classifier, Knowledge-Based Systems 55 (2014), 140–147.