Abstract
Time series classification and the class imbalance problem are two common issues in a multitude of real-life scenarios. This paper explores both issues simultaneously with deep convolutional neural networks (CNNs). Because standard networks treat the majority and minority classes with the same class weights, most CNN-based networks fail to classify imbalanced time series. To date, very little work has applied deep learning to imbalanced time series classification (ITSC). We therefore propose an adaptive cost-sensitive learning strategy to address the ITSC problem. The standard CNN is modified into a cost-sensitive network (CS-CNN), which punishes misclassified samples using a class-dependent cost matrix. Moreover, this cost matrix is automatically updated based on the overall class distribution and the CS-CNN’s training performance. The proposed method is extended to FCN, LSTM-FCN and ResNet. It is experimentally tested on five public benchmark UCR datasets and a real-life large-volume dataset. Four cost-sensitive CNN-based networks are compared with several data samplers and two traditional ITSC methods. The modified networks are superior in all metrics. Results show that cost-sensitive networks successfully complete the ITSC tasks.
Keywords
Introduction
The class imbalance problem (CIP) and time series classification (TSC) are among the top challenges in data mining and machine learning [48]. They have attracted increasing research interest from different communities in recent years. However, most previous research focuses on CIP or TSC separately [19, 20]. In fact, imbalanced time series classification (ITSC) [32] problems are frequently encountered in a wide range of scenarios, such as behavior detection [24], medical treatment [1, 40], sleep monitoring [8], and industrial hazard surveillance [12, 21, 42]. ITSC is a special case of TSC in which the minority or positive class is underrepresented [31, 32]. Because the time series datasets are skewed, classifiers tend to be biased toward the less important majority or negative class [28]. It is therefore both challenging and crucial to recognize rare events (the positive class) accurately, whereas misclassifying a few negative instances is tolerable [36]. For example, in industrial surveillance, hazards form the positive class, which the monitoring system cannot afford to miss [21].
Previous research on CIPs has been conducted at two levels. The first level is data manipulation or data-sampling. A skewed training set is rebalanced by over-sampling or under-sampling, or by combining both techniques [6, 7, 22, 31, 32]. The second level is algorithmic modification [37, 51], where classifiers are modified to assign higher costs or class weights to misclassified minority (positive) samples. For both levels, conventional machine learning algorithms such as K-nearest neighbors combined with dynamic time warping (KNN-DTW) [47], support vector machine (SVM) [6, 22], and decision tree [31] are applied. However, some traditional algorithms need hand-crafted feature engineering to guarantee classification performance. This is time-consuming and impractical for large-volume datasets. Recently, deep convolutional neural networks (CNNs) have been explored on CIP and TSC tasks. A systematic investigation can be found in [5]. In contrast to feature-engineered approaches, a CNN can automatically capture the time-shift properties and invariant features of time series [11, 13, 23, 36, 44, 47, 50]. Nevertheless, very few attempts have been made to address ITSC problems with CNN-based algorithms.
In this paper, the CIP and TSC issues are addressed simultaneously with cost-sensitive deep learning algorithms. We propose an adaptive cost-sensitive learning strategy with a class-dependent cost matrix. Unlike previous efforts [37, 51], manually set costs are not used, which reduces the human workload. Based on the proposed method, the loss function and optimization process of the CNN are changed. In this way, the learning procedure incorporates class-dependent penalties, and the modified network becomes a cost-sensitive algorithm. Additionally, the cost matrix is updated according to the overall class distribution and the classifier’s local performance.
Contributions from this paper are as follows:
This work jointly explores CIP and TSC tasks with deep learning methods. We propose an adaptive cost-sensitive learning strategy to convert CNN-based networks into cost-sensitive algorithms. Cost-sensitive networks adaptively punish the misclassified positive samples. Penalties are automatically adjusted based on the overall class distribution and the CNN’s training performance. Our method has been extended to modify four CNN-based networks: CNN, FCN, LSTM-FCN and ResNet. The proposed method is compared to previous approaches on five public benchmark UCR time series datasets [10] and a real-life imbalanced time series dataset [4]. The cost-sensitive learning strategy is effective for solving ITSC issues. The modified networks are superior to previous ITSC approaches. The cost-sensitive ResNet performs best.
Brief reviews of related works are given in Section 2. Section 3 describes the proposed method. Section 4 presents the evaluation measures and experimental results. Sections 5 and 6 provide the discussion and conclusion.
Data manipulation approaches aim to change the class distribution during preprocessing. Re-establishing the training set in a random or synthetic way is independent of the underlying classification model. However, data-sampling techniques change the distribution of the raw data. In particular, over-sampling increases the computational load and can result in over-fitting, while under-sampling fails to retain all useful information [20].
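The two basic data-sampling operations can be illustrated with a minimal sketch. It assumes plain random duplication and random removal; the function names and the toy data are hypothetical and do not correspond to any specific sampler evaluated in this paper.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen samples of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out

def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        pool = [x for x, lab in zip(X, y) if lab == label]
        for x in rng.sample(pool, target):
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out

# Toy imbalanced set (illustrative values): four negative series, one positive series.
X = [[0.1, 0.2], [0.3, 0.1], [0.2, 0.2], [0.4, 0.5], [0.9, 0.8]]
y = [0, 0, 0, 0, 1]
X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
```

In this toy case over-sampling repeats the single positive series three times, which illustrates the over-fitting risk, while under-sampling discards three of the four negative series, which illustrates the information loss.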
Algorithmic-level approaches aim to turn class-insensitive models into minority-sensitive ones. Cost-sensitive learning is an effective algorithmic approach to CIP tasks. According to a predefined cost matrix or class weight, each misclassified sample is punished differently based on its class. Kukar and Kononenko [27] reported a fundamental cost-sensitive modification of a multilayered network. The loss function was modified with a fixed cost matrix to punish the misclassified examples. Zhou and Liu [51] empirically investigated a cost-sensitive neural network. The effects of sampling and threshold-moving on the training stages were also discussed. A predefined static cost matrix can convert neural networks into cost-sensitive classifiers. However, setting the cost matrix manually relies on professional judgment, which is not always practical in real-life applications.
Recently, some works have brought cost-sensitive deep neural networks into the CIP domain. Khan et al. [25] proposed a cost-sensitive CNN with automatic feature representation. They jointly optimized the parameters of the CNN and learnable cost weights to perform the cost-sensitive operations. Raj et al. [36] explored cost-sensitive CNNs with different loss functions. Wang et al. [43] defined the mean false error and the mean squared false error loss functions, which made networks more sensitive to the minority class. Buda et al. [5] investigated the impact of CIP on CNNs with three image datasets. It was found that the drawbacks of sampling may not affect CNN performance. However, it is uncertain whether this finding from image experiments holds for imbalanced time series datasets.
In contrast to feature-based TSC algorithms, a CNN can automatically capture the time-shift properties and invariant features of time series. It may seem that recurrent neural networks (such as long short-term memory, LSTM) match TSC tasks naturally, but previous research has shown that CNNs perform better [12]. Zheng et al. [49, 50] proposed a deep learning model named the multi-channels CNN for multivariate TSC. They automatically extracted temporal features with one-dimensional convolution layers. Soon after, Cui et al. [11] improved the temporal CNN by transforming the original time series into multiscale sets. Similarly, Wang et al. [44] presented an Earliness-Aware Deep Convolutional Network (EA-ConvNet) for early classification. They claimed that shapelets, a feature-based TSC approach, are a special case of the features learned by the EA-ConvNet. In addition, popular CNN-based networks have shown competitive performance on TSC tasks. Wang et al. [45] explored FCN and ResNet on TSC tasks. Karim et al. [23] connected LSTM with FCN to recognize hidden patterns in sequences.
CNN-based solutions can avoid heavy preprocessing and hand-crafted feature engineering, but they do not consider the ITSC case. Their performance, therefore, cannot be guaranteed on imbalanced time series datasets. Thus, this paper takes advantage of CNN-based models to carry out ITSC tasks in a cost-sensitive learning way.
Methods
Preliminaries
According to [11, 19], a time series can be defined as a time ordered real-values
Multi-channels temporal CNN structure.
Multi-channels temporal FCN structure.
Multi-channels temporal ResNet structure.
Multi-channels temporal LSTM-FCN structure.
In [49], the authors used one-dimensional convolutional layers to automatically extract features from raw multivariate time series data. The mechanism was applied to standard CNNs to perform TSC tasks. In this paper, a similar network was applied, which includes convolutional and pooling layers, rectified linear units (ReLU) [15] and dropout operations. One-dimensional filters operated as feature extractors, and a fully connected layer performed the final binary classification. The details are depicted in Fig. 1. The numbers in the grey shaded blocks stand for the number of convolution kernels, the dimension size and the sliding window size, respectively. The dashed line under 0.5 indicates a 50% dropout operation. Three other deep learning models were explored, which are depicted in Figs 2–4.
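To make the role of the one-dimensional filters concrete, the following minimal sketch applies a single convolution, ReLU and max-pooling stage to a univariate series. The kernel values, the series and the pooling size are hypothetical; in the networks described above the kernels are learned, and their number and size are those given in Fig. 1, which this sketch does not reproduce.

```python
def conv1d(series, kernel):
    """Valid 1-D convolution (cross-correlation form, as used in most deep learning libraries)."""
    k = len(kernel)
    return [sum(series[i + j] * kernel[j] for j in range(k))
            for i in range(len(series) - k + 1)]

def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]

def max_pool1d(values, size):
    """Non-overlapping max pooling with window and stride equal to `size`."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]

series = [0.0, 1.0, 3.0, 2.0, 0.0, -1.0, 0.0, 2.0]   # illustrative time series
kernel = [1.0, 0.0, -1.0]                             # hypothetical filter, not a learned one

activations = conv1d(series, kernel)
feature_map = max_pool1d(relu(activations), size=2)
```

With this filter, the feature map responds only around the falling segment of the series, wherever that segment occurs in time, which is the shift-tolerant behavior that motivates convolutional feature extraction for TSC.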
Cost-sensitive
Cost-sensitive learning treats the majority and minority classes with different costs or class weights. According to the minimum expected cost principle, the expected risk of cost-sensitive learning is formulated as Eq. (1).
where
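The minimum expected cost principle can be sketched as follows. The sketch assumes a convention in which `cost[i][j]` is the cost of predicting class `i` when the true class is `j`, and it assumes that posterior probabilities are available; both the convention and the numbers are illustrative assumptions rather than the notation of Eq. (1).

```python
def min_expected_cost_class(posteriors, cost):
    """Return (best_class, risks), where risks[i] is the expected cost of predicting class i."""
    risks = [sum(cost[i][j] * posteriors[j] for j in range(len(posteriors)))
             for i in range(len(cost))]
    best = min(range(len(risks)), key=risks.__getitem__)
    return best, risks

# Class 0 = negative (majority), class 1 = positive (minority); values are hypothetical.
cost = [[0.0, 10.0],   # predicting negative costs 10 when the sample is actually positive
        [1.0, 0.0]]    # predicting positive costs 1 when the sample is actually negative
posteriors = [0.8, 0.2]  # P(negative | x), P(positive | x)

best, risks = min_expected_cost_class(posteriors, cost)

# The same posteriors with equal misclassification costs, for comparison.
best_equal, _ = min_expected_cost_class(posteriors, [[0.0, 1.0], [1.0, 0.0]])
```

Under these assumed costs the sample is assigned to the positive class even though its posterior is only 0.2, whereas equal costs would assign it to the negative class.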
Unfortunately, it is virtually impossible to directly calculate the accurate posterior probability. Thus, previous works [17] replaced Eq. (1) with empirical risk.
where
Directly applying the overall imbalance ratio as the penalty may alleviate the CIP from a global view. However, in this manner the feedback from the classifier’s performance is ignored. Moreover, inserting a static cost or penalty into a CNN is not invariably beneficial for imbalanced learning. The imbalanced distribution of local areas, such as batch-wise training sets, should also be considered. Thus, our method used a dynamic cost matrix which can be updated adaptively. It is based on the imbalanced distributions of not only the whole training set but also the local batch-wise subsets. The proposed cost matrix
where
In this paper, cross entropy was applied as the loss function of the convolutional neural networks. The cross entropy loss of
where
Therefore, the overall error of cost-sensitive CNN can be expressed as Eq. (8).
From Eqs (7) and (8), the overall error can be formulated as follows.
To optimize the parameters of the CNN, the proposed cost-sensitive error was minimized as in Eq. (10). The back-propagation algorithm was used to update the parameters of the CNN with the learning rate and the calculated gradient after each learning step (Eqs (11) and (12)). It should be noted that the proposed cost matrix was updated after each training batch (Eqs (4) and (5)).
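A common way to combine a class-dependent cost with cross entropy is to scale each sample's loss by the cost of its true class. The sketch below shows this generic form for the binary case; it is an assumption about the general shape of such a loss, not a transcription of Eqs (7)–(9), and the function name, cost values and probabilities are hypothetical.

```python
import math

def cost_sensitive_cross_entropy(y_true, p_pos, cost_pos, cost_neg=1.0, eps=1e-12):
    """Mean binary cross entropy with each sample's loss scaled by the cost of its true class."""
    total = 0.0
    for y, p in zip(y_true, p_pos):
        p = min(max(p, eps), 1.0 - eps)          # clip to avoid log(0)
        loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        total += (cost_pos if y == 1 else cost_neg) * loss
    return total / len(y_true)

# One positive and three negative samples; p_pos is the predicted positive-class probability.
y_true = [1, 0, 0, 0]
p_pos = [0.3, 0.2, 0.1, 0.4]

plain = cost_sensitive_cross_entropy(y_true, p_pos, cost_pos=1.0)
weighted = cost_sensitive_cross_entropy(y_true, p_pos, cost_pos=5.0)
```

Raising the positive-class cost from 1 to 5 increases the loss only through the poorly predicted positive sample, so its gradient contribution grows relative to the three negative samples.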
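The effect of class-dependent costs on a gradient step can be sketched on a much simpler model. The example below assumes a single-layer logistic model in place of the CNN and a cost-weighted cross entropy; the function names, costs and data are hypothetical, and only the default learning rate of 0.001 is taken from the experimental setup.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_step(w, b, batch, costs, lr=0.001):
    """One gradient-descent step on a cost-weighted logistic loss.

    batch: list of (features, label) pairs; costs: dict mapping label -> class cost.
    """
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for x, y in batch:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        g = costs[y] * (p - y)        # gradient of the weighted cross entropy w.r.t. the logit
        for k, xk in enumerate(x):
            grad_w[k] += g * xk
        grad_b += g
    n = len(batch)
    new_w = [wi - lr * gw / n for wi, gw in zip(w, grad_w)]
    new_b = b - lr * grad_b / n
    return new_w, new_b

# Illustrative batch: one positive and one negative sample, positive cost four times larger.
batch = [([1.0, 2.0], 1), ([1.0, 0.0], 0)]
w, b = sgd_step([0.0, 0.0], 0.0, batch, costs={1: 4.0, 0: 1.0}, lr=0.1)
```

Because the positive sample carries the larger cost, the update is dominated by its gradient and moves the parameters toward predicting the positive class. In the adaptive setting described above, the `costs` argument would be recomputed after every batch.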
where
Based on Eqs (4) and (5), a new cost matrix for separating the positive class and negative class samples was defined. If the predicted class belongs to majority, the relevant cost matrix element
The local metrics in Eq. (5) are updated after each batch. Consequently, the cost matrix changes in an adaptive manner. Because the aim is imbalanced classification, a combination of the local geometric mean and accuracy was used. The use of local evaluation metrics is inspired by bagging sampling methods. The input batches are treated as random samplers: samples are taken from the training set and fed into the CNN randomly. Each batch contains at least one positive class sample and several negative ones. Training the CNN with a class-dependent cost matrix favors its cost-sensitive optimization.
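The idea of a batch-wise cost update can be sketched as follows. The combination rule used here (the global imbalance ratio scaled by a factor that grows as the batch-level score falls) is a purely hypothetical stand-in chosen for illustration; it is not the rule defined by Eqs (4) and (5), and all function names are assumptions.

```python
import math

def local_score(y_true, y_pred):
    """Average of batch-wise geometric mean and accuracy (the equal weighting is an assumption)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    g_mean = math.sqrt(tpr * tnr)
    accuracy = (tp + tn) / len(y_true)
    return 0.5 * (g_mean + accuracy)

def update_positive_cost(n_neg, n_pos, y_true, y_pred):
    """Hypothetical update: global imbalance ratio, increased when the batch was classified poorly."""
    ratio = n_neg / n_pos
    return ratio * (2.0 - local_score(y_true, y_pred))

# Training set with 90 negative and 10 positive samples; one batch with a single positive sample.
y_true = [1, 0, 0, 0]
cost_missed = update_positive_cost(90, 10, y_true, y_pred=[0, 0, 0, 0])   # positive missed
cost_perfect = update_positive_cost(90, 10, y_true, y_pred=[1, 0, 0, 0])  # all correct
```

When the only positive sample in the batch is missed, the batch accuracy is still 0.75 but the geometric mean drops to zero, so the positive-class cost for the next batch rises above the global imbalance ratio; a perfectly classified batch leaves the cost at that ratio.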
Binary classification confusion matrix
The parameters of the CNN are updated based on the class distribution. In other words, the proposed strategy turns the CNN from a cost-insensitive model into a cost-sensitive one that learns effectively under class imbalance. The cost-sensitive approach is summarized in Algorithm 1.
Optimization for parameter
Performance evaluation metrics for CIP
The selection of evaluation metrics affects the objectivity and fairness of the final assessment. Most classifiers are empirically assessed by the overall accuracy rate. However, since it cannot reflect how many positive samples are misclassified, accuracy alone is not an appropriate metric for CIP. Therefore, it is necessary to complement the overall accuracy with other effective metrics.
In Table 1 (confusion matrix [38]), TP and TN are the correctly classified positive and negative samples, respectively. FP and FN are the negative samples misclassified as positive and the positive samples misclassified as negative, respectively.
True positive rate (TPR), also called recall or sensitivity, reflects the proportion of positive samples that are correctly classified.
True negative rate (TNR), also called specificity, reflects the proportion of negative samples that are correctly classified.
Positive predictive value (PPV), also called precision, reflects the proportion of samples predicted as positive that are truly positive.
Dominance [14] can evaluate the relationship between TPR and TNR.
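The metrics above can be computed directly from the four confusion matrix counts. The sketch below uses the standard definitions and takes dominance as TPR minus TNR, which is the usual form of that index; the function name and the example counts are illustrative and are not results from this paper.

```python
import math

def imbalance_metrics(tp, tn, fp, fn):
    """Confusion-matrix metrics for binary imbalanced classification."""
    tpr = tp / (tp + fn)                          # recall / sensitivity
    tnr = tn / (tn + fp)                          # specificity
    ppv = tp / (tp + fp) if tp + fp else 0.0      # precision
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "TPR": tpr,
        "TNR": tnr,
        "PPV": ppv,
        "G-mean": math.sqrt(tpr * tnr),
        "dominance": tpr - tnr,
    }

# Hypothetical outcome: 10 positive and 90 negative samples, half of the positives missed.
m = imbalance_metrics(tp=5, tn=90, fp=0, fn=5)
```

In this example the accuracy is 0.95 even though only half of the positive samples are detected, which is why accuracy alone is uninformative under class imbalance; the TPR of 0.5 and the negative dominance expose the bias toward the majority class.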
The receiver operating characteristic curve (ROC curve) plots TPR against the false positive rate (FPR), while the precision-recall curve (PR curve) plots precision against recall. Usually, both are summarized by the area under the curve (AUC). ROCAUC and PRAUC are applied in this paper.
The proposed method was tested on six datasets, which are shown in Table 2. Of the six time series datasets, FaceAll, Swedish Leaf (S-Leaf) and Adiac were multiclass and were converted into binary problems by selecting one class as the positive class and treating the others as negative (as in [6, 7, 31]). Five of the datasets were from the public UCR time series repository [10], and the sixth was a real-life time series dataset from the AAIA16 Data Mining Challenge (Predicting Dangerous Seismic Events in Active Coal Mines [4]). The organizer Knowledge Pit1
A Polish data challenge platform:
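The area under the ROC curve has an equivalent rank-based reading: it is the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it in that form; the function name and the scores are illustrative, and the pairwise loop is meant for clarity rather than efficiency.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]   # hypothetical positive-class scores
auc = roc_auc(y_true, scores)
```

Three of the four positive-negative pairs are ordered correctly here, so the value is 0.75. Because the measure depends only on the ranking of scores, it does not change when the decision threshold is moved.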
Experimental datasets
First, the cost-sensitive CNN (CS-CNN, see Fig. 1) was tested against several data-samplers and two existing ITSC methods Integrated Oversampling with SVM (INOS
Except for normalization and one-hot encoding on the seismic dataset, the proposed method did not involve any manual feature engineering. All hidden layers used rectified linear unit (ReLU) activations and the final output layers were sigmoid units. All networks were trained with Adam [45], in which the learning rate is 0.001,
Data-sampling methods
Seventeen data samplers were selected for use with standard CNN classifiers on five UCR datasets [10]. The performance of two existing ITSC methods (INOS
The proposed CS-CNN performed better than all data-sampling methods. It achieved satisfactory
Comparison of the performance of cost-sensitive strategy, data-sampling and ITSC methods on evaluation metric
on UCR datasets
Comparison of the performance of cost-sensitive strategy, data-sampling and ITSC methods on evaluation metric
Comparison of the performance of cost-sensitive strategy, data-sampling and ITSC methods on evaluation metric
For further validation, we extended our proposed cost-sensitive strategy to four convolutional neural networks, as mentioned in Section 3.2. CS-CNN, CS-FCN, CS-LSTM-FCN and CS-ResNet were compared with INOS
The results in Table 6 and Fig. 5 show that the four cost-sensitive networks performed better than the ITSC methods. The CS-ResNet came out ahead, with significantly better results. CS-CNN, CS-LSTM-FCN and CS-FCN were ranked second to fourth. These results show that the proposed method can carry out the ITSC tasks. NOS
Comparison of the performance of cost-sensitive strategy and ITSC methods on real-life datasets
Comparison of the performance of cost-sensitive strategy and ITSC methods on real-life datasets
Comparison of the performance of cost-sensitive strategy and ITSC methods.
Deep learning for TSC
In [49], the multi-channels deep CNN (MC-DCNN) was used for TSC tasks, and it was also adopted in this paper. The temporal CNN was applied with different parameters, and other popular CNN-based networks were explored [23, 45]. In contrast to those works, our goal is to apply deep learning algorithms to ITSC tasks. To avoid under-representing the minority time series samples, we proposed an adaptive cost-sensitive learning strategy to improve the above CNN-based networks, which were tested on six time series datasets.
Data sampling and ITSC
Data-sampling is the most direct approach to class imbalance, but it changes the original distribution of the raw dataset. Data-sampling also has drawbacks such as over-fitting, discarding useful information and high time consumption. Several data samplers were combined with the temporal CNN and investigated. As claimed in [5], over-sampling does not necessarily lead to over-fitting of CNNs. However, the best sampler could not be determined definitively in this paper (see Tables 4–6).
Previous ITSC methods such as over-sampling with SVM [6] and combined sampling with bagging [31] were tested in this paper and showed outstanding performance on univariate time series datasets. Unfortunately, these methods did not work on large-volume multivariate time series. Our method, however, avoided the drawbacks of data-sampling and was able to cope with large-volume, high-dimensional, imbalanced time series datasets.
Cost-sensitive strategy
This paper proposed a cost-sensitive method to improve CNN-based networks. Similar work applying CNNs to imbalanced classification was reported in [25, 36]. However, the approach described above is novel. Previous research only considered the overall imbalance ratio of the training set, whereas in this paper the local performance was also taken into account when updating the designed cost matrix.
Limitations and future work
The limitations of this work include a lack of multi-class considerations. Only binary ITSC cases were analyzed. In [34], the authors segmented the time series into bins using time stamp information and converted time series regression tasks into ITSC tasks. Random under-sampling was applied in combination with SMOTE over-sampling. This idea inspires us to address time series regression issues in future work.
Conclusion
In this paper, we proposed an adaptive cost-sensitive learning strategy to address the ITSC issue. Convolutional neural networks were converted into cost-sensitive algorithms. Tested on five public UCR datasets, CS-CNN achieved convincing performance. It also outperformed a series of existing data samplers, as well as a traditional ITSC method. The approach was extended to three other CNN-based networks and compared to the two ITSC methods on a real-life large-volume dataset. The cost-sensitive networks were superior to the other methods, and the CS-ResNet performed best. The modified networks can address large-volume ITSC tasks effectively.
