Cost-sensitive convolutional neural networks for imbalanced time series classification

Abstract

Time series classification and class imbalance are two common problems in a multitude of real-life scenarios. This paper explores both simultaneously with deep convolutional neural networks (CNNs). Because standard networks treat the majority and minority classes with the same class weights, most CNN-based networks fail to classify imbalanced time series. Until recently, very little work has applied deep learning to imbalanced time series classification (ITSC). We therefore propose an adaptive cost-sensitive learning strategy to address the ITSC problem. The standard CNN is modified into a cost-sensitive network (CS-CNN), which punishes misclassified samples using a class-dependent cost matrix. Moreover, this cost matrix is automatically updated based on the overall class distribution and the CS-CNN’s training performance. The proposed method is extended to FCN, LSTM-FCN and ResNet. It is experimentally tested on five public benchmark UCR datasets and a real-life large volume dataset. Four cost-sensitive CNN-based networks are compared with several data samplers and two traditional ITSC methods. On the UCR datasets, CS-CNN outperforms all data samplers and one of the two ITSC methods; on the real-life dataset, the cost-sensitive networks achieve the best average ranks. Results show that cost-sensitive networks successfully complete the ITSC tasks.

Keywords

Convolutional neural networks, class imbalance problems, cost-sensitive learning, imbalanced time series classification

1. Introduction

The class imbalance problem (CIP) and time series classification (TSC) are two of the top challenges in data mining and machine learning [48]. They have attracted increasing research interest from different communities in recent years. However, most previous research focuses on either CIP or TSC separately [19, 20]. In fact, imbalanced time series classification (ITSC) [32] problems are frequently encountered in widespread scenarios, such as behavior detection [24], medical treatments [1, 40], sleep monitoring [8], and industrial hazards surveillance [12, 21, 42]. ITSC is a special case of TSC, where the minority or positive class is underrepresented [31, 32]. Because the time series datasets are skewed, classifiers tend to be biased toward the less important majority or negative class [28]. Therefore, it is both challenging and crucial to recognize rare events (the positive class) accurately, whereas misclassification of a few negative instances is tolerable [36]. As an example, in industrial surveillance, hazards form the positive class and cannot be neglected by the monitoring system [21].

Previous research on CIPs has been conducted at two levels. The first level is data manipulation or data sampling. A skewed training set is rebalanced by over-sampling or under-sampling, or by combining both techniques [6, 7, 22, 31, 32]. Algorithmic modification is the second level [37, 51], where classifiers are modified with higher costs or class weights for false positive samples. For both levels, conventional machine learning algorithms such as K-nearest neighbors combined with dynamic time warping (KNN-DTW) [47], support vector machines (SVM) [6, 22] and decision trees [31] are applied. However, some traditional algorithms need hand-crafted feature engineering to guarantee classification performance. This is time-consuming and impractical for large volume datasets. Recently, deep convolutional neural networks (CNNs) have been explored on CIP and TSC tasks. A systematic investigation can be found in [5]. Moreover, a CNN can automatically capture the time-shift properties and invariant features from time series [11, 13, 23, 36, 44, 47, 50]. Nevertheless, very few attempts have been made at addressing ITSC problems with CNN-based algorithms.

In this paper, the CIP and TSC issues are addressed simultaneously with cost-sensitive deep learning algorithms. We propose an adaptive cost-sensitive learning strategy with a class-dependent cost matrix. Unlike previous efforts [37, 51], manually set costs are not used, which reduces the human workload. Based on the proposed method, the loss function and optimization process of the CNN are changed. By doing this, the learning procedure is improved to incorporate class-dependent penalties, and the modified network becomes a cost-sensitive algorithm. Additionally, the cost matrix is updated according to the overall class distribution and the classifier’s local performance.

Contributions from this paper are as follows:

1. This work jointly explores CIP and TSC tasks with deep learning methods.
2. We propose an adaptive cost-sensitive learning strategy to convert CNN-based networks into cost-sensitive algorithms. Cost-sensitive networks adaptively punish the misclassified positive samples. Penalties are automatically adjusted based on the overall class distribution and the CNN’s training performance.
3. Our method has been extended to modify four CNN-based networks: CNN, FCN, LSTM-FCN and ResNet.
4. The proposed method is compared to previous approaches on five public benchmark UCR time series datasets [10] and a real-life imbalanced time series dataset [4]. The cost-sensitive learning strategy is effective for solving ITSC issues. The modified networks are superior to previous ITSC approaches. The cost-sensitive ResNet performs best.

Related works are briefly reviewed in Section 2. Section 3 describes the proposed method. Section 4 presents the evaluation measures and experimental results. Sections 5 and 6 provide the discussion and conclusion.

2. Related works

Data manipulation approaches aim to change the class distribution within preprocessing steps. Re-establishing the training set in a random or synthetic way is independent of the underlying classification models. However, data sampling techniques change the distribution of the raw data. In particular, over-sampling increases the computational load and can result in over-fitting, while under-sampling fails to retain all useful information [20].

Algorithmic level approaches aim to turn class-insensitive models into minority-sensitive ones. Cost-sensitive learning is an effective algorithmic approach to CIP tasks. According to a predefined cost matrix or class weight, each misclassified sample is punished differently based on its class. Kukar and Kononenko [27] reported a fundamental cost-sensitive modification of a multilayered network, in which the loss function was modified with a fixed cost matrix to punish the misclassified examples. Zhou and Liu [51] empirically investigated a cost-sensitive neural network. The effects of sampling and threshold-moving on training stages were also discussed. A predefined static cost matrix can convert neural networks into cost-sensitive classifiers. However, setting the cost matrix manually relies on professional judgment, which is not always practical in real-life applications.

Recently, some works have brought cost-sensitive deep neural networks into the CIP domain. Khan et al. [25] proposed a cost-sensitive CNN with automatic feature representation. They jointly optimized the parameters of CNNs and learnable cost weights to perform the cost-sensitive operations. Raj et al. [36] explored cost-sensitive CNNs with different loss functions. Wang et al. [43] defined the mean false error and the mean squared false error loss functions, which made networks more sensitive to the minority class. Buda et al. [5] investigated the impact of CIP on CNNs with three image datasets. It was found that the drawbacks of sampling may not affect CNN performance. However, it is uncertain whether this finding from image experiments holds for imbalanced time series datasets.

In contrast to feature-based TSC algorithms, a CNN can automatically capture the time-shift properties and invariant features from time series. It may seem that recurrent neural networks (like long short-term memory, LSTM) match TSC tasks naturally, but previous research has shown that CNNs perform better [12]. Zheng et al. [49, 50] proposed a novel deep learning model named the multi-channels CNN for multivariate TSC. They automatically extracted temporal features with one-dimensional convolution layers. Soon after, Cui et al. [11] improved the temporal CNN by transforming the original time series into multiscale sets. Similarly, Wang et al. [44] presented an Earliness-Aware Deep Convolutional Network (EA-ConvNet) for early classification. They claimed that shapelets, a feature-based TSC approach, are a special case of the features learned by the EA-ConvNet. Besides, popular CNN-based networks have shown competitive performance on TSC tasks. Wang et al. [45] explored FCN and ResNet on TSC tasks. Karim et al. [23] connected LSTM with FCN to recognize hidden patterns in sequences.

CNN-based solutions can avoid heavy preprocessing and hand-crafted feature engineering, but they do not consider the ITSC case. Their performance, therefore, cannot be guaranteed on imbalanced time series datasets. Thus, this paper takes advantage of CNN-based models to carry out ITSC tasks in a cost-sensitive learning way.

3. Methods

3.1 Preliminaries

According to [11, 19], a time series can be defined as a time-ordered sequence of real values ${T}=\left\{{t_{1},t_{2},\ldots,t_{l}}\right\}$ , where $l$ is the length of ${T}$ . A multivariate time series is composed of $L$ univariate series, as ${\bm{S}}=\left({T_{1},T_{2},\ldots,T_{L}}\right)$ . A labeled time series dataset can be defined as ${\bm{D}}=\left\{{\left({S_{i},y_{i}}\right)}\right\}_{i=1}^{n}$ , where $y_{i}$ is the label of the $i^{th}$ of the $n$ temporal samples. In this paper, the focus is on binary ITSC problems. Therefore, the relevant labels are denoted as ${y}_{i}\in\left\{{0,1}\right\}$ .

Figure 1. Multi-channels temporal CNN structure.

Figure 2. Multi-channels temporal FCN structure.

Figure 3. Multi-channels temporal ResNet structure.

Figure 4. Multi-channels temporal LSTM-FCN structure.

3.2 Temporal convolutional neural networks

In [49], the authors used one-dimensional convolutional layers to automatically extract features from raw multivariate time series data. The mechanism was applied to standard CNNs to perform TSC tasks. In this paper, a similar network was applied, which includes convolutional and pooling layers, rectified linear units (ReLU) [15] and dropout operations. One-dimensional filters operated as feature extractors. In addition, a fully connected layer performed the final binary classification. The details are depicted in Fig. 1. The numbers in the grey shaded blocks stand for the number of convolution kernels, the dimension size and the sliding window size, respectively. The dashed line under 0.5 indicates a 50% dropout operation. Three other deep learning models were explored, which are depicted in Figs 2–4.

3.3 Cost-sensitive

Cost-sensitive learning treats the majority and minority classes with different costs or class weights. According to the minimum expected cost principle, the expected risk of cost-sensitive learning is formulated as Eq. (1).

$\displaystyle R\left({i|{\bm{S}}}\right)=\mathop{\sum}\limits_{j}P\left({j|{\bm{S}}}\right)C\left({j,i}\right),$ (1)

where $R\left({i|{\bm{S}}}\right)$ is the expected risk of categorizing the given input ${\bm{S}}$ into class $i$ . $P\left({j|{\bm{S}}}\right)$ denotes the posterior probability that the given input belongs to class $j$ . $C\left({j,i}\right)$ is the misclassification cost incurred when an input of class $j$ is recognized as class $i$ .
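To make Eq. (1) concrete, the following minimal sketch evaluates the expected risk of every candidate class and picks the class with the smallest risk. It assumes that `cost[j, i]` holds the cost of predicting class $i$ when the true class is $j$ (the reading under which Eq. (1) sums over true classes $j$). The function names and the numerical values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def expected_risk(posterior, cost):
    """Return R(i|S) = sum_j P(j|S) * C(j, i) for every candidate class i.

    posterior: shape (K,), P(j|S) for each possible true class j.
    cost:      shape (K, K), cost[j, i] = cost of predicting i when the truth is j
               (assumed indexing convention).
    """
    posterior = np.asarray(posterior, dtype=float)
    cost = np.asarray(cost, dtype=float)
    return posterior @ cost

def min_risk_class(posterior, cost):
    """Minimum expected cost principle: choose the class with the lowest risk."""
    return int(np.argmin(expected_risk(posterior, cost)))

# Hypothetical example: class 0 = negative, class 1 = positive.
# Missing a positive (true 1, predicted 0) costs 10; a false alarm costs 1.
posterior = [0.8, 0.2]
cost = [[0.0, 1.0],
        [10.0, 0.0]]
risks = expected_risk(posterior, cost)   # risk of predicting 0, risk of predicting 1
decision = min_risk_class(posterior, cost)
```

With these hypothetical costs, the positive class is chosen even though its posterior is only 0.2, which is the behavior cost-sensitive learning aims for on rare events.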

Unfortunately, it is virtually impossible to directly calculate the accurate posterior probability. Thus, previous works [17] replaced Eq. (1) with empirical risk.

$\displaystyle\hat{R}_{l}=E_{{\bm{S}},{\bm{Y}}}[l]=\frac{1}{n}\sum_{i=1}^{n}l({\bm{C}},y^{(i)},o^{(i)}),$ (2)

$\displaystyle C_{p,q}=\left\{{\begin{array}{ll}1,&p=q\\ R,&p\neq q\end{array}}\right.,$ (3)

where $\hat{R}_{l}$ is the empirical risk, $E$ denotes the expectation over $({\bm{S}},{\bm{Y}})$ , and $n$ is the overall number of samples. ${\bm{C}}$ is the cost matrix. The penalty takes a constant value $R$ when the predicted class $q$ mismatches the actual class $p$ . $o^{\left(i\right)}$ and $y^{\left(i\right)}$ are the $i^{th}$ predicted output and desired output, respectively. $l$ represents the loss function.

3.4 The proposed cost-sensitive learning strategy

Directly applying the overall imbalance ratio as the penalty may alleviate the CIP from a global view, but in this manner the feedback of the classifier’s performance is ignored. What is more, inserting a static cost or penalty into a CNN is not invariably beneficial for imbalanced learning. The imbalanced distribution of local areas, such as batch-wise training sets, should also be considered. Thus, our method uses a dynamic cost matrix which is updated adaptively. It is based on the imbalanced distributions of not only the whole training set but also local batch-wise subsets. The proposed cost matrix ${\bm{C}}^{\prime}$ is defined by Eqs (4) and (5).

$\displaystyle{\bm{C}}^{\prime}_{p}=\left[{\lambda_{1,p},\ldots,\lambda_{i,p},\ldots,\lambda_{{B},p}}\right],$ (4)

$\displaystyle\lambda_{i,p}=\left\{{\begin{array}{ll}\text{IR}^{\text{overall}}\times\exp\left({-\frac{G_{\text{mean}}^{\text{batch}}}{2}}\right)\times\exp\left({-\frac{\text{ACC}^{\text{batch}}}{2}}\right),&\text{if }p\in\text{Pos}\\ 1,&\text{if }p\in\text{Neg}\end{array}}\right.,$ (5)

where $p$ is the predicted class and ${B}$ denotes the batch size. $\text{IR}^{\text{overall}}$ is the overall imbalance ratio. $G_{\text{mean}}^{\text{batch}}$ and $\text{ACC}^{\text{batch}}$ are the geometric mean and the accuracy on the current training batch, respectively.
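The penalty of Eq. (5) can be sketched in a few lines. The helper names below are hypothetical, and the encoding of the positive class as label 1 is an assumption for illustration; only the formula itself comes from Eq. (5).

```python
import math

def positive_class_cost(ir_overall, g_mean_batch, acc_batch):
    """Eq. (5), positive branch: IR * exp(-G_mean/2) * exp(-ACC/2)."""
    return ir_overall * math.exp(-g_mean_batch / 2.0) * math.exp(-acc_batch / 2.0)

def batch_cost_vector(predicted_labels, ir_overall, g_mean_batch, acc_batch):
    """Eq. (4): one cost per sample in the batch, indexed by its predicted class.

    Samples predicted as positive (label 1, an assumed encoding) receive the
    adaptive penalty; samples predicted as negative keep a cost of 1.
    """
    lam = positive_class_cost(ir_overall, g_mean_batch, acc_batch)
    return [lam if p == 1 else 1.0 for p in predicted_labels]

# Illustrative values: IR = 13.0 with a poorly performing batch versus a perfect batch.
cost_poor = positive_class_cost(13.0, g_mean_batch=0.0, acc_batch=0.0)
cost_perfect = positive_class_cost(13.0, g_mean_batch=1.0, acc_batch=1.0)
```

The penalty equals the overall imbalance ratio when both batch metrics are zero and shrinks by a factor of $e^{-1}$ when both reach one, so a batch on which the classifier already performs well is penalized less.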

In this paper, cross entropy was applied as the loss function of the convolutional neural networks. The cross entropy loss of the $i^{th}$ training instance can be expressed by Eqs (6) and (7).

$\displaystyle l\left({{\bm{\theta}},C^{\prime}_{i,p}}\right)={y}_{i}\times\left[-\ln\left({{o}_{i}\left({{\bm{\theta}},C^{\prime}_{i,p}}\right)}\right)\right]+\left({1-{y}_{i}}\right)\times\left[-\ln\left({1-{o}_{i}\left({{\bm{\theta}},C^{\prime}_{i,p}}\right)}\right)\right],$ (6)

$\displaystyle{o}_{i}\left({{\bm{\theta}},C^{\prime}_{i,p}}\right)=\frac{1}{1+\exp\left({-{\bm{\theta x}}_{i}C^{\prime}_{i,p}}\right)},$ (7)

where ${\bm{\theta}}$ denotes the weight parameters of the applied classifier (such as a CNN), $C^{\prime}_{i,p}$ is the proposed misclassification cost, and ${\bm{x}}_{i}$ denotes the input of the sigmoid layer.

Therefore, the overall error of cost-sensitive CNN can be expressed as Eq. (8).

$\displaystyle E\left({{\bm{\theta}},C^{\prime}_{i,p}}\right)=-\frac{1}{n}\mathop{\sum}\limits_{i=1}^{n}\left[{y}_{i}\log{o}_{i}\left({{\bm{\theta}},C^{\prime}_{i,p}}\right)+\left({1-{y}_{i}}\right)\log\left({1-{o}_{i}\left({{\bm{\theta}},C^{\prime}_{i,p}}\right)}\right)\right],$ (8)

From Eqs (7) and (8), the overall error can be formulated as follows.

$\displaystyle E\left({{\bm{\theta}},C^{\prime}_{i,p}}\right)=-\frac{1}{n}\mathop{\sum}\limits_{i=1}^{n}\left[{y}_{i}\log\frac{1}{1+\exp\left({-{\bm{\theta x}}_{i}C^{\prime}_{i,p}}\right)}+\left({1-{y}_{i}}\right)\log\left({1-\frac{1}{1+\exp\left({-{\bm{\theta x}}_{i}C^{\prime}_{i,p}}\right)}}\right)\right]=-\frac{1}{n}\mathop{\sum}\limits_{i=1}^{n}\left[{{y}_{i}{\bm{\theta x}}_{i}-\log\left({1+\frac{1}{\exp\left({{\bm{\theta x}}_{i}C^{\prime}_{i,p}}\right)}}\right)}\right],$ (9)

To optimize the parameters of the CNN, the proposed cost-sensitive error was minimized as in Eq. (10). The back propagation algorithm was used to update the parameters of the CNN with the learning rate and the calculated gradient after each learning step (Eqs (11) and (12)). It should be noted that the proposed cost matrix was updated after each training batch (Eqs (4) and (5)).

$\displaystyle{\bm{\theta}}^{*}=\text{argmin}\,E\left({\bm{\theta}}\right),$ (10)

$\displaystyle{\bm{\theta}}_{j+1}={\bm{\theta}}_{j}-\eta\nabla E\left({{\bm{\theta}}_{j}}\right),$ (11)

$\displaystyle\nabla E\left({{\bm{\theta}}_{j}}\right)=\frac{\partial}{\partial{\bm{\theta}}_{j}}\left\{{-\frac{1}{n}\mathop{\sum}\limits_{i=1}^{n}\left[{{y}_{i}{\bm{\theta x}}_{i}-\log\left({1+\frac{1}{\exp\left({{\bm{\theta x}}_{i}C^{\prime}_{i,p}}\right)}}\right)}\right]}\right\}=\frac{1}{{n}}\mathop{\sum}\limits_{i=1}^{n}C^{\prime}_{i,p}{\bm{x}}_{i}\left({{o}_{i}-{y}_{i}}\right),$ (12)

where $\eta$ is the learning rate and ${\bm{\theta}}_{j}$ denotes the current parameters of the CNN.
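The closed-form gradient on the right-hand side of Eq. (12) can be checked numerically. The sketch below is an illustration under stated assumptions: a single linear layer stands in for the network, the per-sample costs are treated as constants with respect to ${\bm{\theta}}$, and all function names and test values are hypothetical. It compares the analytic gradient of the cross entropy in Eq. (8), with outputs given by Eq. (7), against central finite differences.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cs_loss(theta, X, y, c):
    """Eq. (8) with outputs from Eq. (7): o_i = sigmoid(c_i * (x_i . theta))."""
    o = sigmoid(c * (X @ theta))
    return -np.mean(y * np.log(o) + (1 - y) * np.log(1 - o))

def cs_grad(theta, X, y, c):
    """Right-hand side of Eq. (12): (1/n) * sum_i c_i * x_i * (o_i - y_i)."""
    o = sigmoid(c * (X @ theta))
    return X.T @ (c * (o - y)) / len(y)

def numeric_grad(f, theta, eps=1e-6):
    """Central finite differences, used only to verify the analytic gradient."""
    g = np.zeros_like(theta)
    for k in range(theta.size):
        step = np.zeros_like(theta)
        step[k] = eps
        g[k] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))                  # hypothetical sigmoid-layer inputs
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])       # imbalanced toy labels
c = np.where(y == 1, 4.0, 1.0)               # hypothetical per-sample costs
theta = 0.1 * rng.normal(size=3)

analytic = cs_grad(theta, X, y, c)
numeric = numeric_grad(lambda t: cs_loss(t, X, y, c), theta)
```

The two gradients agree to numerical precision, which shows how a cost larger than one scales up the update contributed by the corresponding sample.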

Based on Eqs (4) and (5), a new cost matrix for separating the positive class and negative class samples was defined. If the predicted class belongs to the majority, the relevant cost matrix element ${\lambda}$ is set to one, so the loss of the CNN is the same as in its normal version. Conversely, when the output label is predicted as the minority class, the penalty is calculated by Eq. (5). The parameters of the CNN are then updated with Eqs (11) and (12). When the penalty satisfies ${\lambda}^{\text{Pos}}>1$ , ${\bm{\theta}}_{j+1}$ changes faster, with a higher gradient than in the negative cases. Namely, the CNN is trained with more emphasis placed on positive class samples.

The local metrics in Eq. (5) are updated after each batch. Consequently, the cost matrix is changed in an adaptive manner. Aiming at imbalance classification, a combination of local geometric mean and accuracy was used. The motivation for introducing local evaluation metrics is inspired by the bagging sampling methods. The input batches are treated as random samplers. Samples are taken from the training set and fed into the CNN randomly. Each batch contains at least one positive class sample and several negative ones. Training CNN with a class-dependent cost matrix is in favor of its cost-sensitive optimization.

Table 1

Binary classification confusion matrix

	Actual Negative	Actual Positive
Predicted Negative	True Negative (TN)	False Negative (FN)
Predicted Positive	False Positive (FP)	True Positive (TP)

The parameters of the CNN are updated based on the class distribution. In other words, the proposed strategy turns the CNN from a cost-insensitive learner into a cost-sensitive one that learns effectively under class imbalance. The cost-sensitive approach is summarized in Algorithm 1.

Algorithm 1. Optimization for parameter $\theta$ of cost-sensitive CNN

Input: Imbalanced time-series set, Maximum epoch: M, Batch Size: B, learning rate: $\eta$
Output: $\theta^{*}$
Randomly initialize weight $\theta$
Calculate the overall imbalanced ratio (IR)
Assign batches from training set randomly
for [1, M] do
    for [1, B] do
        Forward passing
        Calculate cost matrix as Eqs (4) and (5)
        Calculate loss as Eq. (9)
        Calculate gradient of $\theta$ as Eq. (12)
        Update $\theta$
return $\theta^{*}$
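The loop of Algorithm 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: a single linear layer with a sigmoid output stands in for the CNN, plain gradient descent (Eq. (11)) replaces the optimizer, the imbalance ratio is taken as the number of negatives divided by the number of positives, and every function name and default value is a hypothetical choice. Unlike the setup described above, the sketch does not force each batch to contain a positive sample.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_metrics(y_true, y_pred):
    """Return (G_mean, ACC) of one batch; a rate is 0 if its class is absent."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    pos = np.sum(y_true == 1)
    neg = np.sum(y_true == 0)
    tpr = tp / pos if pos else 0.0
    tnr = tn / neg if neg else 0.0
    return np.sqrt(tpr * tnr), (tp + tn) / len(y_true)

def train_cost_sensitive(X, y, epochs=20, batch_size=32, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    theta = rng.normal(scale=0.01, size=X.shape[1])   # random initialization
    ir = np.sum(y == 0) / np.sum(y == 1)              # overall imbalance ratio
    for _ in range(epochs):
        order = rng.permutation(len(y))               # assign batches randomly
        for start in range(0, len(y), batch_size):
            idx = order[start:start + batch_size]
            xb, yb = X[idx], y[idx]
            # Forward pass: predicted class of each sample in the batch.
            pred = (sigmoid(xb @ theta) >= 0.5).astype(int)
            # Cost matrix, Eqs (4) and (5), from the batch G_mean and ACC.
            g_mean, acc = batch_metrics(yb, pred)
            lam = ir * np.exp(-g_mean / 2.0) * np.exp(-acc / 2.0)
            cost = np.where(pred == 1, lam, 1.0)
            # Gradient, Eq. (12), and parameter update, Eq. (11).
            o = sigmoid(cost * (xb @ theta))
            grad = xb.T @ (cost * (o - yb)) / len(yb)
            theta = theta - lr * grad
    return theta
```

A usage example would pass a feature matrix with a trailing column of ones (as a bias term) and a 0/1 label vector; the returned vector plays the role of $\theta^{*}$ .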

4. Experiments and results

4.1 Performance evaluation metrics for CIP

The selection of evaluation metrics affects the objectivity and fairness of the final assessment. Most classifiers are empirically assessed by the overall accuracy rate. However, since it cannot reflect false positive samples, accuracy is not an appropriate metric for CIP. Therefore, it is necessary to complement the overall accuracy with other effective metrics.

In Table 1 (confusion matrix [38]), TP and TN are the correctly classified positive and negative samples, respectively. FP denotes negative samples misclassified as positive, and FN denotes positive samples misclassified as negative.

(1) True positive rate (TPR), also called recall or sensitivity, is the proportion of positive samples that are correctly classified.

$\displaystyle\text{TPR}=\frac{\text{TP}}{\text{TP}+\text{FN}}$ (13)

(2) True negative rate (TNR), also called specificity, is the proportion of negative samples that are correctly classified.

$\displaystyle\text{TNR}=\frac{\text{TN}}{\text{FP}+\text{TN}}$ (14)

(3) Positive predictive value (PPV), also called precision, is the proportion of samples predicted as positive that are truly positive.

$\displaystyle\text{PPV}=\frac{\text{TP}}{\text{TP}+\text{FP}}$ (15)

(4) $F_{1}$ score is the harmonic mean of PPV and TPR.

$\displaystyle{F}_{1}=\frac{2\text{PPV}\times\text{TPR}}{\text{PPV}+\text{TPR}}$ (16)

(5) $G_{\text{mean}}$ is the geometric mean of TNR and TPR.

$\displaystyle{G}_{\text{mean}}=\sqrt{\text{TPR}\times\text{TNR}}$ (17)

(6) Dominance [14] evaluates the relationship between TPR and TNR.

$\displaystyle\text{Dominance}=\text{TPR}-\text{TNR}$ (18)

(7) The receiver operating characteristic curve (ROC curve) plots TPR against the false positive rate (FPR $=1-\text{TNR}$ ), while the precision-recall curve (PR curve) plots precision against recall. Both are usually summarized by the area under the curve (AUC). ROCAUC and PRAUC are applied in this paper.
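Metrics (1) to (6) follow directly from the four confusion matrix counts of Table 1. The sketch below computes them from Eqs (13) to (18); the function name, the dictionary keys, the zero-denominator convention and the example counts are illustrative assumptions.

```python
import math

def _ratio(num, den):
    """Safe division: return 0.0 when the denominator is zero (assumed convention)."""
    return num / den if den else 0.0

def imbalance_metrics(tp, tn, fp, fn):
    """Compute Eqs (13)-(18) from confusion matrix counts."""
    tpr = _ratio(tp, tp + fn)                  # Eq. (13), recall / sensitivity
    tnr = _ratio(tn, fp + tn)                  # Eq. (14), specificity
    ppv = _ratio(tp, tp + fp)                  # Eq. (15), precision
    f1 = _ratio(2 * ppv * tpr, ppv + tpr)      # Eq. (16)
    g_mean = math.sqrt(tpr * tnr)              # Eq. (17)
    dominance = tpr - tnr                      # Eq. (18)
    return {"TPR": tpr, "TNR": tnr, "PPV": ppv,
            "F1": f1, "G_mean": g_mean, "Dominance": dominance}

# Hypothetical counts: 50 positives (40 found), 950 negatives (900 correct).
scores = imbalance_metrics(tp=40, tn=900, fp=50, fn=10)
accuracy = (40 + 900) / 1000
```

In this hypothetical case the accuracy is 0.94 while $F_{1}$ is only about 0.57, which illustrates why accuracy alone is a poor guide on imbalanced data.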

4.2 Experimental dataset

The proposed method was tested on six datasets, which are shown in Table 2. Of the six time series datasets, FaceAll, Swedish Leaf (S-Leaf) and Adiac were multiclass and were converted into binary datasets by selecting one class as the positive class while the others were negative (as in [6, 7, 31]). Five of the applied datasets were from the public UCR time series repository [10], and a real-life time series dataset from the AAIA16 Data Mining Challenge (Predicting Dangerous Seismic Events in Active Coal Mines [4]) was also used. The organizer, Knowledge Pit (a Polish data challenge platform: https://knowledgepit.fedcsis.org/), offered a large volume dataset to predict increased coal mine seismic activities that endanger coal workers working underground. Note that all the datasets in Table 2 were well prepared, cleaned of malformed and erroneous values, and without missing attributes. Because these datasets are already split, the original train/test split was used.

Table 2

Experimental datasets

Datasets	Training			Test
	Pos	Neg	IR	Pos	Neg
FaceAll	40	520	13.00	71	1619
S-Leaf	30	470	15.67	47	578
Adiac	11	379	34.45	14	377
Wafer	98	902	9.20	666	5498
Yoga	136	164	1.21	1392	1608
Coal mine seismic	2963	130188	43.94	196	3664
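The one-versus-rest conversion described above and the IR column of Table 2 can be reproduced with a short sketch. The helper names are hypothetical, and IR is assumed to be the number of negative samples divided by the number of positive samples in the training set, which matches the values listed in Table 2.

```python
def binarize_one_vs_rest(labels, positive_class):
    """Map the chosen class to 1 (positive) and every other class to 0 (negative)."""
    return [1 if lab == positive_class else 0 for lab in labels]

def imbalance_ratio(binary_labels):
    """IR = number of negatives / number of positives (assumed definition)."""
    pos = sum(binary_labels)
    neg = len(binary_labels) - pos
    if pos == 0:
        raise ValueError("imbalance ratio is undefined without positive samples")
    return neg / pos

# Training-set counts (Pos, Neg) copied from Table 2.
table2_train = {
    "FaceAll": (40, 520),
    "S-Leaf": (30, 470),
    "Adiac": (11, 379),
    "Wafer": (98, 902),
    "Yoga": (136, 164),
    "Coal mine seismic": (2963, 130188),
}
ratios = {name: round(neg / pos, 2) for name, (pos, neg) in table2_train.items()}
```

For example, the FaceAll training set has 520 negative and 40 positive samples, which gives an IR of 13.00.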

4.3 Experiment setup

First, the cost-sensitive CNN (CS-CNN, see Fig. 1) was tested against several data samplers and two existing ITSC methods, Integrated Oversampling with SVM (INOS $+$ SVM) [6] and Hybrid sample with bagging (H-sample $+$ Bagging) [31], on five UCR datasets. Then the four temporal CNN-based models mentioned in Section 3.2 were combined with the proposed method and compared against INOS $+$ SVM and H-sample $+$ Bagging on the coal mine seismic dataset.

Except for normalization and one-hot encoding on the seismic dataset, the proposed method did not involve any manual feature engineering. All hidden layers used rectified linear units (ReLU) as activation functions, and the final output layers were sigmoid units. All networks were trained with Adam [45], with a learning rate of 0.001, $\beta_{1}=$ 0.9, $\beta_{2}=$ 0.999 and $\varepsilon=$ 1e $-$ 8. The compared data-sampling approaches are displayed in Table 3. For a comprehensive and objective evaluation, this work adopted 10-fold stratified cross validation. The final results are average performance values. Batch size was fixed at 128 and 512 for the UCR and seismic datasets, respectively. The aforementioned convolutional neural networks were implemented in the deep learning framework TensorFlow. Experiments were accelerated on a GTX1080 GPU. Some of the data-sampling methods were imported from a Python package called Imbalanced-Learn [30]. All experiments were executed on a PC with an Intel i7-6700K 4.0 GHz processor and 32 GB of RAM.
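For readers unfamiliar with the optimizer settings quoted above, the following sketch shows one step of the standard Adam update rule with those hyperparameters as defaults. It is a generic NumPy illustration of the published Adam algorithm, not the TensorFlow code used in the experiments; the function name and example values are hypothetical.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (t starts at 1); returns (theta, m, v)."""
    m = beta1 * m + (1.0 - beta1) * grad            # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2       # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                  # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Hypothetical parameters and gradient for a single first step.
theta0 = np.array([1.0, -2.0])
grad0 = np.array([0.5, -4.0])
theta1, m1, v1 = adam_step(theta0, grad0, np.zeros(2), np.zeros(2), t=1)
```

On the first step the bias-corrected moments reduce to the gradient and its square, so each parameter moves by approximately the learning rate (0.001) in the direction opposite to its gradient, regardless of the gradient's magnitude.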

Table 3
Data-sampling methods

Over-sampling	Under-sampling	Combined sampling
Random over sampling	Random under sampling	SMOTE $+$ ENN [3]
SMOTE [9]	IHT [39]	SMOTE $+$ TL [2]
SMOTE B1 [18]	NM [33]	H-sample [31]
SMOTE B2 [18]	TL [41]
SMOTE SVM [35]	ENN [46]
ADASYN [16]	OSS [26]
INOS [6]	NCR [29]

4.4 Comparison of the performance of cost-sensitive learning strategy, data-sampling and ITSC methods on UCR datasets

Seventeen data samplers were selected for use with standard CNN classifiers on five UCR datasets [10]. The performance of two existing ITSC methods (INOS $+$ SVM [6] and H-sample combined with bagging [31]) is also reported. The above methods are compared with the proposed cost-sensitive CNN (CS-CNN, see Fig. 1) on two evaluation metrics, $F_{1}$ and $G_{\text{mean}}$ , as shown in Tables 4 and 5. The best result in each metric is emphasized in bold type, and the last column, Rank, is the average rank score of each method on the five UCR datasets.

The proposed CS-CNN performed better than all data-sampling methods. It achieved average $F_{1}$ and $G_{\text{mean}}$ values of 0.932 and 0.958 on the five UCR datasets, respectively (Tables 4 and 5). These were clearly better than those of the seventeen data-sampling methods and of one ITSC method (INOS $+$ SVM). The other ITSC approach (H-sample $+$ Bagging) achieved the top ranking with a slight advantage. Nevertheless, the proposed CS-CNN was effective on imbalanced time series classification tasks.

Table 4
Comparison of the performance of cost-sensitive strategy, data-sampling and ITSC methods on evaluation metric $F_{1}$ on UCR datasets

Methods	$F_{1}$
	FaceAll	S-Leaf	Adiac	Wafer	Yoga	Average	STD	Rank
ROS $+$ CNN	0.234	0.481	0.199	0.742	0.827	0.497	0.286	14.60
SMOTE $+$ CNN	0.250	0.515	0.193	0.749	0.829	0.507	0.286	10.00
SMOTE B1 $+$ CNN	0.254	0.500	0.193	0.756	0.827	0.506	0.286	11.40
SMOTE B2 $+$ CNN	0.263	0.512	0.183	0.747	0.814	0.504	0.281	13.60
SMOTE SVM $+$ CNN	0.259	0.492	0.186	0.745	0.828	0.502	0.285	13.20
ADASYN $+$ CNN	0.253	0.508	0.197	0.746	0.824	0.506	0.282	11.60
INOS $+$ CNN	0.245	0.549	0.208	0.736	0.775	0.503	0.266	12.40
RUS $+$ CNN	0.251	0.503	0.190	0.767	0.821	0.506	0.288	13.40
IHT $+$ CNN	0.242	0.515	0.184	0.792	0.825	0.512	0.299	11.60
NM $+$ CNN	0.243	0.497	0.197	0.787	0.820	0.509	0.292	12.60
TL $+$ CNN	0.243	0.500	0.191	0.783	0.820	0.507	0.293	13.60
ENN $+$ CNN	0.257	0.497	0.198	0.769	0.826	0.509	0.287	10.80
OSS $+$ CNN	0.250	0.514	0.195	0.796	0.825	0.516	0.295	9.60
NCR $+$ CNN	0.242	0.500	0.191	0.779	0.824	0.507	0.294	14.00
SMOTE ENN $+$ CNN	0.241	0.503	0.191	0.774	0.827	0.507	0.293	12.40
SMOTE TL $+$ CNN	0.237	0.545	0.183	0.773	0.829	0.513	0.297	11.80
H-Sample $+$ CNN	0.266	0.807	0.503	0.807	0.829	0.642	0.250	4.00
INOS $+$ SVM [6]	0.936	0.904	0.800	0.985	0.724	0.870	0.106	5.80
H-Sample $+$ Bagging [31]	0.995	0.932	0.975	0.980	0.926	0.962	0.031	1.40
CS-CNN	0.939	0.931	0.963	0.987	0.839	0.932	0.056	1.80

Table 5

Comparison of the performance of cost-sensitive strategy, data-sampling and ITSC methods on evaluation metric $G_{\text{mean}}$ on UCR datasets

Methods	$G_{\text{mean}}$
	FaceAll	S-Leaf	Adiac	Wafer	Yoga	Average	STD	Rank
ROS $+$ CNN	0.365	0.564	0.332	0.742	0.832	0.567	0.222	15.40
SMOTE $+$ CNN	0.378	0.594	0.326	0.773	0.838	0.582	0.229	9.80
SMOTE B1 $+$ CNN	0.381	0.581	0.326	0.780	0.833	0.580	0.228	11.00
SMOTE B2 $+$ CNN	0.389	0.590	0.318	0.772	0.820	0.578	0.224	13.20
SMOTE SVM $+$ CNN	0.386	0.573	0.320	0.771	0.836	0.577	0.228	13.00
ADASYN $+$ CNN	0.380	0.586	0.335	0.771	0.832	0.581	0.224	10.40
INOS $+$ CNN	0.374	0.617	0.341	0.763	0.782	0.575	0.209	12.20
RUS $+$ CNN	0.379	0.581	0.328	0.788	0.828	0.581	0.228	11.80
IHT $+$ CNN	0.372	0.597	0.319	0.809	0.831	0.585	0.238	12.00
NM $+$ CNN	0.372	0.577	0.331	0.806	0.825	0.582	0.233	12.80
TL $+$ CNN	0.372	0.579	0.325	0.802	0.828	0.581	0.234	13.60
ENN $+$ CNN	0.384	0.577	0.332	0.790	0.833	0.583	0.228	10.40
OSS $+$ CNN	0.378	0.590	0.329	0.813	0.833	0.589	0.235	9.20
NCR $+$ CNN	0.371	0.579	0.325	0.798	0.829	0.580	0.234	14.20
SMOTE ENN $+$ CNN	0.371	0.581	0.325	0.830	0.795	0.580	0.233	13.80
SMOTE TL $+$ CNN	0.367	0.614	0.314	0.832	0.794	0.584	0.238	13.40
H-Sample $+$ CNN	0.392	0.826	0.581	0.823	0.836	0.692	0.199	4.40
INOS $+$ SVM [6]	0.945	0.962	0.882	0.985	0.750	0.905	0.095	6.40
H-Sample $+$ Bagging [31]	0.997	0.976	0.989	0.988	0.976	0.985	0.009	1.40
CS-CNN	0.963	0.964	0.999	0.991	0.872	0.958	0.050	1.60

4.5 Comparison of the performance of cost-sensitive learning strategy and ITSC methods on a real-life large volume dataset

For further validation, we extended our proposed cost-sensitive strategy to four convolutional neural networks, as mentioned in Section 3.2. CS-CNN, CS-FCN, CS-LSTM-FCN and CS-ResNet were compared with INOS $+$ SVM and H-sample $+$ Bagging on a real-life large volume ITSC dataset. Results are presented in Table 6 and Fig. 5. Considering that the AAIA16 Data Mining Challenge used ROCAUC for performance evaluation, we added the metrics Dominance, ROCAUC and PRAUC in this experiment to round out the testing. The best result on each metric is emphasized in bold type, and the corresponding standard deviations are shown in parentheses.

The results in Table 6 and Fig. 5 show that the four cost-sensitive networks achieved better average ranks than the two ITSC methods. CS-ResNet came out ahead with the best average rank (1.80), followed by CS-CNN, CS-LSTM-FCN and CS-FCN in second to fourth place. These results show that the proposed method can carry out ITSC tasks. INOS $+$ SVM and H-sample $+$ Bagging scored no higher than 0.565 on $F_{1}$ , $G_{\text{mean}}$ , ROCAUC and PRAUC in this experiment, which suggests that they are not suitable for large volume time series datasets. Deep learning, by contrast, showcases effective end-to-end classification on large volume time series datasets.

Table 6
Comparison of the performance of cost-sensitive strategy and ITSC methods on real-life datasets

Methods	$F_{1}$		$G_{\text{mean}}$		ROCAUC		PRAUC		Dominance	Average rank
INOS $+$ SVM	0.048	(0.0510)	0.500	(0.1756)	0.500	(0.0004)	0.275	(0.2635)	$-$ 0.949	5.00
H-Sample $+$ Bagging	0.161	(0.0924)	0.360	(0.1222)	0.565	(0.0745)	0.230	(0.1096)	$-$ 0.807	5.40
CS-CNN	0.441	(0.0141)	0.537	(0.0111)	0.908	(0.0048)	0.363	(0.0208)	$-$ 0.593	2.00
CS-FCN	0.275	(0.0977)	0.520	(0.0595)	0.703	(0.1000)	0.150	(0.0625)	$-$ 0.569	3.80
CS-ResNet	0.455	(0.0259)	0.563	(0.0121)	0.916	(0.0064)	0.346	(0.0189)	$-$ 0.667	1.80
CS-LSTM-FCN	0.354	(0.0951)	0.453	(0.1130)	0.904	(0.0111)	0.343	(0.0219)	$-$ 0.561	3.00

Figure 5. Comparison of the performance of cost-sensitive strategy and ITSC methods.

5. Discussion

5.1 Deep learning for TSC

In [49], the multi-channels deep CNN (MD-CNN) was used for TSC tasks, and it was also adopted in this paper. The temporal CNN was applied with different parameters, and other popular CNN-based networks were explored [23, 45]. In contrast to those works, our goal is to apply deep learning algorithms to ITSC. To avoid under-representing the minority time series samples, we proposed an adaptive cost-sensitive learning strategy to improve the above CNN-based networks, which were tested on six time series datasets.

5.2 Data sampling and ITSC

Data sampling is the most direct approach to class imbalance, but it changes the original distribution of the raw dataset. Data sampling also has drawbacks such as over-fitting, discarding useful information and being time-consuming. Several data samplers were combined with the temporal CNN and investigated. As [5] claimed, over-sampling does not necessarily lead to over-fitting of CNNs. However, the best sampler could not be determined definitively in this paper (see Tables 4–6).

Previous ITSC methods such as over-sampling with SVM [6] and combined sampling with bagging [31] were tested in this paper and showed outstanding performance on univariate time series datasets. Unfortunately, when facing large volume multivariate time series, these methods did not work. Our method, however, avoided the drawbacks of data sampling and was able to cope with large volume, high dimensional, imbalanced time series datasets.

5.3 Cost-sensitive strategy

This paper proposed a cost-sensitive method to improve CNN-based networks. Similar works [25, 36] applied CNNs to imbalanced classification. However, the approach described above is novel: previous research only considered the overall imbalance ratio of the training set, while in this paper the local performance was also taken into account for updating the designed cost matrix.

5.4 Limitations and future work

The limitations of this work include the lack of multi-class considerations: only binary ITSC cases were analyzed. In [34], the authors segmented the time series into bins using time stamp information and converted time series regression tasks into ITSC tasks. Random under-sampling was applied in combination with SMOTE over-sampling. This idea inspires us to address time series regression issues in future work.

6. Conclusion

In this paper, we proposed an adaptive cost-sensitive learning strategy to address the ITSC issue. Convolutional neural networks were converted to cost-sensitive algorithms. Tested on five public UCR datasets, CS-CNN achieved convincing performance. It also outperformed a series of existing data samplers, as well as a traditional ITSC method. The approach was extended to three other CNN-based networks and compared with the two ITSC methods on a real-life large-volume dataset. The cost-sensitive networks were superior to the other methods, and CS-ResNet performed best. The modified networks can address large-volume ITSC tasks effectively.

References

1. Acharya U.R., S.L., Hagiwara, Tan J.H., Adam, Gertych and Tan R.S., A deep convolutional neural network model to classify heartbeats, Computers in Biology and Medicine 89 (2017), 389–396.

2. Batista G.E., Bazzan A.L. and Monard M.C., Balancing Training Data for Automated Annotation of Keywords: a Case Study, in: Proceedings of the Second Brazilian Workshop on Bioinformatics, BSB, Macaé, RJ, 2003, pp. 35–43.

3. Batista G.E., Prati R.C. and Monard M.C., A study of the behavior of several methods for balancing machine learning training data, ACM SIGKDD Explorations Newsletter 6 (2004), 20–29.

4. Boullé, Predicting Dangerous Seismic Events in Coal Mines under Distribution Drift, in: 2016 Federated Conference on Computer Science and Information Systems (FedCSIS), IEEE, Gdansk, Poland, 2016, pp. 221–224.

5. Buda, Maki and Mazurowski M.A., A Systematic Study of the Class Imbalance Problem in Convolutional Neural Networks, arXiv preprint arXiv:1710.05381, 2017.

6. Cao, X.L., Woon D.Y.K. and Ng S.K., Integrated oversampling for imbalanced time series classification, IEEE Transactions on Knowledge and Data Engineering 25 (2013), 2809–2822.

7. Cao, X.L., Woon Y.K. and Ng S.K., SPO: Structure Preserving Oversampling for Imbalanced Time Series Classification, in: 2011 IEEE 11th International Conference on Data Mining, IEEE, Vancouver, BC, Canada, 2011, pp. 1008–1013.

8. Chambon, Galtier, Arnal, Wainrib and Gramfort, A deep learning architecture for temporal sleep stage classification using multivariate and multimodal time series, IEEE Transactions on Neural Systems and Rehabilitation Engineering 26 (2018), 758–769.

9. Chawla N.V., Bowyer K.W., Hall L.O. and Kegelmeyer W.P., SMOTE: synthetic minority over-sampling technique, Journal of Artificial Intelligence Research 16 (2002), 321–357.

10. Chen, Keogh, Begum, Bagnall, Mueen and Batista, The UCR time series classification archive (2015), URL www.cs.ucr.edu/~eamonn/time_series_data (2016).

11. Cui, Chen and Chen, Multi-scale convolutional neural networks for time series classification, arXiv preprint arXiv:1603.06995, 2016.

12. Günnemann and Pfeffer, Predicting Defective Engines using Convolutional Neural Networks on Temporal Vibration Signals, in: Proceedings of the First International Workshop on Learning with Imbalanced Domains: Theory and Applications, PMLR, Dublin, Ireland, 2017, pp. 92–102.

13. Gamboa J.C.B., Deep Learning for Time-Series Analysis, arXiv preprint arXiv:1701.01887, 2017.

14. García, Mollineda R.A. and Sánchez J.S., Index of Balanced Accuracy: A Performance Measure for Skewed Class Distributions, in: Conference on Pattern Recognition and Image Analysis, Springer, Berlin, Heidelberg, 2009, pp. 441–448.

15. Glorot, Bordes and Bengio, Deep Sparse Rectifier Neural Networks, in: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, PMLR, Fort Lauderdale, FL, USA, 2011, pp. 315–323.

16. Haibo, Yang, Garcia E.A. and Shutao, ADASYN: Adaptive synthetic sampling approach for imbalanced learning, in: IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), IEEE, Hong Kong, China, 2008, pp. 1322–1328.

17. Haixiang, Yijing, Shang, Mingyun, Yuanyue and Bing, Learning from class-imbalanced data: review of methods and applications, Expert Systems with Applications 73 (2017), 220–239.

18. Han, Wang W.-Y. and Mao B.-H., Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning, in: International Conference on Intelligent Computing, Springer, Hefei, China, 2005, pp. 878–887.

19. Duan, Peng, Jing, Qian and Wang, Early classification on multivariate time series, Neurocomputing 149 (2015), 777–787.

20. and Garcia E.A., Learning from imbalanced data, IEEE Transactions on Knowledge and Data Engineering 21 (2009), 1263–1284.

21. Janusz, Grzegorowski, Michalak, Wróbel, Sikora and Slezak, Predicting seismic events in coal mines based on underground sensor measurements, Engineering Applications of Artificial Intelligence 64 (2017), 83–94.

22. Köknar-Tezel and Latecki L.J., Improving SVM classification on imbalanced time series data sets with ghost points, Knowledge and Information Systems 28 (2011), 1–23.

23. Karim, Majumdar, Darabi and Chen, LSTM fully convolutional networks for time series classification, IEEE Access 6 (2018), 1662–1669.

24. Kasfi K.T., Hellicar and Rahman, Convolutional Neural Network for Time Series Cattle Behaviour Classification, in: Proceedings of the Workshop on Time Series Analytics and Applications, ACM, Hobart, TAS, Australia, 2016, pp. 8–12.

25. Khan S.H., Hayat, Bennamoun, Sohel F.A. and Togneri, Cost-sensitive learning of deep feature representations from imbalanced data, IEEE Transactions on Neural Networks and Learning Systems (2018), 1–15.

26. Kubat and Matwin, Addressing the Curse of Imbalanced Training Sets: One-Sided Selection, in: Proceedings of the Fourteenth International Conference on Machine Learning, ACM, Nashville, USA, 1997, pp. 179–186.

27. Kukar and Kononenko, Cost-Sensitive Learning with Neural Networks, in: European Conference on Artificial Intelligence, John Wiley and Sons, Brighton, UK, 1998, pp. 445–449.

28. López, Fernández, García, Palade and Herrera, An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics, Information Sciences 250 (2013), 113–141.

29. Laurikkala, Improving Identification of Difficult Small Classes by Balancing Class Distribution, in: Conference on Artificial Intelligence in Medicine in Europe, Springer, Cascais, Portugal, 2001, pp. 63–66.

30. Lemaître, Nogueira and Aridas C.K., Imbalanced-learn: a python toolbox to tackle the curse of imbalanced datasets in machine learning, Journal of Machine Learning Research 18 (2017), 1–5.

31. Liang, An Effective Method for Imbalanced Time Series Classification: Hybrid Sampling, in: Proceedings of the 26th Australasian Joint Conference on AI 2013, Springer, Dunedin, New Zealand, 2013, pp. 374–385.

32. Liang and Zhang, A Comparative Study of Sampling Methods and Algorithms for Imbalanced Time Series Classification, in: Australasian Joint Conference on Artificial Intelligence, Springer, Melbourne, VIC, Australia, 2012, pp. 637–648.

33. Mani and Zhang, KNN Approach to Unbalanced Data Distributions: A Case Study Involving Information Extraction, in: Proceedings of the ICML’03 Workshop on Learning from Imbalanced Datasets, AAAI, Washington, DC, USA, 2003.

34. Moniz, Branco and Torgo, Resampling Strategies for Imbalanced Time Series, in: 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), IEEE, Montreal, QC, Canada, 2016, pp. 282–291.

35. Nguyen H.M., Cooper E.W. and Kamei, Borderline over-sampling for imbalanced data classification, International Journal of Knowledge Engineering and Soft Data Paradigms 3 (2011), 4–21.

36. Raj, Magg and Wermter, Towards effective classification of imbalanced data with convolutional neural networks, in: IAPR Workshop on Artificial Neural Networks in Pattern Recognition, Springer, Ulm, Germany, 2016, pp. 150–162.

37. Roychoudhury, Ghalwash and Obradovic, Cost Sensitive Time-Series Classification, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, Skopje, Macedonia, 2017, pp. 495–511.

38. Sammut and Webb G.I., Encyclopedia of Machine Learning, Springer Publishing Company, Incorporated, 2011, pp. 209–210.

39. Smith M.R., Martinez and Giraud-Carrier, An instance level analysis of data complexity, Machine Learning 95 (2014), 225–256.

40. Thodoroff, Pineau and Lim, Learning Robust Features using Deep Learning for Automatic Seizure Detection, in: Proceedings of the 1st Machine Learning for Healthcare Conference, PMLR, Los Angeles, CA, USA, 2016, pp. 178–190.

41. Tomek, Two modifications of CNN, IEEE Trans. Systems, Man and Cybernetics 6 (1976), 769–772.

42. Wang, Feng and Han, Fault detection for the class imbalance problem in semiconductor manufacturing processes, Journal of Circuits, Systems and Computers 23 (2014), 1450049: 1–20.

43. Wang, Liu, Cao, Meng and Kennedy P.J., Training Deep Neural Networks on Imbalanced Data Sets, in: 2016 International Joint Conference on Neural Networks (IJCNN), IEEE, Vancouver, BC, Canada, 2016, pp. 4368–4374.

44. Wang, Chen, Wang, Rai and Carin, Earliness-Aware Deep Convolutional Networks for Early Time Series Classification, arXiv preprint arXiv:1611.04578, 2016.

45. Wang, Yan and Oates, Time Series Classification from Scratch with Deep Neural Networks: A Strong Baseline, in: 2017 International Joint Conference on Neural Networks (IJCNN), IEEE, Anchorage, AK, USA, 2017, pp. 1578–1585.

46. Wilson D.L., Asymptotic properties of nearest neighbor rules using edited data, IEEE Transactions on Systems, Man, and Cybernetics SMC-2 (1972), 408–421.

47. Keogh, Shelton, Wei and Ratanamahatana C.A., Fast Time Series Classification Using Numerosity Reduction, in: Proceedings of the 23rd International Conference on Machine Learning, ACM, Pittsburgh, Pennsylvania, USA, 2006, pp. 1033–1040.

48. Yang and Wu, 10 challenging problems in data mining research, International Journal of Information Technology & Decision Making 5 (2006), 597–604.

49. Zheng, Liu, Chen and Zhao J.L., Time Series Classification Using Multi-Channels Deep Convolutional Neural Networks, in: International Conference on Web-Age Information Management, Springer, Nanchang, China, 2014, pp. 298–310.

50. Zheng, Liu, Chen and Zhao J.L., Exploiting multi-channels deep convolutional neural networks for multivariate time series classification, Frontiers of Computer Science 10 (2016), 96–112.

51. Zhou Z.-H. and Liu X.-Y., Training cost-sensitive neural networks with methods addressing the class imbalance problem, IEEE Transactions on Knowledge and Data Engineering 18 (2006), 63–77.


¹ A Polish data challenge platform: https://knowledgepit.fedcsis.org/.

Table 3
Data-sampling methods

Table 4
Comparison of the performance of cost-sensitive strategy, data-sampling and ITSC methods on evaluation metric $F_{1}$ on UCR datasets

Table 6
Comparison of the performance of cost-sensitive strategy and ITSC methods on real-life datasets