Combating the class imbalance problem in sparse representation learning

Abstract

Recent studies have shown that sparse representation learning is a promising method for pattern classification, but very few have addressed the class imbalance problems that arise in its applications. This problem is particularly important because it causes suboptimal classification performance, especially when the cost of misclassifying a minority-class example is substantial. Prior test sample sparse representation methods were developed on balanced data sets, which do not reflect the data distribution in real applications. We propose a novel sparse representation learning algorithm, the Balanced Sparse Representation Classifier (BSRC), which takes into account the contribution of heavily under-represented minority classes. Our solution first estimates the contribution of each training sample in each class and then identifies the nearest neighbors with the largest contributions. Next, the test sample is expressed as a linear combination of all the nearest samples. Finally, the decision is made according to the sum of contributions for each class. We also present a kernel extension of the proposed classifier to deal with complex data. Experimental results show that the proposed learning approach makes it possible to design better methods for tackling the class imbalance problem in sparse representation learning.

Keywords

Machine learning, sparse representation, class imbalance, classification

1 Introduction

The class imbalance problem [1] has been an active area of research over the past several years because it is common in machine learning tasks such as medical diagnosis [2], text classification [3], and face recognition [4]. Traditional classifiers have difficulty handling real-world data sets with imbalanced classes, in which the training set of the majority class is far larger than that of the minority class. Since the minority class is heavily under-represented in comparison to the majority class, the classifier tends to misclassify minority samples as majority samples to obtain higher accuracy. This has become an increasingly important issue, as many problems in machine learning and pattern recognition involve building adaptive classifiers on different data distributions. Existing approaches to the class imbalance problem mainly include data level methods and algorithmic level methods. We focus on binary classification only and study improved sparse representation methods at the algorithmic level.

Here, we investigate the test sample sparse representation method for addressing the class imbalance problem, which is common in real applications. This problem is particularly important, since it causes suboptimal classification performance, especially when the cost of misclassifying a minority-class example is substantial. In this letter, a new balanced classifier based on test sample sparse representation, called the Balanced Sparse Representation Classifier (BSRC), is proposed. It offers applicability and flexibility in the following respects.

Different from previous methods, BSRC represents the test sample using a subset of the training data from each class, which makes it less susceptible to the under-representation caused by small sample sizes.

By combining the merits of the kernel method and sparse representation, the proposed method also inherits the advantages of kernel techniques. This improvement suits real-world data, whether or not the data set is balanced.

After the related work section, the BSRC and the kernelized extension are described in detail.

2 Related work

In this section, we briefly introduce related work, including solutions to the class imbalance problem and sparse representation learning.

2.1 Learning on class imbalance dataset

The main solutions to the class imbalance problem include data level, algorithm level, integration, and ensemble learning methods.

Data level. The main idea at the data level is to change the distribution of the original data sets by increasing the number of minority class samples or reducing the number of majority class samples. According to the balancing strategy, these methods are divided into over-sampling and under-sampling methods. Over-sampling increases the number of minority class samples; examples include the sample generation technique [5], the security samples generation technique [6], the generation of samples near the border [7], the adaptive synthetic samples generation technique [8], the hierarchical clustering samples generation technique [9], and so on. Under-sampling discards part of the majority class samples; examples include random under-sampling [10], under-sampling based on a genetic algorithm [11], dynamic under-sampling [12], and so on.

Algorithm level. The main idea of algorithm level methods is to reduce the loss of information about minority classes by modifying an existing classifier. A classifier with an adjusted classification boundary is then constructed to reduce the bias towards majority classes. Many studies have improved traditional classification algorithms, such as the decision tree based on class confidence proportion [13], the kNN algorithm based on class confidence [14], the SVM algorithm based on swarm optimization [15], and so on. In addition, there are cost sensitive methods, which consider the misclassification cost of each sample. These methods assign different cost ratios to misclassified cases, i.e. it is more expensive to misclassify an actual minority class sample as a majority class sample. Examples include the cost sensitive SVM [16], the cost sensitive KNN algorithm [17], and cost sensitive contrast pattern-based classifiers [18]. The biggest obstacle for these methods is that many data sets do not come with cost matrices [19].

Integration method. These methods combine the advantages of data level and algorithmic level methods while reducing their shortcomings [20]. The most popular approach combines a data level solution with an ensemble classifier to yield efficient learning methods [21]. In addition, a lot of research has been devoted to combining sampling methods with cost sensitive methods [22]. At the same time, an integration method can also inherit the shortcomings of the methods it combines, so the relationship between the combined methods needs to be analyzed.

Ensemble method. At present, the combination of sampling ideas and ensemble learning methods is widely used for the class imbalance problem, including SMOTEBoost, which combines over-sampling with Boosting [23], and RUSBoost, which integrates under-sampling with Boosting [24]. The subsequent EasyEnsemble combines Bagging, random under-sampling, and AdaBoost [25]. Recently, a new ensemble method transforms the imbalanced problem into multiple balanced problems without changing the original data distribution [26]. However, such algorithms may suffer from overfitting and data loss.

Most recently, Ofek et al. [27] proposed a fast clustering-based under-sampling method named Fast-CBUS for binary-class imbalance problems. Kuo et al. [28] proposed a new preprocessing method based on particle swarm K-means optimization to deal with the class imbalance problem in prostate cancer prognosis. Roy et al. [29] combined dynamic selection and data preprocessing to choose the best ensemble for a given test instance.
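As a point of reference for the data level methods surveyed above, random under-sampling can be sketched in a few lines. This is only an illustrative sketch, not the procedure of any cited work: the function name `random_undersample`, the row-wise sample layout, and the choice to shrink the majority class to exactly the minority size are assumptions made for the example.

```python
import numpy as np

def random_undersample(X, y, minority_label, seed=0):
    """Randomly drop majority samples until both classes have the same size.

    Assumes binary labels and at least as many majority as minority samples.
    Rows of X are samples (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = np.flatnonzero(y != minority_label)
    # Keep as many majority samples as there are minority samples.
    keep_maj = rng.choice(maj_idx, size=min_idx.size, replace=False)
    keep = np.sort(np.concatenate([min_idx, keep_maj]))
    return X[keep], y[keep]

# Toy data: 2 minority samples (label 1) and 8 majority samples (label 0).
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
X_bal, y_bal = random_undersample(X, y, minority_label=1)
print(X_bal.shape, np.bincount(y_bal))
```

The sketch makes the drawback mentioned above concrete: six of the eight majority samples are discarded, which is the information loss that algorithm level methods try to avoid.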

2.2 Sparse Representation Learning

The sparse representation method is currently an active research subject and is attracting increasing attention in pattern classification [30–32]. This method selects, out of all the training data, a few samples that most compactly represent the test data and hence naturally avoids under-fitting and over-fitting [30][33]. However, when a data set is both imbalanced and small, the lack of representative data presents a new challenge to the community [1].

The sparse representation model is simple and easy to understand and operate: it extracts the information of each class and then obtains a correlation coefficient for classification. The more information the sparse representation can obtain from the original data sets, the more effective it is for classification. This effectiveness, however, relies on rich information being available for each class. In most cases the data sets are class imbalanced, so only limited information can be exploited from the minority class [1]. The issue therefore concerns the performance of learning algorithms in the presence of under-represented data and severe class distribution skews.

Motivated by the idea of algorithm level methods, we try to combat the class imbalance problem in the widely used sparse representation method by exploiting the information from each class to fully represent the test data.

3 Combating the Class Imbalance Problem in Sparse Representation Learning

3.1 Theoretic Foundation for Sparse Representation

In the sparse representation method, a test sample y can be represented in terms of all of the training samples as [30]:

$y = XA$ (1)

where y is a test sample, X = [x₁, …, x_n] is the matrix of training samples x₁, …, x_n, and A = [α₁, …, α_n]′, in which α_i is the weight of the ith training sample x_i. First, the residual for each training sample is computed as

$e_{i} = \| y - α_{i} x_{i} \|^{2}$ (2)

Then M training samples are selected for each class according to the residuals. On these samples, the representation in Equation (1) is applied again, and the contribution of each class c is computed as

${con}_{c} = \| y - \sum_{x_{i} \in c} α_{i} x_{i} \|^{2}$ (3)

Many algorithms based on sparse representation theory have been proposed to tackle real-world learning tasks effectively [32]. They rest on the assumption of the sparse representation classification (SRC) method that the test sample can be sufficiently represented by samples from the training data. Recently, Xu et al. [34] proposed the Two-Phase Test Sample Sparse Representation (TPTSSR) method for face recognition. Their experimental results show that this method can improve the classification accuracy on balanced data sets. However, it exploits only global information and tends to lose local information. Most recently, Liu et al. [35] showed that local information is effective for classification and proposed a weighting method named WTPTSSR to extend the two-phase method. However, because the class imbalance problem is not considered in the first representation step, the test data representation still suffers from heavily under-represented minority classes.

These improvements were all proposed on evenly distributed data sets. However, labeled data in real-world applications are class imbalanced in many cases, so these strategies cannot be applied directly. In order to combat the class imbalance problem in sparse representation, we developed the Balanced Sparse Representation Classifier (BSRC) and propose a kernel version, BKSRC.

3.2 Balanced Sparse Representation Classifier

When sparse representation is applied, the numbers of samples that represent the different classes are imbalanced. This problem is particularly important, since the imbalance causes suboptimal classification performance, especially when the cost of misclassifying a minority-class example is substantial. Motivated by this consideration, BSRC represents the test sample separately for each class. Let $X^{i} = [x_{1}^{i}, x_{2}^{i}, \dots, x_{n_{i}}^{i}]$ be the training data matrix of class ω_i, where n_i is the number of training samples from class ω_i, i = 1, …, c, and c is the number of classes. In the first phase, using the sparse representation method, every test sample y can be expressed as:

$\begin{matrix} y = α_{11} x_{1}^{1} + α_{12} x_{2}^{1} + \dots + α_{1 n_{1}} x_{n_{1}}^{1} \\ y = α_{21} x_{1}^{2} + α_{22} x_{2}^{2} + \dots + α_{2 n_{2}} x_{n_{2}}^{2} \\ ⋮ \\ y = α_{c 1} x_{1}^{c} + α_{c 2} x_{2}^{c} + \dots + α_{c n_{c}} x_{n_{c}}^{c} \end{matrix}$ (4)

i.e. $y = X^{i} A_{i}$, where $A_{i} = [α_{i1}, α_{i2}, \dots, α_{i n_{i}}]'$ is the vector of coefficients. Note that the test sample y is expressed by the data from each class separately. This is different from TPTSSR, in which the test sample y is expressed by all the training data. If $X^{i}$ is a nonsingular matrix, A_i can be obtained as $A_{i} = (X^{i})^{-1} y$; otherwise, $A_{i} = ((X^{i})' X^{i} + μ I)^{-1} (X^{i})' y$, where μ is a small positive constant and I is the identity matrix.
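The regularized solution for A_i can be illustrated with a short NumPy sketch. It is a minimal example under stated assumptions rather than the authors' implementation: the function name `class_coefficients` is hypothetical, the columns of each class matrix are taken to be the training samples (as in y = XⁱA_i), and the regularized branch is used unconditionally instead of testing for nonsingularity.

```python
import numpy as np

def class_coefficients(X_i, y, mu=0.01):
    """Return A_i = (X_i' X_i + mu I)^{-1} X_i' y.

    X_i has shape (d, n_i): one column per training sample of class i.
    """
    n_i = X_i.shape[1]
    gram = X_i.T @ X_i + mu * np.eye(n_i)
    return np.linalg.solve(gram, X_i.T @ y)

rng = np.random.default_rng(0)
X_min = rng.normal(size=(5, 3))    # minority class: 3 samples, 5 features
X_maj = rng.normal(size=(5, 40))   # majority class: 40 samples
y = X_min @ np.array([0.5, -0.2, 0.1])   # test sample in the minority span

A_min = class_coefficients(X_min, y, mu=1e-8)  # tiny mu: near-exact fit
A_maj = class_coefficients(X_maj, y)           # mu = 0.01 as in Section 5.1
print(A_min.shape, A_maj.shape)
```

Each class is solved on its own, so the three minority samples get their own coefficient vector instead of competing with forty majority samples inside a single global system.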

The contribution of the j-th training sample in the i-th class is $α_{ij} x_{j}^{i}$. We then use the deviation between the test sample y and this contribution,

$e_{j}^{i} = \| y - α_{ij} x_{j}^{i} \|^{2},$ (5)

to choose the K nearest samples from each class. The smaller the value of $e_{j}^{i}$, the larger the contribution of the sample $x_{j}^{i}$. That is, we choose the samples with the smallest deviations.

For each class, we identify the K nearest neighbors, i.e. those with the K greatest contributions, denoted by ${\tilde{x}}_{1}^{i}, {\tilde{x}}_{2}^{i}, \dots, {\tilde{x}}_{k}^{i}$. TPTSSR obtains good classification accuracy when the data sets are balanced. When the data sets are imbalanced, however, the minority classes are under-represented, and the test sample sparse representation in TPTSSR cannot fully use the information from the minority classes. BSRC uses the training information from every class, regardless of whether the class contains many or few samples.
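The per-class neighbor selection based on Equation (5) can be sketched as follows. The helper name `select_k_nearest` and the column-wise sample layout are assumptions of this example, and capping K at the class size is an assumption too, since the text does not say what happens when a class holds fewer than K samples.

```python
import numpy as np

def select_k_nearest(X_i, A_i, y, K):
    """Indices of the K samples of class i with the smallest deviation
    e_j = ||y - a_ij * x_j||^2, together with all deviations.

    X_i has shape (d, n_i) with one column per sample; A_i has shape (n_i,).
    """
    contributions = X_i * A_i                 # column j becomes a_ij * x_j
    deviations = np.sum((y[:, None] - contributions) ** 2, axis=0)
    K = min(K, X_i.shape[1])                  # assumption: cap K at n_i
    return np.argsort(deviations)[:K], deviations

# Three samples (columns) of one class and hand-picked coefficients.
X_i = np.array([[1.0, 0.0, 2.0],
                [0.0, 1.0, 2.0]])
A_i = np.array([1.0, 1.0, 0.5])
y = np.array([1.0, 1.0])

idx, dev = select_k_nearest(X_i, A_i, y, K=1)
print(idx, dev)
```

In this toy case the third sample scaled by 0.5 reproduces y exactly, so its deviation is zero and it is the one selected.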

The second phase of BSRC is similar to that of TPTSSR. We combine all the neighbors from each class, i.e. $\tilde{X} = [{\tilde{x}}_{1}^{1}, \dots, {\tilde{x}}_{k}^{1}, {\tilde{x}}_{1}^{2}, \dots, {\tilde{x}}_{k}^{2}, \dots, {\tilde{x}}_{1}^{c}, \dots, {\tilde{x}}_{k}^{c}]$. Then the test sample can be expressed by the following linear equation:

$\begin{matrix} y = & b_{11} {\tilde{x}}_{1}^{1} + \dots + b_{1 k} {\tilde{x}}_{k}^{1} + b_{21} {\tilde{x}}_{1}^{2} + \dots + b_{2 k} {\tilde{x}}_{k}^{2} + \\ & \dots + b_{c 1} {\tilde{x}}_{1}^{c} + \dots + b_{ck} {\tilde{x}}_{k}^{c} \end{matrix}$ (6)

Equation (6) can be rewritten as:

$y = \tilde{X} B$ (7)

where B = [b₁₁, …, b_1k, b₂₁, …, b_2k, …, b_c1, …, b_ck]′. Solving Equation (7), we obtain:

$B = \left\{ \begin{matrix} (\tilde{X})^{- 1} y, & \text{if } \tilde{X} \text{ is a nonsingular matrix}; \\ ({\tilde{X}}^{'} \tilde{X} + γ I)^{- 1} {\tilde{X}}^{'} y, & \text{otherwise}. \end{matrix} \right.$ (8)

where γ is a positive constant and I is the identity matrix. Note that the $\tilde{X}$ in Equation (8) is different from that in TPTSSR. In TPTSSR, $\tilde{X}$ may contain different numbers of vectors from different classes. Since $\tilde{X}$ contains the same number of vectors from each class in BSRC, our method can relieve the loss caused by heavily under-represented minorities. After solving the linear equation, we obtain the sum of contributions for each class as follows:

$S_{i} = b_{i 1} {\tilde{x}}_{1}^{i} + \dots + b_{ik} {\tilde{x}}_{k}^{i}, \quad i = 1, \dots, c .$ (9)

Then we classify the test sample y as:

$H (y) = \arg\min_{i} \| y - S_{i} \|, \quad i = 1, \dots, c .$ (10)

In BSRC, a test sample y is classified as described in Algorithm 1.

Algorithm 1

Balanced Sparse Representation Classifier

Require:

The set of training samples, X;

The set of test samples, Y;

Ensure:

1: Use linear combination of the training samples to express the test data y for each class;

2: Solve the linear Equation (4) to get the contribution of every sample;

3: Choose the K nearest samples corresponding to K greatest contributions for each class;

4: Use linear combination of all the nearest samples to express y;

5: Solve the linear Equation (7) to get the contribution of every nearest sample;

6: Compute the sum of the contribution of each class, using Equation (9);

7: Classify y using Equation (10);
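The seven steps of Algorithm 1 can be put together in a compact NumPy sketch. It is meant as a reading aid under stated assumptions, not as the reference implementation: the names `ridge` and `bsrc_predict` are hypothetical, samples are stored as columns, the regularized solutions are always used in place of the matrix inverse, and K is capped at the class size.

```python
import numpy as np

def ridge(X, y, reg):
    """Solve (X'X + reg I) a = X'y; columns of X are samples."""
    return np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]), X.T @ y)

def bsrc_predict(class_mats, y, K, mu=0.01, gamma=0.01):
    """Return the index of the predicted class for the test sample y.

    class_mats is a list of (d, n_i) arrays, one per class.
    """
    neighbours = []
    for X_i in class_mats:
        A_i = ridge(X_i, y, mu)                               # steps 1-2, Eq. (4)
        dev = np.sum((y[:, None] - X_i * A_i) ** 2, axis=0)   # Eq. (5)
        k = min(K, X_i.shape[1])                              # assumption: cap K
        neighbours.append(X_i[:, np.argsort(dev)[:k]])        # step 3
    X_tilde = np.hstack(neighbours)                           # step 4, Eq. (7)
    B = ridge(X_tilde, y, gamma)                              # step 5, Eq. (8)
    residuals, start = [], 0
    for N_i in neighbours:
        B_i = B[start:start + N_i.shape[1]]
        residuals.append(np.linalg.norm(y - N_i @ B_i))       # step 6, Eq. (9)
        start += N_i.shape[1]
    return int(np.argmin(residuals))                          # step 7, Eq. (10)

# Toy imbalanced problem: 20 majority samples along the first axis,
# 3 minority samples along the second axis.
X_maj = np.vstack([np.linspace(0.5, 1.5, 20), np.zeros(20)])
X_min = np.array([[0.0, 0.0, 0.0],
                  [0.9, 1.0, 1.1]])

pred_minority = bsrc_predict([X_maj, X_min], np.array([0.3, 1.0]), K=2)
pred_majority = bsrc_predict([X_maj, X_min], np.array([1.0, 0.2]), K=2)
print(pred_minority, pred_majority)
```

Although the majority class has almost seven times as many samples, both classes enter the second phase with exactly K = 2 columns, which is the balancing effect described above.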

4 Balanced Kernel Sparse Representation

The kernel method provides nonlinear model analysis that increases the computational power of a linear classifier. After transforming the original feature space into the kernel space, the test sample y can be expressed using the transformed training data φ(x). In the transformed feature space, the test sample φ(y) can be expressed as:

$φ (y) = {X_{i}}_{φ} A_{i}$ (11)

where X_i is the subset of the training samples that belong to the ith class. Thus $⋃_{i = 1}^{c} X_{i} = X$ and X_i ∩ X_j = Ø (i, j = 1, 2, …, c, i ≠ j), where c is the number of classes. A_i is the sparse vector composed of the coefficients α_ij of the ith class.

Like BSRC, we also compute the deviation between the test sample and the contribution of each training sample,

$e_{j}^{i} = \| φ (y) - α_{ij} φ (x_{j}^{i}) \|^{2}$ (12)

We select the K nearest neighbors corresponding to the K smallest values of $e_{j}^{i}$ in class i. Combining all the neighbors from every class, we get the balanced training set $\tilde{X}$. On this data set, φ(y) is represented as ${\tilde{X}}_{φ} \tilde{A}$. Then Equation (10) can be rewritten as

$\begin{matrix} H (y) = & \arg\min_{i} e_{i} (y) \\ = & \arg\min_{i} ∥ φ (y) - {\tilde{X_{i}}}_{φ} {\tilde{A}}_{i} ∥ \\ = & \arg\min_{i} 〈 φ (y) - {\tilde{X_{i}}}_{φ} {\tilde{A}}_{i}, φ (y) - {\tilde{X_{i}}}_{φ} {\tilde{A}}_{i} 〉^{\frac{1}{2}} \\ = & \arg\min_{i} (κ (y, y) - 2 {\tilde{A}}_{i}^{'} κ_{{\tilde{X}}_{i}, y} + {\tilde{A}}_{i}^{'} K_{{\tilde{X}}_{i}} {\tilde{A}}_{i})^{\frac{1}{2}} \end{matrix}$ (13)

where κ(y, y) is the kernel function¹, defined as 〈φ(y), φ(y)〉, $κ_{{\tilde{X}}_{i}, y} = κ (y, {\tilde{X}}_{i}) = 〈 φ (y), φ ({\tilde{X}}_{i}) 〉$, and $K_{{\tilde{X}}_{i}} = 〈 φ ({\tilde{X}}_{i}), φ ({\tilde{X}}_{i}) 〉$. By using the kernel method, Algorithm 1 can be rewritten as Algorithm 2. As shown in Algorithm 1 and Algorithm 2, the procedures are easy to implement. Both can be scaled in real applications by selecting an appropriate number of neighbors, using strategies such as cross-validation [36][37] and adaptive search [38].
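The last step of Equation (13) follows from expanding the squared norm with the bilinearity of the inner product. The worked expansion below uses the notation of this section; writing the cross term with the transposed coefficient vector is our reading of the formula, chosen so that every term is a scalar.

```latex
\begin{aligned}
\| \varphi(y) - \tilde{X}_{i\varphi} \tilde{A}_i \|^2
 &= \langle \varphi(y), \varphi(y) \rangle
    - 2\, \tilde{A}_i' \langle \varphi(\tilde{X}_i), \varphi(y) \rangle
    + \tilde{A}_i' \langle \varphi(\tilde{X}_i), \varphi(\tilde{X}_i) \rangle \tilde{A}_i \\
 &= \kappa(y, y) - 2\, \tilde{A}_i' \kappa_{\tilde{X}_i, y}
    + \tilde{A}_i' K_{\tilde{X}_i} \tilde{A}_i .
\end{aligned}
```

Only kernel evaluations appear on the right-hand side, so φ never has to be computed explicitly. For a Gaussian kernel of the form exp(−‖x − y‖²), the first term is simply κ(y, y) = 1.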

Algorithm 2

Balanced Kernel Sparse Representation Classifier

Require:

The set of training samples, X;

The set of test samples, Y;

Ensure:

1: Compute the kernel matrices for X, Y;

2: Express the test data y for each class according to Equation (11);

3: Get the contribution $α_{ij} φ (x_{j}^{i})$ of every sample in the ith class;

4: Select K nearest samples for each class, using Equation (12);

5: Combine all the nearest samples to express $φ (y) = {\tilde{X}}_{φ} \tilde{A}$;

6: Classify y using Equation (13);
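The kernel residual used for classification in Algorithm 2 can be evaluated from kernel values alone, as the following sketch illustrates. Several parts are assumptions made for the example: the function names are hypothetical, the Gaussian kernel exp(−‖x − y‖²) is the one reported in the experiments, and the coefficients are obtained with the kernelized regularized solution (K + μI)⁻¹κ, which the text does not spell out. For brevity the sketch scores a whole class and omits the K-nearest selection step.

```python
import numpy as np

def gaussian_kernel(A, B):
    """K[a, b] = exp(-||A[:, a] - B[:, b]||^2); columns are samples."""
    sq = (np.sum(A ** 2, axis=0)[:, None]
          + np.sum(B ** 2, axis=0)[None, :]
          - 2.0 * A.T @ B)
    return np.exp(-np.maximum(sq, 0.0))       # clip tiny negative round-off

def kernel_class_residual(X_i, y, mu=0.01):
    """||phi(y) - phi(X_i) A_i|| computed without forming phi explicitly."""
    K = gaussian_kernel(X_i, X_i)                    # K_{X_i}
    k_y = gaussian_kernel(X_i, y[:, None])[:, 0]     # kappa_{X_i, y}
    # Assumed kernel form of the regularized solution for A_i.
    A_i = np.linalg.solve(K + mu * np.eye(K.shape[0]), k_y)
    sq = 1.0 - 2.0 * A_i @ k_y + A_i @ K @ A_i       # kappa(y, y) = 1 here
    return float(np.sqrt(max(sq, 0.0)))

y = np.array([0.0, 0.0])
X_near = np.array([[0.0, 0.1, 0.0],
                   [0.0, 0.0, 0.1]])   # class whose samples surround y
X_far = np.array([[3.0, 0.0, 3.0],
                  [0.0, 3.0, 3.0]])    # class far away from y

r_near = kernel_class_residual(X_near, y)
r_far = kernel_class_residual(X_far, y)
print(r_near, r_far)
```

The class containing samples close to y yields a residual near zero, while the distant class yields a residual close to 1, the norm of φ(y) under this kernel.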

¹ The kernel function is introduced to reduce computation when the data are mapped nonlinearly into a feature space by φ(x).

5 Experiments

In this section, we evaluate the BSRC and BKSRC algorithms empirically. First, we conduct experiments on visual and non-visual data sets with different imbalance rates. Then we investigate the performance of both methods on large data sets. As we will show, BSRC and BKSRC improve the performance over state-of-the-art methods on class imbalanced data sets.

5.1 Experiment on Visual Data Sets

The experimental data sets are three baseline face data sets: YaleB [39], PIE [40], and AR [41].

The Yale Face Database B contains 5760 single-light-source images of 10 subjects, each seen under 576 viewing conditions. Every subject has also been captured in a particular pose with ambient illumination.

The CMU PIE face database contains 41,368 images of 68 distinct subjects, each captured under 13 different poses, 43 different illumination conditions, and with 4 different expressions.

The AR face database includes 126 distinct subjects, each providing 26 color images with different facial expressions, illumination conditions, and occlusions.

Note that the target subject is used as the minority class, and all other subjects are used as the majority class. For example, 567 samples of one person in the YaleB data are used as the minority class samples, and the remaining (5760-567=5193) samples of the other 9 persons are used as the majority class samples. The target class is selected at random. Without loss of generality, 10 × 3-fold cross-validation was conducted. The value of K is determined by cross-validation [37].

To investigate the performance of BSRC and BKSRC (with the Gaussian kernel K (x, y) = exp (- ||x - y||²); both γ and μ were set to 0.01), we compare them with KNN [42], Resampling [43], TPTSSR [34], WTPTSSR [35], Fast-CBUS [27], psK-means [28], and Ba-RM [29]. We compare the performances using the area under the receiver operating characteristic curve (AUC) [44], which is a good single metric when dealing with the class imbalance problem in various applications [44][45][46]. Our methods achieve very good performances when a suitable number of nearest neighbors is used to represent the test samples.
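For readers unfamiliar with the metric, the AUC can be computed directly from its rank interpretation: it is the probability that a randomly chosen minority (positive) sample receives a higher score than a randomly chosen majority (negative) sample, with ties counted as one half. The sketch below is a generic illustration with made-up scores; the function name `auc_score` is hypothetical and the code is not taken from the experimental pipeline.

```python
import numpy as np

def auc_score(scores, labels):
    """AUC via pairwise comparison of positive and negative scores."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]        # every positive/negative pair
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (pos.size * neg.size))

# Illustrative scores: 2 minority (label 1) and 4 majority (label 0) samples.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 0, 1, 0, 0, 0]
auc = auc_score(scores, labels)

# A classifier that gives every sample the same score carries no ranking
# information, regardless of how skewed the classes are.
auc_constant = auc_score([0.5, 0.5, 0.5], [1, 0, 0])
print(auc, auc_constant)
```

Unlike accuracy, this value does not improve when a classifier simply favors the majority class, which is why it suits imbalanced problems.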

Fig.1

Performances compared on YaleB.

Fig.2

Performances compared on PIE.

Fig.3

Performances compared on AR.

From Fig. 1, we can see that BSRC and BKSRC are superior to the other algorithms on the YaleB data set as the number of nearest neighbors increases, except when this number is smaller than 125. Even in these cases, the performance of BSRC is comparable to that of TPTSSR, WTPTSSR, Fast-CBUS, psK-means, and Ba-RM.

Fig. 2 clearly shows that our methods consistently obtain much higher AUC values than all the other sparse representation algorithms and dedicated imbalance methods on the PIE data set. Both figures show that our algorithms can improve the AUC performance when a suitable number of nearest neighbors is provided. That is to say, the quality of the representation of the test data depends not only on the algorithm but also on how well each class is represented and on the absolute number of samples in each class.

For the AR data, BSRC and BKSRC are superior to the other algorithms once the number of nearest neighbors reaches one hundred, as shown in Fig. 3. These differences in improvement may be caused by a variety of factors, such as the quality of the different face data, the number of samples in each class, the representation ability of each class, and so on. These results also suggest that our methods perform well when few samples are available, since AR has fewer samples per class than YaleB and PIE.

Table 1

Information of KEEL data sets

data	#size	#features	Class (min., maj.)	% Class (min., maj.)	imbalance rate
Data-sets with low imbalance rate (1.5-3 imbalance rate)
Glass2	214	9	(build windownon_floatproc, remainder)	(35.51, 64.49)	1.82
EcoliCP-IM	220	7	(im, cp)	(35.00, 65.00)	1.86
Wisconsin	683	9	(malignant, benign)	(35.00, 65.00)	1.86
Pima	768	8	(testedpositive, testednegative)	(34.84, 66.16)	1.9
Iris1	150	4	(IrisSetosa, remainder)	(33.33, 66.67)	2
Glass1	214	9	(buildwindowfloatproc, remainder)	(32.71, 67.29)	2.06
Yeast2	1484	8	(NUC, remainder)	(28.91, 71.09)	2.46
Vehicle2	846	18	(Saab, remainder)	(28.37, 71.63)	2.52
Vehicle3	846	18	(bus, remainder)	(28.37, 71.63)	2.52
Vehicle4	846	18	(Opel, remainder)	(28.37, 71.63)	2.52
Haberman	306	3	(Die, Survive)	(27.42, 73.58)	2.68
Data-sets with medium imbalance rate (3-9 imbalance rate)
GlassNW	214	9	(non-window glass,remainder)	(23.83, 76.17)	3.19
Vehicle1	846	18	(van, remainder)	(23.64, 76.36)	3.23
Ecoli2	336	7	(im, remainder)	(22.92, 77.08)	3.36
New-thyroid3	215	5	(hypo, remainder)	(16.89, 83.11)	4.92
New-thyroid2	215	5	(hyper, remainder)	(16.28, 83.72)	5.14
Ecoli3	336	7	(pp, remainder)	(15.48, 84.52)	5.46
Segment1	2308	19	(brickface, remainder)	(14.26, 85.74)	6.01
Glass7	214	9	(headlamps, remainder)	(13.55, 86.45)	6.38
Yeast4	1484	8	(ME3, remainder)	(10.98, 89.02)	8.11
Ecoli4	336	7	(iMU, remainder)	(10.88, 89.12)	8.19
Page-blocks	5472	10	(remainder, text)	(10.23, 89.77)	8.77
Data-sets with high imbalance rate (higher than 9 imbalance rate)
Vowel0	988	13	(hid, remainder)	(9.01, 90.99)	10.1
Glass3	214	9	(Ve-win-float-proc, remainder)	(8.78, 91.22)	10.39
Ecoli5	336	7	(om, remainder)	(6.74, 93.26)	13.84
Glass5	214	9	(containers, remainder)	(6.07, 93.93)	15.47
Abalone9-18	731	8	(18, 9)	(5.65, 94.25)	16.68
Glass6	214	9	(tableware, remainder)	(4.20, 95.80)	22.81
YeastCYT-POX	482	8	(POX, CYT)	(4.15, 95.85)	23.1
Yeast5	1484	8	(ME2, remainder)	(3.43, 96.57)	28.41
Yeast6	1484	8	(ME1, remainder)	(2.96, 97.04)	32.78
Yeast7	1484	8	(EXC, remainder)	(2.49, 97.51)	39.16
Abalone19	4174	8	(19, remainder)	(0.77, 99.23)	128.87

5.2 Experiment on Non-visual Data Sets with Different Imbalance Rates

To investigate the performance of the BSRC method further, we also conduct experiments on non-visual data sets from the KEEL data set repository [47], which are used as baseline data in class imbalance research, as shown in Table 1.
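The imbalance rate column of Table 1 is consistent with the ratio of the majority class share to the minority class share. The table does not state this definition explicitly, so the formula in the following sketch is inferred from the listed values, and the function name is hypothetical.

```python
def imbalance_rate(minority_share, majority_share):
    """Ratio of majority to minority class size (or percentage share)."""
    return majority_share / minority_share

# Percentage shares (min., maj.) taken from Table 1.
glass2 = round(imbalance_rate(35.51, 64.49), 2)
iris1 = round(imbalance_rate(33.33, 66.67), 2)
abalone19 = round(imbalance_rate(0.77, 99.23), 2)
print(glass2, iris1, abalone19)
```

These three values match the Table 1 entries for Glass2, Iris1, and Abalone19 (1.82, 2, and 128.87).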

Table 2
AUC performance comparisons for each data set with low imbalance rate

data set	KNN	Resampling	TPTSSR	WTPTSSR	FastCBUS	psKmeans	BaRM	BSRC	BKSRC
Glass2	0.68	0.63	0.65	0.66	0.67	0.63	0.69	0.66	0.70
EcoliCP-IM	0.67	0.70	0.62	0.67	0.69	0.63	0.68	0.71	0.69
Wisconsin	0.67	0.71	0.61	0.65	0.72	0.65	0.65	0.69	0.72
Pima	0.65	0.68	0.63	0.59	0.67	0.69	0.64	0.65	0.68
Iris1	0.62	0.69	0.64	0.63	0.68	0.67	0.60	0.67	0.63
Glass1	0.63	0.62	0.65	0.66	0.64	0.69	0.63	0.65	0.69
Yeast2	0.68	0.70	0.67	0.70	0.69	0.73	0.72	0.73	0.75
Vehicle2	0.59	0.65	0.63	0.69	0.68	0.68	0.78	0.69	0.79
Vehicle3	0.58	0.60	0.65	0.70	0.73	0.69	0.73	0.75	0.80
Vehicle4	0.60	0.65	0.63	0.68	0.63	0.70	0.68	0.69	0.71
Haberman	0.65	0.70	0.67	0.64	0.69	0.71	0.69	0.68	0.70
BSRC→w/t/l	9/1/1	7/0/4	10/1/0	8/2/1	6/0/5	6/2/3	8/0/3	–	2/0/9
BKSRC→w/t/l	11/0/0	7/2/2	10/0/1	10/1/0	8/2/1	7/1/3	11/0/0	9/0/2	–

Table 3

AUC performance comparisons for each data set with medium imbalance rate

data set	KNN	Resampling	TPTSSR	WTPTSSR	FastCBUS	psKmeans	BaRM	BSRC	BKSRC
GlassNW	0.68	0.68	0.64	0.62	0.63	0.68	0.73	0.69	0.70
Vehicle1	0.63	0.60	0.63	0.65	0.65	0.62	0.66	0.70	0.73
Ecoli2	0.64	0.70	0.62	0.64	0.65	0.66	0.64	0.68	0.71
New-thyroid3	0.63	0.65	0.65	0.69	0.71	0.66	0.65	0.69	0.70
New-thyroid2	0.61	0.61	0.63	0.62	0.64	0.69	0.66	0.68	0.69
Ecoli3	0.58	0.65	0.66	0.68	0.69	0.69	0.70	0.72	0.68
Segment1	0.59	0.66	0.64	0.65	0.73	0.72	0.70	0.72	0.77
Glass7	0.61	0.63	0.64	0.67	0.69	0.70	0.73	0.69	0.74
Yeast4	0.65	0.66	0.69	0.72	0.75	0.72	0.74	0.71	0.76
Ecoli4	0.65	0.62	0.63	0.64	0.67	0.71	0.73	0.71	0.70
Page-blocks	0.65	0.69	0.73	0.75	0.69	0.70	0.74	0.77	0.75
BSRC→w/t/l	11/0/0	10/0/1	11/0/0	9/1/1	7/1/3	6/2/3	7/0/4	–	3/0/8
BKSRC→w/t/l	11/0/0	10/0/1	11/0/0	9/2/0	9/0/2	8/1/2	8/0/3	8/0/3	–

Table 4

AUC performance comparisons for each data set with high imbalance rate

data set	KNN	Resampling	TPTSSR	WTPTSSR	FastCBUS	psKmeans	BaRM	BSRC	BKSRC
Vowel0	0.63	0.65	0.67	0.66	0.69	0.73	0.72	0.70	0.74
Glass3	0.60	0.69	0.62	0.65	0.71	0.71	0.70	0.72	0.73
Ecoli5	0.72	0.65	0.66	0.68	0.69	0.70	0.69	0.70	0.71
Glass5	0.67	0.68	0.67	0.69	0.71	0.69	0.70	0.73	0.72
Abalone9-18	0.59	0.65	0.62	0.69	0.63	0.69	0.70	0.75	0.77
Glass6	0.59	0.69	0.64	0.65	0.68	0.71	0.75	0.75	0.73
YeastCYT-POX	0.70	0.72	0.73	0.71	0.69	0.68	0.70	0.76	0.69
Yeast5	0.65	0.72	0.73	0.69	0.78	0.70	0.74	0.75	0.78
Yeast6	0.66	0.74	0.73	0.64	0.69	0.71	0.71	0.72	0.80
Yeast7	0.69	0.70	0.73	0.74	0.69	0.72	0.73	0.79	0.73
Abalone19	0.55	0.69	0.65	0.66	0.73	0.71	0.72	0.73	0.73
BSRC→w/t/l	10/0/1	10/0/1	10/0/1	11/0/0	9/1/1	9/1/1	9/1/1	–	4/1/6
BKSRC→w/t/l	9/0/2	10/0/1	9/1/1	9/0/2	8/3/0	11/0/0	8/1/2	6/1/4	–

On all the data sets with different imbalance rates, BKSRC is the best among the nine methods compared. The w/t/l rows at the bottom of each table mean that the method named in the row (BSRC or BKSRC) wins on w data sets, ties on t data sets, and loses on l data sets, compared with the algorithm in the corresponding column. We summarize the highlights as follows.

From Tables 2, 3 and 4, we can see that BSRC and BKSRC win in more than 6 out of 11 cases. Both methods have the best performances on 8 data sets with low imbalance rate, as shown in Table 2. The original sparse representation methods, TPTSSR and WTPTSSR, each win in only 1 case, while BKSRC has the best performance on more than 6 data sets. BKSRC has better performances on more than 8 data sets with medium imbalance rate, as shown in Table 3. Compared with every other method except BKSRC, BSRC has better performances on more than 6 data sets. From Table 4, we can see that BSRC loses at most 1 case against every method except BKSRC, while BKSRC loses at most 2 cases against every method except BSRC.

These differences in improvement may be caused by a variety of factors, such as the sample size, the number of features, the imbalance rate, and so on. The results indicate that BSRC and BKSRC yield the best performance, owing to a more suitable information representation for imbalanced data classification.
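The w/t/l summaries can be recomputed from the AUC columns with a simple pairwise count. The sketch below assumes that a win, tie, or loss is decided by directly comparing the two-decimal AUC values of each data set; the function name `win_tie_loss` is hypothetical. The numbers are the KNN, BSRC, and BKSRC columns of Table 2.

```python
def win_tie_loss(ours, theirs, tol=1e-9):
    """Count data sets where `ours` is higher, equal, or lower than `theirs`."""
    wins = sum(a > b + tol for a, b in zip(ours, theirs))
    losses = sum(a < b - tol for a, b in zip(ours, theirs))
    ties = len(ours) - wins - losses
    return wins, ties, losses

# AUC columns of Table 2 (low imbalance rate), in row order.
knn = [0.68, 0.67, 0.67, 0.65, 0.62, 0.63, 0.68, 0.59, 0.58, 0.60, 0.65]
bsrc = [0.66, 0.71, 0.69, 0.65, 0.67, 0.65, 0.73, 0.69, 0.75, 0.69, 0.68]
bksrc = [0.70, 0.69, 0.72, 0.68, 0.63, 0.69, 0.75, 0.79, 0.80, 0.71, 0.70]

bsrc_vs_knn = win_tie_loss(bsrc, knn)
bksrc_vs_knn = win_tie_loss(bksrc, knn)
print(bsrc_vs_knn, bksrc_vs_knn)
```

Under this assumption the counts reproduce the KNN entries of the two summary rows in Table 2, namely 9/1/1 for BSRC and 11/0/0 for BKSRC.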

5.3 Experiment on Large Data Sets

The CIFAR-10 and CIFAR-100 are labeled subsets of the 80 million tiny images dataset. They were collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton [48].

The CIFAR-10 dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly selected images from each class.

CIFAR-100 is just like CIFAR-10, except that it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class.

Fig.4

Performances compared on CIFAR-10.

Fig.5

Performances compared on CIFAR-100.

To investigate the scalability of the proposed method, we chose these two data sets because both contain a large number of samples. Here, we also give a brief analysis of the computational complexity of solving the equations. The global sparse representation matrix can be obtained by solving Equation (4). However, the global sparse representation of a sample is searched for over the whole sample set, and its computational complexity is O (t² (N - 1)), where t is the number of nonzero weights and (N - 1) is the number of samples in the set. Since the number of samples in this set is much less than that in real applications, the computational complexity is approximately O ((N - 1) ³).

Compared with the global sparse representation, the local sparse representation method only needs to find sparse representation points in the neighborhood in Equation (8) rather than in the whole data set. Its theoretical computational complexity for a sample is therefore O (t²K_max), where t ⪡ K_max ⪡ (N - 1). This step has far lower computational complexity than the global sparse representation.

From the experiments, we can see that the performances of BSRC and BKSRC are better than those of the other methods, as shown in Figs. 4 and 5. This means that the proposed method remains a promising extension of sparse representation on large data sets.

6 Conclusion

We have proposed a new sparse representation learning algorithm, BSRC, for imbalanced data classification. Experiments on popular data sets demonstrate the promising performance of our method. This is attributed to two prominent characteristics of BSRC. First, it inherits the merit of sparse representation, which can avoid under-fitting and over-fitting. Second, since it is designed to relieve the loss caused by class imbalance, it improves the performance of the classifier on class imbalanced data. Moreover, we extend the proposed classifier to a kernelized version (BKSRC) with the kernel trick. This improvement is more suitable for real-world data, whether or not the data set is balanced.

In the future, we will study how to combine other sparse representation learning methods with sample-level and algorithm-level methods, and investigate the influence of class imbalance on classification using more data sets. This is an interesting issue to explore, and it might shed light on the design of more powerful classification algorithms.

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 61502404, 61573297, 61672442), the Natural Science Foundation of Fujian Province of China (Grant Nos. 2016J01326, 2016Y0079), the Distinguished Young Scholars Foundation of Fujian Educational Committee (Grant No. DYS201707), and the International S&T Cooperation Program (Grant No. E201402000). We thank the anonymous reviewers for their very helpful comments.
