Abstract
The classification problem of imbalanced datasets has received much attention in recent years. This imbalance problem usually occurs in when the ratio between classes is high. Many techniques have been developed to tackle the imbalance problem in supervised learning. The Synthetic Minority Over-sampling Technique (SMOTE) is one of the most effective over-sampling methods processing this problem, which changes the distribution of training sets to balance the different number of examples of each class. However, SMOTE randomly synthesizes the minority instances along a line joining a minority instance and its selected nearest neighbors, ignoring nearby majority instances and isolated points, which would affect the final classification result. In this paper, we propose two improved techniques based on SMOTE through sparse representation theory. This extension results in Sparse-SMOTE and SROT (Sparse Representation Based Over-Sampling Technique). The Sparse-SMOTE replaces the k-nearest neighbors of the SMOTE with sparse representation, and the SROT uses a sparse dictionary to create a synthetic sample directly. The experiments are performed on 10 UCI datasets using C4.5 as the learning algorithm. The experimental results show that both proposed methods can achieve better performance on TP-Rate, F-Measure, G-Mean and AUC values. Moreover, the results show that our new proposals’ perform is more effective compared with SMOTE and some other approaches.
Introduction
In real life, samples of datasets are usually imbalanced, meaning that some classes have fewer samples than others. Most of existing data is imbalanced, while specific data satisfying balance requirements is difficult to acquire, and is, therefore, very scarce. The Imbalance Ration (IR), defined by the number of the majority class divided by the number of the minority class, expresses to which extent a dataset is imbalanced: a dataset with IR equal to 1 is perfectly balanced, the higher the IR, the more imbalanced the dataset [1]. Imbalanced datasets is very common in network intrusion, medical diagnosis, text classification, and other practical applications [2]. Many kinds of situations can be converted to binary classification problems, even though most of datasets have multi-class attributes [4], so this work focuses on binary classification problems (the minority versus the majority class) with examples randomly and uniformly distributed in the two-dimensional real-value space.
At present, effective solutions of this class imbalanced problem can take three types of strategies: data pre-processing, algorithm and prediction post-processing [5]. Among the three strategies, the algorithm based on data side is more popular due to its independence and adaptability. The SMOTE algorithm, as one of the most effective over-sampling methods processing imbalanced problem, has a powerful influence. SMOTE algorithm generates synthetic minority examples to over-sample the minority class. For every minority example, its k nearest neighbors of the same class are calculated, then some examples are randomly selected from them according to the over-sampling rate [6]. After that, new synthetic examples are generated along the line between the minority example and its selected nearest neighbors. But inappropriate k values lead to the incorporation of the minority of samples into the majority of samples, resulting in noise that hinders the classification [7, 8]. Additionally, SMOTE algorithm is only interpolated between minority class samples and does not fundamentally change the sparsity of the sample distribution. The result of the interpolation is that the minority class sample of the dense place is still relatively dense, the minority class sample of the sparse place is still relatively sparse [9, 10]. In addition, SMOTE does not take the isolated points and noise into account. In this way, in the sparse region of the sample, the classification algorithm is not easy to identify, prone to misclassification. Thus, the classification effect of SMOTE on some sparse samples is not obvious [11].
Accordingly, we propose two improved algorithms to attempt to solve imbalance problem: the Sparse-SMOTE and SROT (Sparse Representation-Based Over-Sampling Technique) in this paper. Our research is inspired by the widely used SMOTE algorithm, as well as the sparse representation theory. A significant research task is to use sparse representation instead of k nearest neighbors based on the framework of Compressive Sensing (CS). The methods based on sparse representation extract samples from all datasets and obtains the correlation coefficients. Then, the correlation coefficient is used to classify the datasets. Theoretically, the sparse representation method is not limited by dataset and it has been proved to solve a similar but constrained problem well [12, 13]. First, we use minority class samples to construct a sparse dictionary for obtaining the sparse solution of current points by solving
The Sparse-SMOTE algorithm uses the nonzero sparse solution instead of the k-nearest neighbor algorithm used in the SMOTE algorithm, but SROT modifies the sparse solution using some disturbance, and obtains new samples using the sparse dictionary. Experimental results show that our algorithms can effectively improve the classifier to distinguish imbalanced datasets.
The structure of this paper is organized as follows. Section 2 briefly describes related works for handling the class imbalanced problem. Section 3 describes our improved over-sampling methods on resolving the imbalanced problem. Section 4 presents the experiments and result. Section 5 draws the conclusion and future work.
Related works
The strategy based on data pre-processing is mainly to solve the imbalance problem by changing the size of such imbalance training datasets under the four criteria proposed by Breiman [14], namely, to resample the datasets to enforce the majority and minority classes back to balance. Resampling techniques include under-sampling and over-sampling techniques. The Random under-sampling technique selects a subset of the majority class in a certain proportion, and removes it from the original dataset to reduce the majority class in the original data. The Random over-sampling technique copies minority class samples into the minority class in order to increase the number of the minority class. With the further research, there has been several important approaches to the problem of imbalanced domains and different results reported when using over-sampling strategies with decision trees to address the problem of imbalanced domains [15]. For instance, in [16] over-sampling is reported to be ineffective, while in other studies (e.g. [17]) over-sampling is reported to have advantages when considering under-sampling strategies. Thus, random under-sampling, redux and introduction of Gaussian Noise are among the strategies that have been discussed. The SMOTE (Synthetic Minority Over-Sampling Technique) is the most representative over- sampling technique, proposed by Chawla et al. in 2002 [6]. Moreover, although SMOTE is a strategy developed for generating synthetic examples of the minority class, this strategy was combined with random under-sampling in the paper where it was proposed. Besides, the SMOTE contains many other modified algorithms [11], such as the Borderline-SMOTE [18], SMOTE-TL [19], SMOTE-D [20], SMOTE-IPF [21], etc. The Random over-sampling technique can improve the ability of classifiers because it can decrease the imbalance degree of the sample space. However, the disadvantages of Random sampling technique cannot be ignored for simply copying the original data as a large number of repeated samples, which makes the classifier decision area too small. This operation eventually leads to the overfitting phenomenon, thus seriously reduces the generalization performance of the algorithm.
The strategy based on the algorithm side improves existing machine learning algorithms, and attaches more importance on the minority class. It mainly focuses on improving the classification algorithms for minority of concern, including cost-sensitive learning [22], ensemble learning [23], single-class learning [24], etc. In the imbalance classification problem, the cost of different categories of misjudging is not equal. If the minority class were handled as the majority class by mistake, it would incur more serious costs. The strategy based on the prediction post-processing considers two main types of solutions: threshold method [25] and cost-sensitive post-processing [20]. Prediction post-processing approaches use the original dataset and a standard learning algorithm only manipulating the predictions of the models according to the user preferences and the imbalance of the data. As advantages, it is not necessary to be aware of the user preferences biases at learning time. But the models cannot reflect the user preferences [5].
Methodology
As mentioned before, applying a preprocess step in order to balance the class distribution is a positive solution to the imbalance problem. Specifically, we improve over-sampling techniques with sparse representation. In this section, first, the main details of Compressive Sensing theory and Sparse Representation theory are given in Section 2.1. Next, the motivational analysis of using sparse representation to improve the over-sampling technique is described in Section 2.2. Then, two implicated methods are described in depth: the Sparse-SMOTE algorithm in Section 2.3, and SROT algorithm in Section 2.4.
Compressive sensing and sparse representation
Compressed Sensing (also known as Compressive Sensing, Compressive Sampling, or Sparse Sampling) is a signal processing technique for efficient acquisition and reconstruction of a signal, by finding solutions of underdetermined linear systems, detailed in reference [26]. The Shannon/Nyquist sampling theorem [27] shows that if a function
The theory of Compressive Sensing mainly includes three part [26, 28]:
Sparse representation of the signal. Design of the measurement matrix, in order to reduce the dimension while ensuring the minimum loss of information of the original signal Design of the signal recovery algorithm, using
And the procedure of Compressive Sensing can be summarized as the following steps:
Assuming that the signal Find an observation base Observe the original signal with the observation base Using the optimization method to recover X from the high probability of the observed value.
Next, the theory of Compressive Sensing and sparse representation is explained from the mathematical point of view.
Assuming that
Schematic of sparse representation.
However, if we make
Then, we can obtain the following optimization target as follows:
In 2004, Donoho and Elad have proved mathematically that when A satisfies certain conditions, the Eq. (1) has a unique solution, referred in [29]. Still, the
Using this, the framework of Compressive Sensing (CS) has been formed originally. In reality, we usually use Eq. (3) instead of Eq. (2) to take noise into account.
Where A represents the sparse dictionary, x is called the sparse solution, and
The goal of the reconstruction algorithm is to find the solution of x, where the core of the whole problem is the sparse representation of y.
The real signal that exists in nature is generally not sparse, but rather sparsely in a transformation domain, which is a compressible signal. Or that theoretically any signal is compressive, as long as the corresponding sparse space can be found. The sparsity or compressibility of the signal is an important prerequisite and theoretical basis for compression perception [31]. The optimization model of sparse representation is designed from the perspective of signal reconstruction, but now sparse representation has made a great performance in the pattern recognition and artificial intelligence field [32]. For example, the Sparse-based Representation Classification algorithm (SRC) [33, 34] has achieved a great success in face recognition, image denoising and other aspects.
Thus, inspired by Compressive Sensing and SRC, this paper considers processing the imbalanced problem by sparse representation. When we need to synthesize the new samples of minority class, we need to solve the sparse solution in order to find the corresponding samples of minority class, and then produce a more evenly distributed samples of minority class by interpolation or increasing Gaussian noise.
Sparse-SMOTE
In this subsection, we improved SMOTE algorithm by using sparse representation instead of k nearest neighbors, and the technique is called Sparse-SMOTE.
In our method, the construction of a sparse dictionary is an important work. At present, there has been two types of construction method: human construction and training learning. The former contains the isotropic Gabor dictionary [35], and anisotropic Refinement-Gaussian dictionary [36] etc. The latter contains the dictionary learning algorithm, K-SVD [37]. We use training samples directly to construct the sparse dictionary in this paper.
First, a training set is provided, and then all of minority class samples are detached from that training set. The minority class is
Next, we normalizes every sample, and calculate their
Where
After obtaining
We calculate the solution of Eq. (5) by using the Homotopy algorithm, which is the fastest and best algorithm [38]. The Homotopy algorithm considers the following basic problem for lower noise.
Where
After obtaining the sparse solution, we randomly select the sample points represented by the nonzero terms in
For example, we obtain the subscript of nonzero sparse solution of
Where
Then we can synthesize a new sample as Eq. (8).
Where
(a) Original data distribution. (b) Current point 
(a) Original data distribution. (b) Current point 
The Sparse-SMOTE considers interpolation of the sparse solution, because the sparse solution may be not located around the current point. So, the synthetic samples obtained by interpolation are distributed more uniformly. At the same time, the sparse solution keeps minority class information and improves the performance of the algorithm. The overall procedure of the Sparse-SMOTE is described in Algorithm Sparse-SMOTE. In Line 11 to Line 15, Sparse-SMOTE obtains the subscript of nonzero sparse solution of Point_i instead of computing the k nearest neighbors for Point_i. In order to do that, we design a function named Get_Sparse_Index to calculate the subscript in Line 29 to Line 31. Finally, in Line 18 to Line 27, Sparse-SMOTE generates the synthetic samples by interpolation. Figure 3 shows the composition of the Sparse-SMOTE. (a) shows the original data distribution with the ratio of minority class to the majority class equals to 20:400. (b) marks the current point
As described before, the Sparse-SMOTE generates the synthetic samples by interpolating the sparse solution, but in this section, we attempt to use sparse dictionary directly to generate new samples. The method is called SROT.
First, we build the sparse dictionary
where
where
In addition, we can get better results if we modify a part of the nonzero term in
Finally, SROT generates the synthetic samples by final calculated sparse dictionary. The overall procedure of the SROT is described in Algorithm SROT. By experiments, when
(a) Original data distribution. (b) Current point 
The data distribution is the same as before shown in Fig. 4a. SROT calculates the sparse solution of every sample point of minority class samples, and adds random Gaussian noise to sparse solutions in order to disturb sparse solutions. Then, new sample data points are synthesized by the generated sparse dictionary. New synthetic sample points are marked using pentagrams in (b). The final synthetic result is shown in (c).
Compared with the result of the SMOTE and the result of Sparse-SMOTE, we can see that the sample points generated by the SROT are distributed more randomly, even though a few sample points are out of the distribution area of the original samples, being outlier point, injecting some noise.
On the one hand, of some datasets, SROT generates some noise points, resulting that some minority samples can be overlapped with the boundary of the majority due to noise and the distribution randomness, which has bad effects on the performance of total classification, But on the other hand, of some datasets, some parts of the datasets are relatively sparse due to the lack of minority class samples. SMOTE can reduce the problem to some degree but not solve this sparse problem, while SROT can make distribution of samples more uniform. Certainly, a more uniform data distribution does not always improve the final performance of classification, which depends on the importance of the regional sparsity relative to overall distribution.
Experimental design
In this section, we briefly describe the experimental datasets and statistical tests used alongside the experimental study. We use C4.5 [39] as the learning algorithm for the experimental study, which has been identified as one of the 10 top algorithms in Data Mining [40] and has been widely used in imbalance problems [41]. The experiment uses TP-Rate (True Positive Rate) [42], F-Measure [43], G-Mean (Geometric Mean) [44] and AUC (Area Under the ROC curve) [45] as the evaluation parameters.
Table 1 shows the description of these data-sets. For each one, the number of examples (#Ex.), attributes (#Attr.), name of each class (minority and majority), class distribution and the IR.
Description of the data-sets used in experiment
Description of the data-sets used in experiment
The experimental data sets in Table 1 contain both real-world data sets and training-testing data sets. The choice of the real-world datasets was based on the work on imbalanced classification with noisy and borderline examples presented in [46, 47]. The attribute type of data sets includes Nominal and Continuous marked by N and C, respectively. We cannot obtain nominal properties by calculations, so that a vectorization method is used. Vectorization refers to using a k-dimensional vector instead of the nominal attribute containing k kinds of values. For example, if any nominal attribute has three values: A, B and C, then we use a vector such as
The problem of multi-class classification can be converted to a binary classification problem to determine the target value. The samples conforming to the target value are classified as minority class while the others are majority class. We can obtain the target value of each dataset and its imbalanced degree shown in Table 1. However, some original UCI datasets are distributed regularly in some parts, so we need to break the concentrated samples, and vectorize nominal attributes before experiments. We also hope to restore the original distribution of the data, because our research is related to the data strategy. Therefore, we round the data to be integer after synthesis.
Finally, for each method, all these measures are estimated by stratified 10-fold cross validation. In order to decrease the randomness in different methods, all values of evaluation measures are the average of five independent 10-fold cross validation experiments.
We must point out that all experiments are conducted in MATLAB based on the Weka machine learning and data mining software. The classifier algorithm adopts the J48 algorithm (C4.5 decision tree) of the Weka [19] following the recommended parameter values in this platform. But the parameters of the SROT are variable and controllable. The parameters setup for the implementation of SROT used in this work has been determined experimentally in order to better fit it to the characteristics of imbalanced datasets with noisy examples. In our experiment, we let
Tables 2–5, Figs 5–8 show the TP-rate values, F-Measure values, G-Mean values, and AUC values of the five algorithms on ten datasets. The ratio of synthesis is set to the value which makes the dataset balanced. For instance, for Pima dataset whose imbalance degree is 1.9, its ratio of synthesis is set to 100. It can be relatively balanced after samples composition, when the ratio of the minority class to the majority class is 536/500.
Comparison of the TP-rate for five algorithms for test
Comparison of the TP-rate for five algorithms for test
Comparison of the F-Measure for five algorithms for test
TP-rate value comparison for five algorithms on ten datasets.
Comparison of the G-Mean for five algorithms for test
Comparison of the AUC for five algorithms for test
F-Measure value comparison for five algorithms on ten datasets.
G-Mean value comparison for five algorithms on ten datasets.
It can be seen that four kinds of sampling techniques can enhance classification performance on imbalanced datasets with the classifier compared to the original data processing. But the Random Over-Sampling leads to classifier over-fitting, resulting in limited performance improvement, which is proved in the experiments. On the other hand, the SMOTE can avoid over-fitting [10], and improve the performance of the imbalance problem using the over-sampling technique and interpolation theory. But performance on datasets using the SMOTE is not always better compared to the Random Over-Sampling. For example, Abalone and Ionosphere datasets show poor performance when using the SMOTE. The experiments are proved that there is no absolutely best algorithm to deal with all imbalanced datasets.
AUC value comparison for five algorithms on ten datasets.
Of the two new algorithms proposed in this paper, the performance of the Sparse-SMOTE is more stable. The Sparse-SMOTE significantly outperforms the SMOTE on seven datasets of the 10 test data samples, especially in Abalone, Balance, Pima and Satimage. Even for the other three datasets, its performance is also very close to the SMOTE. Taking randomness into consideration, we conclude that the Sparse-SMOTE is superior to the SMOTE in total.
However, the performance of the SROT is not so stable. The SROT outperforms the SMOTE on almost half of the datasets, including in Abalone, Balance and Mf-zernike. Especially on Balance dataset, the SROT significantly outperforms both SMOTE and Sparse-SMOTE. However, its performance is less effective for Pima, Satimage, Vehicle and Wphc. In spite of this, the SROT algorithm can improve the classification results of the imbalance problem.
According to the experimental results, we can obtain the following conclusion: the randomness used in the SROT can distribute samples more uniformly, enlarge the decision-making level, and improve the classification performance, which benefits sparse datasets which are short of typical data. However, excessive randomness leads to the minority class being overlapped with the majority class, generating noise. The SROT algorithm is more suitable for those classifiers that fail completely due to imbalanced datasets, such as Abalone, Balance etc. There is one interesting phenomenon found in our experiments. Not all of the imbalance causes serious damage to the performance of the classifier, for example, Letter dataset does not. Its degree of imbalance is 24.3, the most imbalanced of the 10 datasets, but classifiers show good recognition performance on it. This kind of phenomenon also exists for Ionosphere dataset, where the original data without any procession shows even more outstanding results compared to other sampling algorithms. So, the classifier may not get any benefits, even interfered by artificial data, if we use the over-sampling technique to process such datasets. These experimental results and relevant analysis teach us to pay more attention to the complexity of the imbalance problem, and show limitations of the over-sampling technique.
Ranking obtained through Friedmanâs test
Average AUC values over all datasets in Table 1 for each method.
In order to verify the effectiveness of our algorithms, we compare our methods with other methods for over sampling. In Fig. 9, the average AUC values over all 10 datasets of Table 1 are given for each method. It can be seen that Sparse-SMOTE and SROT improve the SMOTE, SMOTE-BL1 and SMOTE-BL2 [18] quite well. On the one hand, Sparse-SMOTE is approximately same as SMOTE-TL, but is not better than SMOTE-FRST [1] and SMOTE-RSB [48]. On the other hand, SROT improves SMOTE-TL and SMOTE-RSB, but do not improve SMOTE-FRST. Nevertheless, Fig. 9 show that if we use sparse representation for over-sampling, we still obtain good results.
In order to compare the results, we perform a statistical analysis conducted by non-parametric multiple comparison procedures [49] to find better preprocessing algorithms. We use Friedman’s procedure to compute the set of ranks that represent the effectiveness associated with each algorithm. In Table 6 we can observe that our proposals have great ranking.
We consider the synthetic ratio to make the values of the datasets balanced roughly in our experiments. But this value may not be the best ratio [5, 43]. This problem is discussed in this section. For instance, Fig. 10 to Fig. 13 show the changes in results of four synthesis techniques when the synthetic ratio changes of Balance dataset. In the figures, we use dotted lines to mark the former experimental ratio of 1100, which means that each sample point can compose 11 new samples.
Change of TP-Rate value with the synthetic ratio.
Change of F-Measure value with the synthetic ratio.
From Fig. 10 to Fig. 13, it is clear that the four algorithms perform better outside of the dotted line, which shows that the synthetic ratio to achieve balance is not the best value. If the minority class is larger than the majority class, the classifiers will identify the minority class better. But at the same time, if the minority class is identified by the classifiers terribly, the identification accuracy for the majority class will be damaged. Therefore, we need to find a balance between the two so as to achieve the best recognition efficiency.
Change of G-Mean value with the synthetic ratio.
Change of AUC value with the synthetic ratio.
It can be seen that the four indexes of 4 kinds of sampling algorithms increase with the synthetic ratio at first, but then decrease. The “turning point” of each algorithm is not the same. In the experiment of Balance dataset, SROT can maintain a good recognition performance, while the performance of the Sparse-SMOTE and SMOTE becomes unsatisfactory when the synthetic ratio is larger than about 1500 to 1800, in spite of TP-Rate increase. It can be concluded that the two algorithms achieve precision for the minority class in exchange for the recognition performance for the majority class. The “turning point” of the Random Over-sampling algorithm is almost 1500 as well, but when the ratio is larger than this value, the performance of this algorithm is stable.
To conclude, from the figures, the algorithms mentioned above can achieve close to best values of the TP-rate, F-Measure, G-Mean and AUC when the synthetic ratio of Balance dataset is set at 1500. The ratio of the minority class to the majority class becomes 784/576 (imbalance degree equals to 0.735). Certainly, this ratio changes as the dataset changes, but in our experiments, it is proved that the better classification performance is reached by increasing the proportion of the minority class appropriately, when the imbalance degree of the dataset is high.
The imbalance problem of datasets is one of the hottest areas of research in machine learning, pattern recognition and other fields. Traditional classification algorithms that do not consider the distribution of the sample tend to hard identify the minority class, leading to performance fail sharply. Accordingly, we put forward two kinds of sampling techniques based on sparse representation: the Sparse-SMOTE and SROT. According to the performance of ten UCI datasets in Table 1, the Sparse-SMOTE is the most stable method, which can obtain superior classification results regardless of how the dataset changes. This fact is because the k-nearest-neighbors SMOTE algorithm cannot reflect the real distribution of data, meaning that new synthetic sample points are also located around the original data samples, which leads to sparse area being still sparse and so being a dense area. However, the Sparse-SMOTE can overcome this defect, and it can make synthetic sample points uniform in order to enlarge the decision area. At the same time, the algorithm is based on the sparse representation of the overall sample, which is easy to reserve the information of the original dataset avoiding the blindness problem. The SROT can improve the uniformity of the datasets in spite of the consequences that are likely to cause data overlapping and noise enhancing. The SROT can significantly improve the performance of the classifiers and effectively solve the non-equilibrium problem compared to the original dataset and random sampling techniques. Thus, both of our proposed approaches can be widely used in the fields of information theory, earth science, life sciences, etc.
In the future we want to take this work a step further by applying more and further data cleaning techniques on datasets preprocessed by our methods. Moreover, we hope to use other strategies to change the sparse solution for generating synthetic samples and attempt to reduce time complexity.
Footnotes
Acknowledgments
This study was supported by Scientific and Technological Development Scheme of Jilin Province under Grant No. 20180101048JC.
