Optimization of SMOTE for imbalanced data based on AdaRBFNN and hybrid metaheuristics

Abstract

The oversampling ratio $N$ and the number of nearest neighbors $k$ of the minority class are the key hyperparameters of the synthetic minority oversampling technique (SMOTE), which reconstructs the class distribution of a dataset. No optimal default values exist for them. It is therefore necessary to examine how the dataset produced by SMOTE under different hyperparameter combinations influences classification performance. In this paper, we propose a hyperparameter optimization algorithm for imbalanced data. It iteratively searches for reasonable values of $N$ and $k$ for SMOTE so as to build a balanced, high-quality dataset. A model with outstanding performance and strong generalization ability is then trained on this dataset, which effectively addresses imbalanced classification. The proposed algorithm hybridizes the simulated annealing mechanism (SA) with the particle swarm optimization algorithm (PSO). In the optimization, Cohen’s Kappa is used to construct the fitness function, and AdaRBFNN, a new classifier, is built by integrating multiple trained RBF neural networks with the AdaBoost algorithm. The Kappa of each generation is calculated from the classification results to evaluate the quality of the candidate solutions. Experiments are conducted on seven KEEL datasets. The results show that the proposed algorithm delivers excellent performance and can significantly improve the classification accuracy of the minority class.

Keywords

SMOTE, hyperparameter optimization, PSO, SA, AdaRBFNN, imbalanced classification

1. Introduction

Imbalanced data classification is a focus of machine learning and data mining. Traditional machine learning algorithms maximize the overall classification accuracy, which tends to bias the model toward the majority class and to ignore the minority class. As a result, despite the high classification accuracy on the majority class, the generalization ability of the model is very poor. If the test set contains more instances of the minority class, the classification accuracy will be quite low. Practical applications often pay more attention to the prediction results for the minority class. Because the model receives too little training on minority instances, it fails to learn useful information about them, so a high overall accuracy becomes meaningless for imbalanced datasets. In reality, data is often high-dimensional and imbalanced. Data imbalance arises in many fields, such as medical diagnosis [1], image recognition [2] and network intrusion detection [3]. Generally, the misclassification cost of the minority class is higher. Therefore, when facing imbalanced data, we should strive to improve the classification performance on the minority class.

Currently, Internet applications mainly rely on large-scale machine learning to process data. In practice, however, most categories have no accumulated data, so large-scale learning is not fully applicable. In this situation, it becomes necessary to learn a model from a small sample of data within a short period of time. In few-shot learning, data imbalance has a significant impact on classification performance, so the quality of the training set matters greatly. This paper studies imbalanced data classification in few-shot learning based on data enhancement.

SMOTE is widely used as an oversampling technique to improve the class distribution of datasets. However, it is difficult to select the optimal values of $N$ and $k$, which determine the number of newly synthesized minority instances. This paper proposes a hyperparameter optimization algorithm for imbalanced data. It searches for a hyperparameter combination close to the absolute optimum, so that SMOTE synthesizes an appropriate number of minority class instances, a high-quality dataset is built, a classifier with excellent performance is trained, and reliable classification results are obtained. The proposed algorithm clearly improves the classification accuracy of the minority class. Because it combines the advantages of two metaheuristics, it converges quickly, lets particles jump out of local optima, and avoids blind selection of hyperparameters, which makes it more time efficient than grid search. We test the proposed hyperparameter optimization algorithm, RUSBoost and CUSBoost on seven imbalanced datasets. The results demonstrate that the algorithm is competitive with these techniques. It is a promising and valuable method for solving imbalanced classification problems in data mining.

The remainder of the paper is organized as follows. Section 2 summarizes the current related methods on imbalanced classification. Section 3 presents the SMOTE algorithm and points out its blindness. Section 4 constructs the AdaRBFNN classifier. Section 5 describes the details of the proposed hyperparameter optimization algorithm, and briefly introduces the metaheuristic techniques used. Experimental results on 7 groups of imbalanced datasets are discussed in Section 6. Finally, the conclusion is presented in Section 7.

2. Related work

At present, research on imbalanced data classification can be divided into two categories: the data level and the algorithm level. At the data level, the main methods are oversampling and under-sampling; that is, the dataset is balanced by copying minority class instances or deleting majority class instances. The synthetic minority oversampling technique (SMOTE), proposed by Chawla et al. in 2002, is a representative method [4]. It balances the dataset by synthesizing new instances through random linear interpolation between minority class instances and their $k$ nearest neighbors in the feature space. To some extent, it avoids overfitting. In 2005, Borderline-SMOTE was put forward [5]. It divides the minority class into three categories: noise instances, safety instances and danger instances. SMOTE's nearest-neighbor interpolation is applied only to the danger instances, so that the distribution of the synthetic minority class instances becomes more reasonable. In 2008, the Adaptive Synthetic Sampling Approach (ADASYN) was proposed [6]. This method uses a mechanism to automatically determine how many synthetic instances need to be generated for each minority class instance, but it is susceptible to outliers.

At the algorithm level, the main methods include cost-sensitive learning [7] and ensemble learning [8]. Cost-sensitive learning focuses on the cost of misclassification and deals with data imbalance by assigning different penalty parameters to the majority class and the minority class. In ensemble learning, several weak classifiers are combined into one strong classifier to improve the overall generalization performance; boosting and bagging are the two main families. AdaBoost is one of the typical boosting algorithms. In each iteration, it reduces the weights of the instances classified correctly in the previous round and increases the weights of the misclassified instances. It thereby obtains different base classifiers from differently weighted training data, and finally constructs a strong classifier by weighted voting.

Many studies have integrated sampling methods into ensemble learning. In 2010, Seiffert et al. proposed an algorithm named RUSBoost [9], which combines random under-sampling (RUS) with the AdaBoost algorithm. Because random sampling is uncertain, the sampled instances are sometimes not representative, so the improvement of the model is not obvious. In 2003, Chawla et al. proposed the SMOTEBoost algorithm [10]. By combining SMOTE with the AdaBoost algorithm, SMOTEBoost is able to create broader decision regions for the minority class than standard boosting, but it does not consider the impact of the number of synthetic minority class instances on the performance of the model. In 2017, Rayhan et al. proposed a clustering-based under-sampling approach with AdaBoost, called CUSBoost [11]. It uses the k-means algorithm to group the majority class instances into several clusters and selects some instances from each cluster to form a balanced dataset. However, when the feature space is not suitable for clustering, the performance of CUSBoost is greatly reduced.

3. SMOTE algorithm

SMOTE is an improved algorithm based on random oversampling, which improves the generalization ability of the classifier on the test set and reduces the risk of overfitting. It creates synthetic minority class instances by random linear interpolation between each minority class instance and its nearest neighbors and adds them to the original dataset, thus reducing the imbalance of the dataset. The algorithm flow is as follows:

(1) Suppose the original training set is $T$ and the minority class is $X=\{x_{1},x_{2},x_{3},\cdots,x_{i},\cdots,x_{\textit{num}}\}$. Calculate the Euclidean distance from each instance $x_{i}$ to the other instances of the minority class, and obtain the $k$ nearest neighbors of $x_{i}$.
(2) Determine a sampling rate $N$. For each minority class instance $x_{i}$, randomly select the required number of instances from its $k$ nearest neighbors, and denote a selected instance by $x_{ij}$, the $j$th nearest neighbor of $x_{i}$, $j=1,2,\cdots,k$.
(3) Synthesize the new minority class instances by random linear interpolation between $x_{i}$ and $x_{ij}$:

$\displaystyle x_{\textit{new}}=x_{i}+\textit{rand}(0,1)\times(x_{ij}-x_{i})$ (1)

where $\textit{rand}(0,1)$ represents a random number between 0 and 1.
(4) Form a new training set by combining the newly synthesized instances with the original training set $T$.
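
The four steps can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in the experiments: the function name `smote`, the `rng` argument, and the reading of $N$ as an integer number of synthetic instances per minority instance are assumptions made here.

```python
import numpy as np

def smote(X_min, N, k, rng=None):
    """Return N synthetic instances per minority instance, following Eq. (1).

    Assumes N is a positive integer and X_min holds at least two instances.
    """
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    num = len(X_min)
    k = min(k, num - 1)                      # at most num - 1 other instances exist
    # Step (1): pairwise Euclidean distances inside the minority class.
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)           # an instance is not its own neighbor
    neighbours = np.argsort(dist, axis=1)[:, :k]
    synthetic = []
    for i in range(num):
        for _ in range(N):
            j = rng.choice(neighbours[i])    # step (2): pick one of the k neighbors
            gap = rng.random()               # rand(0, 1)
            # Step (3): random linear interpolation between x_i and x_ij.
            synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Step (4) would stack the result onto the original training set.
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_synth = smote(X_minority, N=3, k=2, rng=0)
```

Every synthetic point lies on a segment between two existing minority instances, which is why both $N$ (how many points) and $k$ (which segments are eligible) shape the resulting class distribution.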

The algorithm flow shows that the parameter $k$ specifies the number of nearest neighbors and the parameter $N$ controls the amount of oversampling. Together, $N$ and $k$ determine the instances synthesized by SMOTE, and their best values are difficult to choose. It remains unclear how many instances should be synthesized to improve the classification performance of the model as much as possible, so the choice is largely blind.

4. AdaRBFNN algorithm

The Radial Basis Function Neural Network (RBFNN) [12] is a feedforward neural network with a simple topological structure. It can approximate any nonlinear function with arbitrary precision and has strong nonlinear mapping ability. Compared with the traditional BP neural network, the RBF neural network has only one hidden layer. The transformation from the input layer to the hidden layer is nonlinear, while the transformation from the hidden layer to the output layer is linear. The distance between the input pattern and the center vector is taken as the independent variable of the function, and the radial basis function is used as the activation function. As a result, the RBF neural network offers both excellent generalization performance and the best approximation performance. It overcomes the tendency of the BP neural network to fall into local minima. Moreover, it trains quickly and performs well in classification problems.

To further improve the classification performance and generalization ability of the model, we train multiple RBF neural networks and integrate them into a new classifier, AdaRBFNN, based on the AdaBoost algorithm. The classifier is used in the proposed hyperparameter optimization algorithm to evaluate how the training data obtained under different hyperparameter combinations affects the classification performance of the model. In this paper, the radial basis function is the Gaussian function:

$\displaystyle R(X-c_{i})=\exp\left(-\frac{1}{2\sigma_{i}^{2}}\|X-c_{i}\|^{2}\right),i=1,2,\cdots,h$ (2)

At this time, the output of RBF neural network is:

$\displaystyle y_{j}=\sum_{i=1}^{h}{w_{ij}}\exp\left(-\frac{1}{2\sigma_{i}^{2}}\|X-c_{i}\|^{2}\right),j=1,2,\cdots,P$ (3)

$X=[x_{1},\cdots,x_{k}]^{T}$ is the input sample; $h$ is the number of hidden layer neurons, and $P$ is the number of output nodes; the center $c_{i}$ of a hidden layer neuron is a vector with the same dimension as $X$, randomly selected from the input samples; $\sigma_{i}$ is the width of the Gaussian function, determined by Eq. (4):

$\displaystyle\sigma_{i}=\frac{d_{i}}{\sqrt{2h}}$ (4)

$d_{i}$ is the maximum distance between the selected centers; $w_{ij}$ is the weight between the hidden layer and the output layer, which can be directly calculated by the least squares method:

$\displaystyle w_{ij}=\exp\left(\frac{h}{d_{i}^{2}}\|{X-c_{i}}\|^{2}\right)$ (5)
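
A single RBF network of this kind can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, a single width shared by all hidden neurons is computed from Eq. (4), and the output weights are fitted with an ordinary least-squares solve (`numpy.linalg.lstsq`) instead of evaluating Eq. (5) directly.

```python
import numpy as np

def rbf_hidden(X, centers, sigma):
    """Gaussian hidden-layer activations of Eq. (2), one column per center."""
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def fit_rbfnn(X, y, h, rng=None):
    """Fit a single-output RBF network; requires 2 <= h <= len(X)."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    # Centers are drawn at random from the input samples.
    centers = X[rng.choice(len(X), size=h, replace=False)]
    # d: maximum distance between the selected centers; width from Eq. (4).
    gaps = centers[:, None, :] - centers[None, :, :]
    d = np.sqrt((gaps ** 2).sum(axis=2)).max()
    sigma = d / np.sqrt(2.0 * h)
    # Output weights by least squares on the hidden activations (assumption).
    H = rbf_hidden(X, centers, sigma)
    w, *_ = np.linalg.lstsq(H, np.asarray(y, dtype=float), rcond=None)
    return centers, sigma, w

def predict_rbfnn(X, centers, sigma, w):
    """Network output of Eq. (3) for a single output node."""
    return rbf_hidden(np.asarray(X, dtype=float), centers, sigma) @ w

# XOR-style toy data: not linearly separable, but easy for Gaussian units.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y_toy = np.array([-1.0, 1.0, 1.0, -1.0])
centers, sigma, w = fit_rbfnn(X_toy, y_toy, h=4, rng=0)
```

Because the hidden-to-output map is linear, the only training step after choosing the centers is one linear solve, which is what makes RBF networks fast to train.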

The training process of AdaRBFNN is as follows:

Input: Training data $S=\{(x_{1},y_{1}),(x_{2},y_{2}),\cdots,(x_{M},y_{M})\}$ , where $x_{i}\in X\subseteq R^{n}$ , $y_{i}\in Y=\{-1,1\}$ .

(1) Initialize the weight distribution of training data:

$\displaystyle D_{1}(i)=\frac{1}{M},i=1,2,\cdots,M$

(2) For $t=1$ to $T$:

(a) The base classifier is trained by the weight distribution $D_{t}$:

$\displaystyle\textit{RBFNN}_{t}(x):X\to\{-1,1\}$

(b) Calculate classification error rate:

$\displaystyle e_{t}=\sum_{i=1}^{M}{D_{t}(i)}I(\textit{RBFNN}_{t}(x_{i})\neq y_{i})$

(c) Calculate the weight of the base classifier:

$\displaystyle\alpha_{t}=\frac{1}{2}\ln\frac{1-e_{t}}{e_{t}}$

(d) Update the weight of training data:

$\displaystyle D_{t+1}(i)=\frac{D_{t}(i)}{Z_{t}}\exp(-\alpha_{t}y_{i}\textit{RBFNN}_{t}(x_{i}))$

$Z_{t}$ is the normalization factor:

$\displaystyle Z_{t}=\sum_{i=1}^{M}{D_{t}}(i)\exp(-\alpha_{t}y_{i}\textit{RBFNN}_{t}(x_{i}))$

(3) End for

(4) Construct a linear combination of base classifiers and output a strong classifier AdaRBFNN:

$\displaystyle\textit{AdaRBFNN}(x)=\textit{sign}\left[\sum_{t=1}^{T}{\alpha_{t}}\textit{RBFNN}_{t}(x)\right]$
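
The boosting loop itself is independent of the base learner, which the following sketch makes explicit. To stay short and self-contained it uses a weighted decision stump as a stand-in for the RBF network; the names `fit_stump` and `adaboost` and the clipping of $e_{t}$ away from 0 and 1 are assumptions of this illustration, not part of the procedure above.

```python
import numpy as np

def fit_stump(X, y, D):
    """Stand-in weak learner: best weighted threshold on a single feature."""
    best = (np.inf, 0, 0.0, 1)               # (weighted error, feature, threshold, polarity)
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            for pol in (1, -1):
                pred = np.where(pol * (X[:, f] - thr) >= 0, 1, -1)
                err = D[pred != y].sum()
                if err < best[0]:
                    best = (err, f, thr, pol)
    _, f, thr, pol = best
    return lambda Z: np.where(pol * (Z[:, f] - thr) >= 0, 1, -1)

def adaboost(X, y, T, fit_base=fit_stump):
    """Train T base classifiers and return the sign of their weighted vote."""
    M = len(y)
    D = np.full(M, 1.0 / M)                   # step (1): uniform weights
    learners, alphas = [], []
    for _ in range(T):                        # step (2)
        h = fit_base(X, y, D)                 # (a) train on distribution D_t
        pred = h(X)
        # (b) weighted error, clipped so the logarithm below stays finite
        e = np.clip(D[pred != y].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - e) / e)     # (c) weight of the base classifier
        D = D * np.exp(-alpha * y * pred)     # (d) re-weight the training data
        D /= D.sum()                          # division by Z_t
        learners.append(h)
        alphas.append(alpha)

    def strong(Z):                            # step (4): weighted vote
        return np.sign(sum(a * h(Z) for a, h in zip(alphas, learners)))
    return strong

# No single threshold separates these labels, but three boosted stumps do.
X_line = np.arange(6.0).reshape(-1, 1)
y_line = np.array([1, 1, -1, -1, 1, 1])
clf = adaboost(X_line, y_line, T=3)
```

Replacing `fit_stump` with a routine that fits an RBF network to the weighted data would give the AdaRBFNN structure described in this section.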

5. Hyperparameter optimization algorithm

In grid search, parameters are adjusted by a fixed step size within a specified range. The method must go through all possible hyperparameter combinations, which is expensive in computing resources. To save time and cost, a relatively sparse grid is sometimes selected, but the optimal value may then be missed. Therefore, we propose to optimize the selection of the two hyperparameters $N$ and $k$ based on SA and PSO, so as to improve the classification performance of AdaRBFNN and cope effectively with imbalanced data classification.

5.1 Simulated annealing algorithm

Simulated annealing algorithm (SA) is a stochastic optimization algorithm based on Monte Carlo iterative solution strategy [13], which adopts the Metropolis algorithm and properly controls the temperature drop process to realize simulated annealing. The algorithm steps are as follows:

(1) Initialize: set $s=0$, choose an initial temperature $T_{s}$ that is large enough, an initial solution $x_{0}$, and the number of iterations $L$ at each temperature.
(2) For $s=1,2,\cdots,L$, perform steps (3) to (4).
(3) At the current temperature, apply a random disturbance to $x$ to generate a new feasible solution ${x}^{\prime}$. Calculate the difference of the objective function, $\Delta f=f({x}^{\prime})-f(x)$. Accept ${x}^{\prime}$ if $\min\{1,\exp(-\Delta f/T_{s})\}>\textit{random}[0,1]$, where $\textit{random}[0,1]$ is a random number in $[0,1]$.
(4) Set $T_{s+1}=CT_{s}$, $s=s+1$, with $C\in(0,1)$. If the termination condition is satisfied, output the current solution as the optimal solution and stop the annealing process; otherwise, return to step (2).
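
A compact version of these steps, written for minimization, might look as follows. It is a sketch only: the default values, the stopping rule `T > T_min`, and the `neighbour` callback that produces the random disturbance are assumptions of this illustration.

```python
import math
import random

def simulated_annealing(f, x0, neighbour, T0=100.0, C=0.95, L=50,
                        T_min=1e-3, seed=None):
    """Minimise f starting from x0; returns the best solution seen and its value."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best_x, best_f = x, fx
    T = T0
    while T > T_min:                      # termination condition (assumed here)
        for _ in range(L):                # L iterations at each temperature
            x_new = neighbour(x, rng)     # random disturbance of x
            f_new = f(x_new)
            delta = f_new - fx
            # Metropolis rule: improvements are always accepted, worse
            # solutions with probability exp(-delta / T).
            if delta <= 0 or rng.random() < math.exp(-delta / T):
                x, fx = x_new, f_new
                if fx < best_f:
                    best_x, best_f = x, fx
        T *= C                            # cooling: T_{s+1} = C * T_s
    return best_x, best_f

# Toy usage: minimise (x - 3)^2 with uniform steps of size at most 1.
best_x, best_f = simulated_annealing(
    lambda x: (x - 3.0) ** 2,
    0.0,
    lambda x, rng: x + rng.uniform(-1.0, 1.0),
    seed=0,
)
```

At high temperature almost every move is accepted, so the search wanders freely; as `T` shrinks the rule becomes nearly greedy.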

In the iterative process, the control parameter (the temperature) is gradually decreased and solutions are accepted probabilistically, which lets the algorithm jump out of local optima. When the initial temperature is high enough and annealing is slow enough, SA can converge to the global optimal solution with probability 1. The probability of accepting a worse solution decreases as the temperature drops. Candidate states are generated randomly, so a new state is not necessarily better than the previous one. The search strategy of SA helps avoid getting trapped in local optima and is robust. However, a high initial temperature and a proper annealing speed are usually required. If the parameters are not set properly, search efficiency and performance will suffer.

5.2 Particle swarm optimization algorithm

Particle swarm optimization (PSO) is an intelligent algorithm designed by simulating the foraging behavior of bird flocks [14]. PSO is initialized with a group of random particles, and the optimal solution is then found through iteration. In the $D$-dimensional target search space, we assume a population of $Q$ particles, where the position of particle $i$ is expressed as $X_{i}^{t}=(x_{i1}^{t},x_{i2}^{t},\cdots,x_{iD}^{t})$, $i=1,2,\cdots,Q$, and its flight speed is expressed as $V_{i}^{t}=(v_{i1}^{t},v_{i2}^{t},\cdots,v_{iD}^{t})$, $i=1,2,\cdots,Q$. Each $X_{i}^{t}$ is a potential solution whose fitness value is obtained by substituting it into the objective function. Denote the individual extremum by $p_{\textit{best}}^{t}=(p_{i1}^{t},p_{i2}^{t},\cdots,p_{iD}^{t})$, $i=1,2,\cdots,Q$, and the global extremum by $g_{\textit{best}}^{t}=(p_{g1}^{t},p_{g2}^{t},\cdots,p_{gD}^{t})$. In each iteration, particles update their speed and position according to the following formulas:

$\displaystyle v_{ij}^{t+1}=w\cdot v_{ij}^{t}+c_{1}r_{1}[p_{ij}^{t}-x_{ij}^{t}]+c_{2}r_{2}[p_{gj}^{t}-x_{ij}^{t}]$ (6)

$\displaystyle x_{ij}^{t+1}=x_{ij}^{t}+v_{ij}^{t+1}$ (7)

$c_{1}$ and $c_{2}$ are the learning factors; $r_{1}$ and $r_{2}$ are random numbers within $[0,1]$; the inertia weight $w$ reflects how much of its previous speed a particle retains, and is used to balance the local and global search abilities. PSO adjusts $w$ dynamically with a linearly decreasing weight strategy, given by:

$\displaystyle w=w_{\max}-\frac{(w_{\max}-w_{\min})\cdot t}{T_{\max}}$ (8)

$T_{\max}$ represents the maximum number of iterations, $t$ is the current number of iterations, $w_{\max}$ represents the maximum inertia weight, $w_{\min}$ represents the minimum inertia weight. In 1999, Shi et al. proposed that the decrease of inertia weight from 0.9 to 0.4 can make particles have better global search ability in the early stage and better local search ability in the later stage [15]. In this paper, $w_{\max}={0.9}$ , $w_{\min}={0.4}$ .

Each particle moves through the solution space at a certain speed and gathers toward its historical best position $p_{\textit{best}}$ and the population best position $g_{\textit{best}}$, which realizes the evolution of the candidate solutions. PSO has few parameters and is easy to implement. Although it has strong global search ability for nonlinear and multimodal problems, it is prone to premature convergence and easily falls into local optima.
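
Equations (6) to (8) translate almost line by line into array code. The sketch below minimizes a function over a box; the function name `pso`, the learning factors $c_{1}=c_{2}=2$, the zero initial velocities and the clipping of positions to the search range are assumptions made for this illustration.

```python
import numpy as np

def pso(f, lower, upper, Q=20, T_max=100, c1=2.0, c2=2.0,
        w_max=0.9, w_min=0.4, rng=None):
    """Minimise f over the box [lower, upper] with a basic PSO."""
    rng = np.random.default_rng(rng)
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    D = len(lower)
    X = rng.uniform(lower, upper, size=(Q, D))     # random initial positions
    V = np.zeros((Q, D))                           # initial velocities (assumed zero)
    p_best = X.copy()
    p_val = np.array([f(x) for x in X])            # individual extrema
    g_best = p_best[p_val.argmin()].copy()         # global extremum
    for t in range(T_max):
        w = w_max - (w_max - w_min) * t / T_max    # Eq. (8)
        r1, r2 = rng.random((Q, D)), rng.random((Q, D))
        V = w * V + c1 * r1 * (p_best - X) + c2 * r2 * (g_best - X)  # Eq. (6)
        X = np.clip(X + V, lower, upper)           # Eq. (7), kept inside the box
        val = np.array([f(x) for x in X])
        improved = val < p_val
        p_best[improved], p_val[improved] = X[improved], val[improved]
        g_best = p_best[p_val.argmin()].copy()
    return g_best, float(p_val.min())

# Toy usage: the 2-D sphere function has its minimum 0 at the origin.
g_best, g_val = pso(lambda x: float((x ** 2).sum()), [-5, -5], [5, 5], rng=0)
```

With a large `w` early on the particles overshoot and explore; as `w` falls toward 0.4 the attraction terms dominate and the swarm contracts around the best positions found.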

5.3 The proposed algorithm

To avoid the local convergence of PSO, we introduce the simulated annealing mechanism into its optimization process. The resulting hybrid metaheuristic is used as the framework of the hyperparameter optimization algorithm, which searches within the parameter range for the ideal $N$ and $k$ to balance the class distribution of the training data. The hybrid allows particles to jump out of local optima, overcomes premature convergence, enhances the diversity of the population, and improves the convergence speed. In this paper, the fitness function is constructed based on Cohen’s Kappa [16], so evaluating the fitness function amounts to computing the Kappa coefficient. The Kappa of each generation is calculated from the classification results of AdaRBFNN and is used as the standard for judging the quality of candidate solutions. Finally, the optimal solution, that is, the best values of $N$ and $k$, is obtained when the Kappa coefficient is maximized through iteration. The flow chart of the hyperparameter optimization algorithm is shown in Fig. 1.

Figure 1.

Flow chart of hyperparameter optimization algorithm.

The Kappa coefficient is calculated from the confusion matrix. It is used for consistency testing and can also measure classification accuracy. It helps judge whether a reported classification accuracy is reliable. The calculation formula is:

$\displaystyle\textit{kappa}=\frac{p_{0}-p_{e}}{1-p_{e}}$ (9)

$p_{0}$ represents the overall classification accuracy, and $p_{e}$ is the sum, over all classes, of the product of the actual and predicted numbers of instances of that class, divided by the square of the total number of instances. According to Eq. (9), the more imbalanced the confusion matrix is, the higher $p_{e}$ is and the lower Kappa is, so a biased model is penalized more heavily. Kappa usually lies between 0 and 1 and can be divided into five levels of consistency: 0.0 $\sim$ 0.20 means extremely low, 0.21 $\sim$ 0.40 is general, 0.41 $\sim$ 0.60 is good, 0.61 $\sim$ 0.80 means highly consistent and 0.81 $\sim$ 1 is almost completely consistent. Imbalanced data classification can sometimes deliver high accuracy while Kappa is very low. To ensure that the accuracy is reliable, Kappa must also be as large as possible. Therefore, in this paper, the Kappa coefficient is taken as the objective function value, and a control condition is set in the iterative process: if Kappa $<$ 0.4, then Kappa $=-\infty$. In other words, a higher classification accuracy is pursued, and is meaningful, only when the Kappa coefficient is not less than 0.4. This condition accelerates the elimination of inferior solutions and improves the optimization efficiency.
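
A short worked example shows why Kappa is a safer objective than accuracy. The sketch below computes Eq. (9) from a confusion matrix and applies the control condition; the function names and the matrix layout (rows are actual classes, columns are predicted classes) are assumptions of this illustration.

```python
import math

def cohen_kappa(confusion):
    """Kappa of Eq. (9); confusion[i][j] counts actual class i predicted as j.

    Undefined (division by zero) when p_e equals 1.
    """
    n = sum(sum(row) for row in confusion)
    classes = range(len(confusion))
    p0 = sum(confusion[i][i] for i in classes) / n
    # Sum over classes of (actual count * predicted count), over n squared.
    pe = sum(sum(confusion[i]) * sum(row[i] for row in confusion)
             for i in classes) / n ** 2
    return (p0 - pe) / (1 - pe)

def kappa_fitness(confusion, threshold=0.4):
    """Objective with the control condition: Kappa below 0.4 becomes -inf."""
    kappa = cohen_kappa(confusion)
    return kappa if kappa >= threshold else -math.inf

# A model that labels everything as the majority class: 95% accuracy, Kappa 0.
majority_only = [[95, 0], [5, 0]]
# A model that errs on both classes: 85% accuracy, Kappa 0.7.
balanced_errors = [[45, 5], [10, 40]]
```

For `majority_only`, $p_{0}=p_{e}=0.95$, so Kappa is 0 and the candidate is discarded by the control condition despite its high accuracy; for `balanced_errors`, $p_{0}=0.85$ and $p_{e}=0.5$, giving Kappa $=0.7$.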

The algorithm first initializes the population and parameters, and calculates the inertia weight of each generation. SMOTE is then run with the $N$ and $k$ encoded by each particle, and AdaRBFNN is trained on the resulting class distribution. From the classification results, the Kappa coefficient of each particle in the generation is calculated. A new solution is accepted by the Metropolis criterion $\min\{1,\exp(-(\textit{fit}(x_{i})-\textit{fit}(x_{i}^{t}))/T)\}>\textit{random}[0,1]$, and the current best Kappa value and individual optimal position are stored in $f_{im}$ and $p_{id}$ respectively, until all particles of the generation have been traversed. The global optimal position and the global optimal fitness value are then stored in $p_{gd}$ and $f_{gm}$ respectively, and the particle speeds and positions are updated. In the early stage $w$ is large, so the global search ability is strong and each particle can search for a better solution over the whole range with a larger speed and step size. In the later stage $w$ is small, so the local search ability is strong and the particle can search more precisely near the extremum. As the annealing temperature decreases, the algorithm accepts inferior solutions less and less often, and the search continues until the maximum number of iterations is reached. Finally, the best $N$ and $k$ found are used to run SMOTE. After the class distribution of the dataset is balanced, a satisfactory classifier can be trained. The specific optimization process is shown in Algorithm 1.

Algorithm 1: Learning procedure of hyperparameter optimization algorithm
Set parameters such as the population size pop and the maximum number of iterations $G$
Set the initial temperature $T$, the cooling coefficient $C$ and other parameters
for $j=1$ to pop
    Initialize each particle
end for
for $i=1$ to $G$
    Calculate the current inertia weight $w$
    for $j=1$ to pop
        Calculate the objective function $f_{ij}$ of the current particle
        if $f_{ij}<0.4$
            $f_{ij}=-\infty$
        end if
        if $f_{ij}>f_{im}$ // $f_{im}$ is the best fitness value of the individual in history
            $f_{im}=f_{ij}$
            $p_{id}=x_{ij}$ // assign the current position $x_{ij}$ to the historical best position $p_{id}$
        else
            Calculate $\Delta f=f_{ij}-f_{im}$
            $P=\exp(\Delta f/(C*T))$ // $\Delta f\leq 0$ in this branch, so $P\leq 1$
            if $\textit{rand}<P$
                $f_{im}=f_{ij}$
                $p_{id}=x_{ij}$
            end if
        end if
    end for
    if $\max(f_{im})>f_{gm}$ // $f_{gm}$ is the best fitness value of the population in history
        $f_{gm}=\max(f_{im})$
        $p_{gd}=x_{g}$ // assign the best position $x_{g}$ in the current population to the best position $p_{gd}$ of the population throughout history
    end if
    for $j=1$ to pop
        Calculate particle velocity $v_{ij}$
        Update particle position $x_{ij}$
    end for
end for
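
The control flow of Algorithm 1 can be sketched with a pluggable fitness function. In the paper the fitness of a particle is the thresholded Kappa obtained after running SMOTE and training AdaRBFNN; here a toy function `toy_fitness` stands in for that pipeline so the example stays self-contained. The function names, default parameter values, rounding of positions to integers, per-generation cooling `T *= C`, and updating the global best at every evaluation are all assumptions of this illustration.

```python
import math
import numpy as np

def sa_pso(fitness, lower, upper, pop=10, G=30, T0=1.0, C=0.9,
           c1=2.0, c2=2.0, w_max=0.9, w_min=0.4, rng=None):
    """Maximise fitness(N, k) over an integer box with PSO plus a Metropolis rule."""
    rng = np.random.default_rng(rng)
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    X = rng.uniform(lower, upper, size=(pop, len(lower)))   # particle positions
    V = np.zeros_like(X)                                    # particle velocities
    p_pos, p_fit = X.copy(), np.full(pop, -np.inf)          # individual bests
    g_pos, g_fit = X[0].copy(), -np.inf                     # population best
    T = T0
    for i in range(G):
        w = w_max - (w_max - w_min) * i / G                 # Eq. (8)
        for j in range(pop):
            f = fitness(*np.rint(X[j]).astype(int))         # N and k are integers
            if f > p_fit[j]:
                p_fit[j], p_pos[j] = f, X[j]
            elif np.isfinite(f) and rng.random() < math.exp((f - p_fit[j]) / T):
                # Metropolis rule for maximisation: a worse solution may
                # replace the individual best with probability exp(df / T).
                p_fit[j], p_pos[j] = f, X[j]
            if f > g_fit:                                   # population best never worsens
                g_fit, g_pos = f, X[j].copy()
        r1, r2 = rng.random(X.shape), rng.random(X.shape)
        V = w * V + c1 * r1 * (p_pos - X) + c2 * r2 * (g_pos - X)   # Eq. (6)
        X = np.clip(X + V, lower, upper)                    # Eq. (7), kept in range
        T *= C                                              # cooling
    return tuple(int(v) for v in np.rint(g_pos)), float(g_fit)

def toy_fitness(N, k):
    """Hypothetical stand-in for the Kappa of AdaRBFNN trained after SMOTE(N, k)."""
    value = 1.0 - ((N - 4) ** 2 + (k - 5) ** 2) / 100.0
    return value if value >= 0.4 else -math.inf             # control condition

best_Nk, best_fit = sa_pso(toy_fitness, lower=[1, 1], upper=[10, 10], rng=0)
```

Letting a slightly worse position overwrite an individual best keeps the attraction points moving while the temperature is high, which is how the annealing step counteracts premature convergence of the swarm.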

5.4 Time complexity of the proposed algorithm

The time complexity of the algorithm can be analyzed with big $O$ notation, which describes how the size of the input data influences the algorithm’s use of computing resources. To determine it, the code execution process and loop conditions must be examined. For the proposed algorithm, the time complexity is $O\{GQ[RD+T(T+N)]\}$, where $G$ is the number of iterations of the outermost loop, $Q$ is the population size, $R$ is the number of trained weak classifiers, $D$ is the number of features of the training data, $T$ is the number of minority class instances, and $N$ is the oversampling ratio. More specifically, the method is a nested loop. For each particle in the two-dimensional search space, several subfunctions are executed to complete the instructions of the algorithm and generate solutions according to the number of dimensions. The subfunctions have their own loops, and the size of each loop determines the time complexity of the algorithm.

6. Experiment

6.1 Data set description

To test the applicability of the proposed algorithm, 7 imbalanced datasets from different fields in the KEEL database are used in the experiment. The datasets come from a reliable source and contain no missing values. Boxplots of the data show no obvious abnormal values, so no cleaning is needed. Finally, the data are normalized. See Table 1 for details. All datasets have two classes, so the tasks are imbalanced binary classification problems. IR is the ratio of the number of majority class instances to the number of minority class instances.

Table 1
Datasets

Datasets	Instances	Features	Negative	Positive	IR
Wisconsin	683	9	444	239	1.86
Glass6	214	9	185	29	6.38
Yeast-2vs4	514	8	463	51	9.08
Ecoli4	336	7	316	20	15.8
Led7digit	443	7	406	37	10.97
Page blocks-1-3vs4	472	10	444	28	15.86
Abalone-21vs8	581	8	567	14	40.5

6.2 Evaluation index

In the binary classification task, the majority class is regarded as negative and the minority class as positive. The confusion matrix of classification results is shown in Table 2.

Table 2
Confusion matrix of classification results

Actual class	Predicted positive	Predicted negative
Positive	$TP$	$FN$
Negative	$FP$	$TN$

The following measures can be defined by confusion matrix:

$\displaystyle\textit{recall}=\frac{TP}{TP+FN}$

$\displaystyle\textit{precision}=\frac{TP}{TP+FP}$

$\displaystyle\textit{specificity}=\frac{TN}{TN+FP}$

$\displaystyle\textit{F-measure}=\frac{(1+\beta^{2})\times\textit{precision}\times\textit{recall}}{\beta^{2}\times\textit{precision}+\textit{recall}}$

The relative importance of recall to precision is measured by $\beta$ , usually taken as 1. F-measure can comprehensively consider recall and precision, and both of them are used to reflect the classification performance of positive instances. As a result, the higher the F-measure, the better the classification effect of models for positive instances.

$\displaystyle\textit{G-mean}=\sqrt{\textit{recall}\times\textit{specificity}}$

This index takes into account the classification performance of the model on both classes at the same time; G-mean is high only when both recall and specificity are high. The larger the G-mean, the better the overall classification performance of the model. When the data is imbalanced, the index is of great reference value. In this paper, F-measure and G-mean are used to evaluate the classification effect of different algorithms on imbalanced data.
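
The measures above follow directly from the four cells of Table 2. The sketch below is illustrative only: the function name and the example counts are made up for demonstration, and the function assumes that none of the denominators is zero.

```python
import math

def imbalance_metrics(tp, fn, fp, tn, beta=1.0):
    """Recall, precision, specificity, F-measure and G-mean from Table 2 counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)
    f_measure = ((1 + beta ** 2) * precision * recall
                 / (beta ** 2 * precision + recall))
    g_mean = math.sqrt(recall * specificity)
    return {
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "f_measure": f_measure,
        "g_mean": g_mean,
    }

# Hypothetical counts: 10 positive and 90 negative test instances.
scores = imbalance_metrics(tp=8, fn=2, fp=4, tn=86)
```

With these counts the accuracy is 94%, yet the F-measure is only $16/22\approx 0.73$, which reflects the four false positives and two false negatives on the small positive class.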

6.3 Experimental results and analysis

Firstly, according to the classification accuracy and Kappa of the model, the optimization effect of this algorithm on SMOTE is verified. The experimental results are shown in Table 3.

Table 3
Optimal hyperparameter combination, Kappa and classification accuracy based on hyperparameter optimization algorithm

Datasets	Best $N$	Best $k$	Kappa	Accuracy (%)
Wisconsin	1	1	0.9509	98.0488
Glass6	7	3	1	100
Yeast-2vs4	1	2	0.6466	92.9032
Ecoli4	1	6	0.7285	96.0396
Led7digit	2	5	0.6478	92.4812
Page blocks-1-3vs4	2	3	1	100
Abalone-21vs8	8	7	0.6615	98.8571

Figure 2 shows the convergence of the proposed hyperparameter optimization algorithm on the seven datasets. The horizontal axis is the number of iterations, and the vertical axis represents the optimal value of the objective function in the population.

Figure 2.

The convergence curve of the optimal fitness value in the population changing along with the number of iterations. a. Wisconsin convergence curve. b. Glass6 convergence curve. c. Yeast-2vs4 convergence curve. d. Ecoli4 convergence curve. e. Led7digit convergence curve. f. Page blocks-1-3vs4 convergence curve. g. Abalone-21vs8 convergence curve.

According to the experimental results, all Kappa values are greater than 0.4 and all classification accuracies are higher than 90%, which indicates that the optimization of SMOTE is successful. The Kappa values of four datasets lie between 0.61 and 0.80, indicating that the classification results are highly consistent with the actual results; the Kappa value of Wisconsin is greater than 0.8, indicating that the classification results are almost completely consistent with the actual results; and Glass6 and Page blocks-1-3vs4 are classified completely correctly. Owing to the stochastic search of the metaheuristic, the proposed algorithm finds a hyperparameter combination close to the absolute optimum within 30 iterations, so that it converges to a better Kappa and the model obtains reliable classification results. Grid search must try every parameter combination, and its running time grows as $N$ increases. In contrast, the proposed algorithm is time efficient and improves the performance of the classifier.

Next, we compare the proposed hyperparameter optimization algorithm with CUSBoost and RUSBoost, and use G-mean and F-measure for each data set to verify the classification performance of the model. The experimental results are shown in Table 4.

Table 4

G-mean and F-measure drawn from three algorithms

Datasets	G-mean			F-measure
	RUSBoost	CUSBoost	Proposed method	RUSBoost	CUSBoost	Proposed method
Wisconsin	0.9356	0.8563	0.9809	0.9166	0.8035	0.9643
Glass6	0.7746	0.9414	1	0.75	0.7362	1
Yeast-2vs4	0.9332	0.9443	0.8685	0.625	0.6605	0.6857
Ecoli4	0.9677	0.9477	0.9567	0.6667	0.5469	0.75
Led7digit	0.7919	0.8924	0.9211	0.4545	0.6741	0.6875
Page blocks-1-3vs4	0.9773	0.9782	1	0.7143	0.7625	1
Abalone-21vs8	0.9778	0.9786	0.7071	0.4444	0.5148	0.6667

Table 4 shows that the proposed hyperparameter optimization algorithm achieves a higher F-measure than RUSBoost and CUSBoost on all seven datasets, and a higher G-mean on four of them (Wisconsin, Glass6, Led7digit and Page blocks-1-3vs4). Glass6 and Page blocks-1-3vs4 are classified completely correctly, with both metrics reaching 1. The experimental results indicate that the proposed algorithm classifies the minority class better and copes effectively with the problem of imbalanced data classification.
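For reference, the two metrics reported in Table 4 can be computed from a binary confusion matrix as in the sketch below. It assumes the standard definitions, with the minority class as the positive class: G-mean is the geometric mean of sensitivity and specificity, and F-measure is the harmonic mean of precision and recall. The function names and the counts are hypothetical and are not taken from the experiments.

```python
import math


def g_mean(tp, fn, fp, tn):
    """Geometric mean of sensitivity (minority recall) and specificity."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


def f_measure(tp, fn, fp, tn, beta=1.0):
    """F-measure of the positive (minority) class; beta=1 gives the usual F1."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical test set: 4 minority and 96 majority instances.
counts = dict(tp=3, fn=1, fp=2, tn=94)
print(round(g_mean(**counts), 4), round(f_measure(**counts), 4))
```

Both metrics depend on how the minority class is handled: G-mean drops to 0 if no minority instance is recognized, and F-measure penalizes false positives through precision, which is why the two can rank the same methods differently.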

7. Conclusion

This paper studies imbalanced classification in few-shot learning and, to that end, makes full use of the existing data for data augmentation. The proposed hyperparameter optimization algorithm is based on the PSO framework. By introducing the simulated annealing mechanism, particles can quickly jump out of local optima and the convergence speed is improved. In the optimization process, the fitness function is constructed from the Kappa coefficient, and the proposed AdaRBFNN serves as the classifier, so that the effect of each hyperparameter combination on classification is evaluated by its Kappa value. Through iteration, we find a set of suitable hyperparameters for SMOTE. The optimized SMOTE not only resolves the class imbalance but also yields a high-quality dataset by synthesizing minority instances. A classifier with excellent performance can then be trained, giving reliable classification results.

A metaheuristic algorithm avoids blind selection of parameters. Guided by the optimization goal, it performs a heuristic search in the preset search space and improves the candidate solutions from iteration to iteration, which is more time-efficient than grid search. Extensive experiments and comparisons with other classification techniques demonstrate that the proposed method is widely applicable and competitive. The trained AdaRBFNN model has strong generalization ability and excellent classification performance, and significantly improves the classification accuracy of the minority class. The proposed method deals effectively with imbalanced classification and offers a new idea for this kind of problem in data mining. Given the computing power of conventional computers, AdaRBFNN is better suited to few-shot learning. However, the hyperparameter optimization algorithm itself is general, and users can choose any classifier according to the dataset at hand.

The specificity of the proposed method is that the fitness function is designed for SMOTE. Since the Kappa coefficient is an effective measure of classification performance that reflects the consistency between predicted and actual labels on the test dataset, we use Kappa to define the fitness function and thereby tell whether the classification accuracy is within an acceptable confidence level. Kappa thus helps explain the reliability of the AdaRBFNN model, and it also directly reflects the enhancement effect of SMOTE after parameter optimization. The universality of the method lies in the fact that users are free to choose the classifier according to the dataset, or to apply the proposed PSO and SA framework to the parameter optimization of other algorithms; only a reasonable fitness function has to be defined. For example, it can be used to search for the optimal penalty factor $C$ and kernel function parameter $\sigma$ of an SVM, with the position of each particle expressed as $(C,\sigma)$; it can also optimize the weights and thresholds of a BP neural network, in which case the dimension of the particle velocity and position vectors equals the total number of weights and thresholds. In future work, we will further study the combination of the enhanced SMOTE with other ensemble learning methods.
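The sketch below illustrates how such a framework can be reused with a user-defined fitness function. It is a minimal sketch under stated assumptions, not the exact procedure of this paper: it couples a standard PSO velocity update with a Metropolis-style acceptance rule for each particle's personal best, which is one common way to hybridize PSO and SA. The names `pso_sa` and `toy_fitness`, the coefficients `w`, `c1`, `c2`, `t0` and `cooling`, and the search bounds are all hypothetical. The toy fitness stands in for a Kappa value that would normally be obtained by training and testing a classifier at the position $(C,\sigma)$.

```python
import math
import random


def pso_sa(fitness, bounds, n_particles=20, n_iter=30,
           w=0.7, c1=1.5, c2=1.5, t0=1.0, cooling=0.9, seed=0):
    """Maximize `fitness` over box `bounds` with PSO plus SA-style acceptance."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    temp = t0
    for _ in range(n_iter):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Clamp the new position to the search space.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            fit = fitness(pos[i])
            delta = fit - pbest_fit[i]
            # Metropolis rule: always accept an improvement; accept a worse
            # personal best with probability exp(delta / temp).
            if delta > 0 or rng.random() < math.exp(delta / temp):
                pbest[i], pbest_fit[i] = pos[i][:], fit
            # The global best only ever improves.
            if fit > gbest_fit:
                gbest, gbest_fit = pos[i][:], fit
        temp *= cooling  # geometric cooling schedule
    return gbest, gbest_fit


def toy_fitness(p):
    """Hypothetical stand-in for Kappa, peaking at C = 10, sigma = 0.5."""
    c, sigma = p
    return -((math.log10(c) - 1.0) ** 2 + (sigma - 0.5) ** 2)


best, best_fit = pso_sa(toy_fitness, bounds=[(0.1, 100.0), (0.01, 5.0)])
print(best, best_fit)
```

Replacing `toy_fitness` with a function that trains a classifier at the given position and returns its Kappa on held-out data is the only change needed to target a different model; for SMOTE, the position would be $(N, k)$ with $k$ rounded to an integer.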

Acknowledgments

This research was supported by the Natural Science Foundation of Liaoning Province under Grant No. 201602259. The authors sincerely thank the laboratory of science college of Northeastern University for its equipment support and the Liaoning natural science foundation for its financial support.

References

1. Khalilia, Chakraborty and Popescu, Predicting disease risks from highly imbalanced data using random forest, BMC Medical Informatics and Decision Making 11(1) (2011), 51–51.

2. Liu Y.H. and Chen Y.T., Face recognition using total margin-based adaptive fuzzy support vector machines, IEEE Transactions on Neural Networks 18(1) (2007), 178–192.

3. Ismail, In-Ho and Ravi, An intrusion detection system based on multi-level clustering for hierarchical wireless sensor networks, Sensors 15(11) (2015), 28960–28978.

4. Chawla N.V., Bowyer K.W., Hall L.O. et al., SMOTE: synthetic minority oversampling technique, Journal of Artificial Intelligence Research 16 (2002), 321–357.

5. Han, Wang W.Y. and Mao B.H., Borderline-SMOTE: a new oversampling method in imbalanced data sets learning, Lecture Notes in Computer Science 3644(5) (2005), 878–887.

6. H.B. et al., ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning, in: 2008 IEEE International Joint Conference on Neural Networks, IEEE, New York, 2008, pp. 1322–1328.

7. Wozniak, Krawczyk and Schaefer, Cost-sensitive decision tree ensembles for effective imbalanced classification, Applied Soft Computing 14(1) (2013), 554–562.

8. Galar et al., A review on ensembles for the class imbalance problem: bagging, boosting, and hybrid-based approaches, IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications and Reviews) 42(4) (2012), 463–484.

9. Seiffert et al., RUSBoost: a hybrid approach to alleviating class imbalance, IEEE Transactions on Systems Man and Cybernetics-Part A Systems and Humans 40(1) (2010), 185–197.

10. Chawla N.V. et al., SMOTEBoost: improving prediction of the minority class in boosting, in: European Conference on Principles of Data Mining and Knowledge Discovery, Springer, Berlin, 2003, pp. 107–119.

11. Rayhan et al., CUSBoost: Cluster-based Under-sampling with Boosting for Imbalanced Classification, in: Proceedings of 2017 2nd International Conference on Computational Systems and Information Technology for Sustainable Solution, IEEE, New York, 2017, pp. 70–75.

12. Gupta, Jin and Homma, Radial Basis Function Neural Networks, in: Static and Dynamic Neural Networks: From Fundamentals to Advanced Theory, IEEE Press, Piscataway, NJ, USA, 2007, pp. 223–252.

13. Kirkpatrick, Gelatt C.D. and Vecchi M.P., Optimization by simulated annealing, Science 220 (1983), 671–680.

14. Kennedy, Particle swarm optimization, in: Encyclopedia of Machine Learning, Springer, Boston, 2011, pp. 760–766.

15. Shi Y.H. and Eberhart R.C., Empirical study of particle swarm optimization, in: Proceedings of the Congress on Evolutionary Computation, IEEE Service Center, Piscataway, NJ, USA, 1999, pp. 1945–1950.

16. Cohen J., A coefficient of agreement for nominal scales, Educational & Psychological Measurement 20(1) (1960), 37–46.

17. J.Y., Fong and Zhuang, Optimizing SMOTE by Metaheuristics with Neural Network and Decision Tree, in: Proceedings of 2015 3rd International Symposium on Computational and Business Intelligence, IEEE Computer Society, Washington DC, 2015, pp. 26–32.

18. Cao, Zhao D.Z. and Osmar, Hybrid probabilistic sampling with random subspace for imbalanced data learning, Intelligent Data Analysis 18(6) (2014), 1089–1108.

19. Kong X.F. et al., Boosting weighted ELM for imbalanced learning, Neurocomputing 128 (2014), 15–21.

20. Blaszczynski and Stefanowski, Neighborhood sampling in bagging for imbalanced data, Neurocomputing 150 (2015), 529–542.

21. Zou X.G., Feng Y.P., H.Y. and Jiang S.Y., Improved oversampling techniques based on sparse representation for imbalance problem, Intelligent Data Analysis 22(5) (2018), 939–958.

22. Guo H.P., Zhou C.A., She and Xu M.L., Ensemble based on feature projection and under-sampling for imbalanced learning, Intelligent Data Analysis 22(5) (2018), 959–980.

23. Sáez J.A., Luengo, Stefanowski and Herrera, SMOTE-IPF: addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering, Information Sciences 291 (2015), 184–203.

24. Soda, A multi-objective optimization approach for class imbalance learning, Pattern Recognition 44(8) (2011), 1801–1810.

25. Hou B.L. and Liu J.J., An anti-noise ensemble algorithm for imbalance classification, Intelligent Data Analysis 23(6) (2019), 1205–1217.

26. Cano et al., ur-CAIM: improved CAIM discretization for unbalanced and balanced data, Soft Computing 20(1) (2016), 173–188.