Abstract
In supervised classification if one of the classes has fewer objects than the other, we have a class imbalance problem. One of the most common solutions to address class imbalance problems is oversampling, and SMOTE is the most referenced and well-known oversampling method. However, SMOTE creates synthetic objects in a random way, therefore it produces a different result each time it is applied, and in practice the user has to apply SMOTE several times for choosing the best of all the generated balanced datasets. For this reason, in this paper, we present SMOTE-D, a deterministic version of SMOTE, and propose new deterministic SMOTE-D-based versions of some of the most recent and successful SMOTE-based methods. In our experiments, we show that all proposed deterministic methods produce as good results as random methods but our proposals need to be applied just once. This is very important from a practical point of view since our proposals save time by avoiding multiple applications of them as SMOTE does and they provide one unique result.
Introduction
In supervised classification when one of the classes has fewer objects compared to the other class, it means we have a class imbalance problem [9, 36]. This difference in the number of objects between classes produces a bias in the classifiers towards the class with the largest number of objects (majority class), which results in poor classification performance for the class with fewer objects (minority class).
In the literature, some authors [10, 33] identify two types of solutions for the class imbalance problem: algorithmic-level solutions and data-level solutions. Algorithmic-level solutions [8, 38] change the strategies of the classifiers to impose a bias in order to improve the classification performance for the minority class. Meanwhile, data-level solutions add or remove objects to/from the classes in order to balance the dataset. Data-level solutions are divided into three types: subsampling [13, 35], oversampling [3, 29] and hybrid resampling [2, 25]. In addition, some authors [20] recognize another type of solutions denominated cost-sensitive solutions [19, 37], which take into account different cost of misclassification for each class. It is important to underline that some authors [27] argue that cost-sensitive solutions are included into algorithmic-level solutions.
In this paper, we will focus on SMOTE [4] (Synthetic Minority Over-sampling TEchnique), which is one of the most successful and referenced data-level solutions for class imbalance problems. SMOTE, in order to balance the classes, randomly creates synthetic objects between objects of the minority class. Due to its simplicity and good results, SMOTE has been modified and/or combined with other techniques; resulting in other oversampling methods (SMOTE-based methods) which perform better than SMOTE. However, since SMOTE is based on randomly creating synthetic objects, it produces a different balanced sample each time it is applied. This makes that, in the practice, SMOTE has to be applied several times in order to choose among all the balanced samples the best one. For this reason, in this paper, we present a deterministic oversampling method named SMOTE-D, which, unlike SMOTE, creates synthetic objects for the minority class in a deterministic way, producing the same balanced sample each time that it is applied. Preliminary ideas of SMOTE-D were reported in [32], but in this new paper, we further explain each one of its steps as well as we show a more complete experimental evaluation, including a comparison of SMOTE-D against the original SMOTE. Additionally, we introduce and evaluate new deterministic oversampling methods by incorporating SMOTE-D into some of the most recent and successful oversampling SMOTE-based methods. Similarly, we perform an extensive experimental comparison of SMOTE-based and SMOTE-D-based methods.
The rest of this paper is organized as follows: in section 2, we review the related work; in section 3, the proposed methods are introduced; in section 4, our experimental results are reported and discussed; Finally, in section 5, our conclusions and some directions for future work are provided.
Related work
In this section, we review SMOTE and some of the most successful and recent SMOTE-based methods proposed in the literature which can be divided into two types: SMOTE-combinations and SMOTE-modifications. SMOTE-combinations [2, 25– 27] combine SMOTE with other methods like subsampling methods, classifiers, noise filters or dimensionality reduction methods. On the other hand, SMOTE-modifications [3, 24] modify some of the steps of SMOTE, for example, how the objects used as basis for creating new synthetic objects are chosen; allowing the use of objects from the majority class for creating synthetic objects; creating synthetic objects closer or farther away from the minority class objects; among others.
In this work, we focus on SMOTE modifications, since for SMOTE combinations do not become new over-sampling methods.
SMOTE
SMOTE [4], see Algorithm 2.1, is an oversampling method in which depending upon the required amount of oversampling (it is a parameter), some synthetic objects are created and added to the minority class to balance the classes. For each object in the minority class, SMOTE randomly selects one of its nearest neighbors, also from the minority class, to create new synthetic objects between them. The new synthetic objects are created as follows: the difference between the attribute vector of an object of the minority class and the attribute vector of the randomly selected nearest neighbor is computed. After, this difference is multiplied by a random number between 0 and 1 and added to the attribute vector of the object of the minority class.
1. em Compute the amount of oversampling (n) for the objects of the minority class.
2.
3.
4. Randomly choose one of its k-nearest neighbors.
5.
6. Compute the attribute difference between the neighbor and the object. (dif [l] = neighbor [l] - object [l])
7. Synthetic new [l] = obecj [l] + dif [l] * random (0, 1).
8.
9. Add the new synthetic object to the synthetic subset
10.
11. n = n - 1
12.
13. Add the synthetic subset to the training set.
SMOTE modifications
In this subsection, we will briefly describe some of the most successful and recent modifications of SMOTE. These methods oversample the minority class by randomly creating synthetic objects following different strategies.
Borderline-SMOTE1 [12], for balancing a dataset, first computes for each object in the minority class its k-nearest neighbors in the whole training set. Then, each object of the minority class is labeled as border if the number of its k-nearest neighbors belonging to the majority class is greater than or equal to k/2 but smaller than k. Finally, Borderline-SMOTE1 creates synthetic objects as SMOTE does, but only between objects labeled as border and their k-nearest neighbors in the minority class.
Borderline-SMOTE2 [12] labels the objects of the minority class as Bordeline-SMOTE1 does. Then, for each border object in the minority class, Borderline-SMOTE2 computes its k-nearest neighbors in the minority class and its nearest neighbor in the majority class, and applies SMOTE over the border objects using these k+1 neighbors. But, when a neighbor is randomly chosen for creating a synthetic object between it and a border object, if the chosen nearest neighbor is the one belonging to the majority class, the difference between the attribute vector of the border object under consideration and the chosen neighbor is multiplied by a random number between 0 and 0.5. Otherwise the attribute difference is multiplied by a random number between 0 and 1 as in SMOTE.
Safe Level SMOTE [3], for each object of the minority class, computes a safe level (sl) value depending on the amount of objects of the minority class among its k-nearest neighbors in the whole training set. The higher the number of neighbors of the minority class, the higher the safe level value for an object. Then, Safe Level SMOTE creates synthetic objects as SMOTE does, but if the safe level of an object is 0, Safe Level SMOTE will not create synthetic objects for this object. For creating synthetic objects, when a nearest neighbor is randomly chosen, Safe Level SMOTE works as follows: if the safe level of the chosen neighbor is 0, the object of the minority class is duplicated. Otherwise, Safe Level SMOTE computes a safe level ratio (slr) by dividing the safe level of the object of the minority class under consideration and the safe level of the chosen neighbor. Later, if slr = 1, the difference between the attribute vector of the object under consideration and its neighbor (attribute difference) is multiplied by a random number between 0 and 1; if slr > 1, the attribute difference is multiplied by a random number between 0 and 1/slr, this will create a synthetic object closer to the object than to the neighbor; if slr < 1, the attribute difference is multiplied by a random number between 1 - slr and 1, this will create a synthetic object closer to the neighbor than to the object.
M-SMOTE [14] labels each object of the minority class as latent noise if all its k-nearest neighbors computed in the whole training set belong to the majority class; as security if anyone of its k-nearest neighbors belongs to the majority class; or as border if the object is neither security nor latent noise. For latent noise objects, M-SMOTE does not create synthetic objects; for security objects, M-SMOTE creates synthetic objects between them and their k-nearest neighbors in the minority class; and for each border object, M-SMOTE only creates synthetic objects between it and its nearest neighbor. For both, security and border objects, the synthetic objects are created in the same way as in SMOTE.
SMOTE-Out [15] works as SMOTE, but when creating a synthetic object for an object in the minority class, SMOTE-Out first creates a synthetic object between the object of the minority class and one of its k-nearest neighbors in the majority class (randomly chosen), but multiplying the attribute difference by a random number between -0.3 and 0, in this way this synthetic object will be closer to the object of the minority class than to its neighbor in the majority class, but in the opposite side. After, the created synthetic object is used for creating another synthetic object between it and one of the k-nearest neighbors in the minority class (also randomly chosen) of the object from which the first synthetic object was created, but multiplying the attribute difference by a random number between 0 and 0.5, by doing this the synthetic object will be closer to the first synthetic object created than to the chosen nearest neighbor of the minority class.
SMOTE-Cosine [15] creates synthetic objects in the same way as SMOTE, but combining the Euclidean and Cosine distances to get the k-nearest neighbors. First, separately for each distance SMOTE-Cosine ranks the neighbors of an object of the minority class based on its distance value. This produces two rankings. After, the neighbors are ranked according to the sum of their rank position in each ranking. Finally, based on this new ranking SMOTE-Cosine computes the k-nearest neighbors.
Selected SMOTE [15] works as SMOTE but, for creating synthetic objects, it only uses some attributes chosen by a feature selection method (any feature selection method can be used). In the created synthetic object the non-selected attributes will have the same values as the object from which they were created.
It is important to point out that SMOTE and all SMOTE-based methods described above have random steps either at choosing neighbors or at creating synthetic objects. Therefore, they produce a different result each time they are applied, and in the practice they have to be applied several times in order to chose the best one balanced sample.
Proposed Method
In this section, we first describe SMOTE-D, which is a deterministic version of SMOTE. Later, we introduce deterministic versions of some SMOTE-modifications. We only describe how to use SMOTE-D for SMOTE-modifications because in SMOTE-combinations replacing SMOTE by SMOTE-D is a trivial task.
SMOTE-D
As we mentioned in section 2, since SMOTE and SMOTE-based methods randomly create synthetic objects, these methods produce a different result each time they are applied. Therefore, these methods usually have to be applied several times for choosing the best result from all balanced datasets. In order to eliminate this randomness from SMOTE and SMOTE-based methods, we propose a deterministic version of SMOTE which we name SMOTE-D. SMOTE-D is based on three ideas: 1) deterministically computing the number of synthetic objects that will be created around each object of the minority class. 2) From the number of synthetic objects computed in (1) for each object of the minority class, deterministically computing how many synthetic objects will be created between each object of the minority class and each one of its k-nearest neighbors. 3) uniformly creating synthetic objects between each object of the minority class and each one of its k-nearest neighbors.
SMOTE-D, as SMOTE, first determines the total amount of over-sampling (n) for the minority class, but according to the difference between the number of objects of the minority class (|m|) and the majority class(|M|); unlike SMOTE which only takes into account the size of the minority class. SMOTE-D computes n as in Expression 1:
Where |M| is the number of objects in the majority class, |m| is the number of objects in the minority class, R ∈ I R is a parameter that allows computing the amount of oversampling as a portion of the difference between |M| and |m|. For example, if |m|=10, |M|=100 and R = 1 then n = round ((100 - 10) *1) =90. If we want an oversampling of 80% of the difference between the classes, we need R = 0.8, consequently, the value of n will be n = round ((100 - 10) *0.8) =72.
Given the amount of oversampling n, in order to compute the portion of objects from n that will be created around each object of the minority class, SMOTE-D takes into account the standard deviation of the distances between each object in the minority class and its k-nearest neighbors. Notice that a high standard deviation represents that there is a high difference among the distances from an object and its k-nearest neighbors, we assume this situation as a high opportunity for creating synthetic objects. Therefore, the greater the standard deviation of the distances between an object and its k-nearest neighbors, the greater the number of synthetic objects to be created around this object.
Then, for computing the portion of synthetic objects p
i
from n that will be created around each object O
i
(i = 1, . . . , |m|) in the minority class, SMOTE-D computes the standard deviation σ
i
of the distances between O
i
and its k-nearest neighbors
According to Expression 2, more synthetic objects will be created around those objects of the minority class having higher standard deviation of distances between them and their k-nearest neighbors.
For example, let O1, O2 and O3 be three objects in the minority class and suppose that the standard deviations of the distances between them and their k-nearest neighbors are σ1 = 2, σ2 = 3 and σ3 = 5, see Fig. 1. According to Expression 2 we have

Three objects of the minority class (O1, O2 and O3) with 3-nearest neighbors each one and their standard deviations of the distances (dashed circles).
Once determined the portion p
i
of synthetic objects from the total amount of oversampling, that will be created around an object O
i
of the minority class, these synthetic objects will be created between it and its k-nearest neighbors depending on their distance. The farther a nearest neighbor is, the more room for creating synthetic objects. Then, the portion
Considering the object O3 in our previous example, let suppose the distances between O3 and its neighbors

An object (O3) of the minority class with 3-nearest neighbors (
Summarizing, up to this point, given the amount of oversampling (n), we have explained how to compute the portion p
i
, from n, of synthetic objects to create around each object O
i
in the minority class; i = 1, . . . , |m|. Also, we have explained how to compute the portion
In order to avoid the random creation of synthetic objects, which could produce undesired situations such as synthetic objects too close (or too far) from O
i
and/or
In order to illustrate the uniform creation of synthetic objects, let (1.0, 0.5, - 1.0) be the feature vector of O
i
and let (5.0, 1.0, 1.0) be the feature vector of its nearest neighbor
Three synthetic objects created for O
i
= (1.0, 0.5, - 1.0) and

3 synthetic objects uniformly created between the object O3 of the minority class and one of its k-nearest neighbors
1. em Compute the amount of oversampling n for the minority class (using Expression 1).
2. Compute the portions p i from n for each object O i in the minority class (using Expression 2).
3. Compute the portions
4. Compute the number
5.
6.
7. Divide the attribute difference between the object and the nearest neighbor by the number of synthetic objects to be created between them plus one (using Expression 5).
8. Generate new synthetic objects by adding
9. Add the new synthetic objects to the synthetic subset.
10.
11.
12. Add the synthetic subset to the training set.
In this subsection, we introduce deterministic versions of the most recent and referenced SMOTE-based methods by using SMOTE-D.
In Borderline SMOTE-D1 border objects of the minority class are labeled in the same way as in Borderline SMOTE1 [12]. After, SMOTE-D is applied over the border objects in order to deterministically create synthetic objects between each one of them and its k-nearest neighbors in the minority class.
In Borderline SMOTE-D2 border objects are labeled in the same way as in Borderline SMOTE1 [12]. Then, in the same way as Borderline SMOTE2 does, Borderline SMOTE-D2 will work over each border object, its k-nearest neighbors, and its nearest neighbor in the majority class, but applying SMOTE-D. For this, first the number of synthetic objects that SMOTE-D will create around each border object and between it and each one of its k-nearest neighbors is computed, but for computing this number, the distance between a border object and its nearest neighbor in the majority class is multiplied by 0.5, in order to create more synthetic objects around neighbors from the minority class than around neighbors from the majority class. Additionally, when SMOTE-D creates a synthetic object between a border object and its nearest neighbor in the majority class, the difference of attributes is multiplied by 0.5; this preserves the idea of Borderline SMOTE2 about creating the synthetic objects closer to the border object than to the nearest neighbor in the majority class.
Safe Level SMOTE-D computes for each object in the minority class a safe level value (sl), as it is done in Safe Level SMOTE [3]. Then, the number of synthetic objects to be created around each object of the minority class and between it and each one of its k-nearest neighbors is computed in the same way as in SMOTE-D. But, if the safe level of an object of the minority class is 0, Safe Level SMOTE-D will not create synthetic objects between this object and its k-nearest neighbors. When creating synthetic objects, if the safe level of a neighbor is 0, the object of the minority class is duplicated as many times as the computed number of synthetic objects to be created between them. Otherwise, Safe Level SMOTE-D computes a safe level ratio (slr) between the object of the minority class and its neighbors; as it is done in Safe Level SMOTE. Then, before creating the synthetic objects (in a uniform way, as in SMOTE-D) between an object of the minority class and one of its k-nearest neighbors, the attribute difference is multiplied by a different value depending on the slr value as follows: if slr = 1, the attribute difference is multiplied by 1, this means, this difference is used as in SMOTE-D; if slr > 1, the attribute difference is divided by the safe level ratio, in order to create synthetic objects closer to the object of the minority class than to its nearest neighbor, as in Safe Level SMOTE; and if slr < 1, the attribute difference is multiplied by slr, besides, the synthetic objects will be created between the object of the minority class plus the attribute difference (multiplied by the safe level ratio) and its nearest neighbor, this allows creating synthetic objects closer to the nearest neighbor than to the object, just as in Safe Level SMOTE.
In M-SMOTE-D the objects of the minority class are labeled as latent noise, border or security as in M-SMOTE [14]. Then, M-SMOTE-D computes the number of synthetic objects to be created around each object of the minority class in the same way as SMOTE-D. Then, following the idea of M-SMOTE for latent noise objects, M-SMOTE-D does not create synthetic objects; for security objects, M-SMOTE-D creates synthetic objects as SMOTE-D; and for each border object, M-SMOTE-D only creates synthetic objects between the border object and its nearest neighbor in the minority class, in a uniformly way as SMOTE-D.
SMOTE-D Out computes the number of synthetic objects to be created around each object of the minority class and each one of its k-nearest neighbors as in SMOTE-D. Then, SMOTE-D Out creates a synthetic object between each object of the minority class and its nearest neighbor of the majority class (always the nearest to avoid the randomness). This synthetic object is created by multiplying the attribute difference between them by -0.3, this allows creating this synthetic object close to the object of the minority class and in the opposite side from its nearest neighbor in the majority class. Then, the created synthetic objects are used, instead of the objects of the minority class for creating the synthetic objects for the minority class, but multiplying the attribute difference by 0.5. This will create the synthetic object closer to the first created synthetic objects than to the nearest neighbors.
SMOTE-D Cosine works in the same way as SMOTE-D, but for computing the k-nearest neighbors, SMOTE-D Cosine uses as distance the sum of the Euclidean and Cosine distances. This distance is also used for computing the number of synthetic objects to be created around each object of the minority class as well as the number of synthetic objects to be created between the object of the minority class and each one of its k-nearest neighbors.
Selected SMOTE-D creates synthetic objects as in SMOTE-D, but using only the selected attributes, as it is done in Selected SMOTE [15]; leaving the remaining attributes as they are in the object of the minority class.
It is important to highlight that, in all SMOTE-D-based methods, all the random steps are eliminated; resulting in deterministic versions which produce the same outcome each time they are applied.
Experimental results
In this section, we compare the results of SMOTE-D and the SMOTE-D-based methods proposed in this paper against the results of SMOTE and SMOTE-based methods. We perform the assessment through several supervised classifiers. The experimental setup is shown in subsection 4.1. Subsection 4.2 shows a comparison between SMOTE-D and SMOTE, and subsection 4.3 shows a comparison of the proposed SMOTE-D-based methods against SMOTE-based methods.
Experimental Setup
In our experiments we used the datasets shown in Table 2 taken from the KEEL repository [1], all these datasets contain two-class problems and have a 5-fold cross validation partition. Table 2 shows the number of attributes (# attributes), the number of objects (# objects) and the IR of each dataset. The IR is computed as the ratio between the size of the majority class and the size of the minority class. In Table 2, the datasets appear in ascending order regarding their IR. In our comparisons, we use the following supervised classifiers: Decision Trees (DT), Support Vector Machines (SVM) and K-Nearest Neighbors (KNN) with K = 5, which have been commonly used in the literature. For assessing the results of oversampling methods we employ Area Under ROC Curve (AUC) and F-measure (F-M). For validating our results we apply a t-test with a confidence level of 95%.
Datasets from the KEEL repository used in our experiments
Datasets from the KEEL repository used in our experiments
All of the oversampling methods described in this paper were implemented in MATLAB, in a PC with an AMD FX-8320 Eight-Core processor running at 3.5 GHz, 8GB of RAM and 64-bit Windows7.
For Selected SMOTE-D and Selected SMOTE we use Stepwise Regression [7] as feature selection method. This is a commonly used method for dimensionality reduction.
In this subsection, we show the results of comparing of SMOTE-D against SMOTE in Table 3.
Results of SMOTE-D vs SMOTE
Results of SMOTE-D vs SMOTE
The first column (Distance) of table 3 shows the distance/metric used for finding the k-nearest neighbors, the second column (k) shows the number of k-nearest neighbors used for oversampling, the third column (Classifier) indicates the used supervised classifier, the following double columns (AUC and FM) show the results of SMOTE-D and SMOTE in terms of the metric on the top. To get the results under these columns, SMOTE-D and SMOTE were applied over each one of the 5-folds cross validation partitions of the 66 datasets shown in Table 2 and the average is reported. For SMOTE we report the average of 10 runs of SMOTE and the standard deviation. In this table we can see that both methods obtain similar classification results. Moreover, by applying a t-test with a confidence level of 95%, we did not find statistical significant differences in the results produced by both methods.
In this subsection, in Tables 4 and 5, we show the results of the comparison of the proposed SMOTE-D-based methods against SMOTE-based methods using the Euclidean and HVD metrics respectively.
Results of SMOTE-D-based methods vs SMOTE-based methods using the Euclidean distance
Results of SMOTE-D-based methods vs SMOTE-based methods using the Euclidean distance
Results of SMOTE-D-based methods vs SMOTE-based methods using HVDM
In these tables, the first column shows the used oversampling methods: Cosine (COS), Borderline (B1), Borderline2 (B2), Safe Level (SL), Out (Out), M (M) and Selected (Sel). The second column shows the employed supervised classifier. After, two double columns show the results, in terms of AUC and F-M, of the SMOTE-D-based methods and the SMOTE-based methods, respectively. To get the results under these columns, the respective oversampling method was applied over each one of the 5-folds cross validation partitions of the 66 datasets shown in Table 2, and the average is reported. In the column SMOTE, we report the average of 10 runs of the respective SMOTE-based method and the standard deviation.
After applying a t-test with a confidence level of 95%, over the results shown in Tables 4 and 5, we did not find statistical significant differences, which means that both kinds of methods produce similar results.
Additionally, we measured the time spent by SMOTE and SMOTE-D when they are applied once over the 66 datasets. After applying a t-test with a confidence level of 95%, we did not find a statistically significant difference between their runtimes. However, since SMOTE produces a different outcome each time it is applied, SMOTE must be applied several times to select from all the results the best oversampled dataset. This selection involves training a classifier each time SMOTE is applied. In practice, the time required for selecting the best oversampled partition highly increases the total time required for applying SMOTE.
This paper introduces SMOTE-D, a deterministic version of SMOTE, as well as some deterministic SMOTE-D-based oversampling methods. The comparison of SMOTE-D vs SMOTE and SMOTE-D-based methods vs SMOTE-based methods, in terms of F-Measure and AUC using different classifiers show that SMOTE-D and SMOTE-D-based methods obtain as good results as SMOTE and SMOTE-based methods, but working in a deterministic way, i.e. producing the same good result every time they are applied. Therefore, SMOTE-D and SMOTE-D-based methods do not need to be applied several times to obtain a good result. This is important from a practical point of view since our proposals save time by avoiding multiple applications of them, unlike SMOTE-based methods do, and they provide the same result each time they are executed.
In this paper, we have modified some SMOTE-based methods for getting deterministic versions of these oversampling methods by including the SMOTE-D approach. As future work, we will develop deterministic oversampling methods specifically designed for taking advantage of the SMOTE-D characteristics. Additionally, we are going to extend the proposed method for working on mixed data.
Footnotes
Acknowledgments
This work was partly supported by National Council of Science and Technology of Mexico under the scholarship grant 627301.
