A method for balancing a multi-labeled biomedical dataset

Abstract

In this paper, we propose a data balancing method for multi-label biomedical data. The method can be applied to semantic segmentation problems to balance the corresponding image data. The proposed method oversamples instances of minority classes in a way that increases their frequencies of appearance (the ratio of the number of samples containing a class to the total number of samples in the dataset), thereby reducing the class imbalance. The effectiveness of the proposed method is shown experimentally by applying it to two highly unbalanced biomedical image datasets. A convolutional neural network (CNN) was trained on several versions of those datasets: one balanced with the proposed method, another balanced with manual oversampling and an unbalanced version. The results of the experiments support the effectiveness of the proposed method, showing that it reduces the influence of class imbalance on the learning algorithm and thus improves its original classification results for most of the classes. Inherently, the proposed method does not make any assumptions about the underlying structure of the data to be balanced; therefore, it can be applied to all types of data (vectors, images, etc.) that can be described in a multi-label framework. It can also be used in conjunction with any learning algorithm that is suitable for multi-label data. To illustrate its wider applicability beyond biomedical image data, a series of experiments was conducted using seven common multi-label datasets. An experimental comparison to existing multi-label data balancing approaches is provided as well. The experimental results show that the proposed method presents a competitive alternative to existing approaches.

Keywords

Multi-label data, multi-label balancing, imbalanced data, neural networks, convolutional network, fundus, biomedical data, semantic segmentation

1. Introduction

Recently, deep convolutional models have had success in many applications related to image understanding [1, 2, 3, 4]. Apart from common industrial problems (object detection, person identification, etc.), convolutional neural networks were successfully applied to biomedical challenges, such as electroencephalogram analysis [5, 6, 7, 8, 9, 10], gait recognition [11] and biomedical image segmentation [12, 13, 14]. In this paper, we mainly consider the latter. Due to medical specificity, biomedical image segmentation comes with a number of challenges, such as small amounts of data and severe class imbalance.

Figure 1.

Class frequencies (a) of a balanced dataset; (b) of a dataset where instances of only the 4th class are less frequent; (c) of a dataset where instances of classes #3 and #4 appear more frequently than the instances of classes #1 and #2.

Class imbalance [15] is a situation in which samples of some classes appear more frequently in the dataset than samples of other classes (see Fig. 1). Class imbalance influences the classification results of many algorithms [16, 17, 18], especially machine learning algorithms [19, 20]. Due to this problem, the classification rate on samples of the rare (minority) classes is low [21, 22]. This problem is particularly relevant to biomedical challenges [23, 24], as the information about a disease is commonly contained in the samples of the less-populated classes, which are underrepresented. Underrepresentation leads to poor model performance on those classes; consequently, a disease-detecting test may give incorrect results. Another reason biomedical data are often imbalanced is that different diseases have different (often low) rates of appearance (e.g., kidney failure). As a result, images of one type (healthy subjects) are easier to collect because they are more common than the others (subjects that have a disease). Therefore, obtaining a well-balanced dataset is often not feasible.

In order to tackle the class imbalance problem, various kinds of methods have been proposed. Existing data balancing methods operate either at an algorithm level (no data modification) or at data level (resampling, modifying data). The latter are based on two common techniques [25]: undersampling and oversampling. With undersampling, data samples that belong to the predominant (major) classes are removed. With oversampling, data samples that belong to the rare (minority) classes are generated (or copied).

These techniques can be easily applied when each data sample belongs to a single class; however, data balancing becomes challenging when each data sample belongs to multiple classes. In this case, each data sample is associated with a set of labels (labelset) instead of only a single class label. Such datasets are called multi-label datasets. Applying oversampling or undersampling to such datasets is challenging, because copying or deleting any data point changes the distribution of several classes. For example, when instances of one minority class are oversampled, this particular class becomes more frequent, yet the frequencies of other minority classes may decrease, making the overall class imbalance worse. This problem is called multi-label data imbalance [26, 27].
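The side effect described above can be reproduced on a toy example. The following sketch uses hypothetical labelsets (they are not taken from any dataset used in this paper) and a helper function, `class_frequencies`, that is introduced here only for illustration.

```python
def class_frequencies(labelsets, num_classes):
    """Fraction of samples that contain each class."""
    n = len(labelsets)
    return [sum(1 for s in labelsets if c in s) / n for c in range(num_classes)]

# Hypothetical toy dataset: each sample is the set of classes it contains.
# Class 0 is the majority class; classes 1 and 2 are minority classes.
data = [{0}, {0}, {0}, {0}, {0, 1}, {0, 2}]
before = class_frequencies(data, 3)

# Naive oversampling of minority class 1: copy the only sample containing it.
oversampled = data + [{0, 1}, {0, 1}]
after = class_frequencies(oversampled, 3)

print(before)  # classes 1 and 2 are equally rare
print(after)   # class 1 became more frequent, class 2 became less frequent
```

Copying samples of one minority class enlarges the dataset, so every class that is absent from the copied samples loses frequency.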

A number of multi-label data balancing methods have been proposed to solve the multi-label class imbalance problem. However, to the best of our knowledge, none of the existing data balancing approaches, with the exception of manual balancing, can be applied to image data in the context of image segmentation. This is mainly because the existing data balancing approaches do not perform copying or deleting of samples of the dataset, but perform the generation of new data samples using heuristics, such as data interpolation, which is not applicable in the case of images.

Here, we propose a new method for multi-label data balancing in order to address multi-label data imbalance in biomedical image data. The effectiveness of the proposed method is validated on two highly imbalanced biomedical image datasets. The results of the experiment show that the proposed method improves the original classification results. Apart from image data, the proposed method has been applied to seven common multi-label datasets to illustrate its wider applicability beyond image data. A comparison to existing multi-label data balancing approaches is provided, as well.

The paper is structured as follows:

• Section 2 presents a survey of existing data balancing approaches.

• Section 3 provides a detailed description of the proposed method, presents its implementation details and gives recommendations regarding its application.

• Section 4 presents a thorough experimental study that validates the effectiveness of the proposed method.

2. Related works

This section provides an overview of existing data balancing methods and algorithms.

To overcome data imbalance, different methods and algorithms have been proposed. These can be divided into three main groups:

(I) Algorithm-level approaches, which do not modify the distribution of the data but evaluate (in terms of, for example, misclassification cost) instances from different classes differently in correspondence with the number of instances of a certain class in the data. This type of approach is the most common – examples may include:

• imbalance-sensitive score functions [28, 29, 30];
• adaptive algorithms [31];
• ensembles of classifiers [32, 33], and
• attention-based classification [34].

(II) Data-level approaches, which modify the distribution of the instances so that the minority class is adequately represented in the dataset. A more detailed analysis of approaches of this type is given below.

(III) Mixed approaches (combining both algorithm-level and data-level approaches). It should be noted that in this area of research, there is a lack of comparative studies that can provide conclusions on which approach should be used to obtain the best result in a given scenario. As some such studies show [35], hybrid strategies are, overall, more effective. One example of a mixed approach is proposed in [36], where a dynamic GAN is used to generate realistic-looking data samples of a minority class (data-level approach), thus balancing the dataset, and the parameters of the subsequently used classifier depend on the evaluations of the changes in the imbalance ratio and the samples’ distributions (algorithm-level approach).

However, due to the fact that the proposed method is a data-level approach, in this paper, we focus only on this type of approach (namely, the resampling algorithms). These approaches can be differentiated into two subgroups:

• undersampling algorithms and
• oversampling algorithms.

(i) Undersampling approaches, in essence, are based on the removal of data samples. One of the data balancing approaches of this type, Multi-Label Tomek Link (MLTL) [37], is based on Tomek link [38] undersampling. Two instances define a Tomek link if, firstly, one is the nearest neighbor of another and vice versa, and secondly, if they belong to different classes. Originally, the Tomek link undersampling algorithm addressed a single-label classification problem. MLTL is a generalization of the Tomek link algorithm for multi-labeled datasets. The MLTL approach uses IRLbl and MeanIR metrics to determine the minority classes as well as the adjusted Hamming distance to calculate differences between the labelsets. The essence of Tomek link undersampling is in removing the closest samples (in feature space) with different labelsets from the dataset.
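The Tomek link definition can be made concrete with a short sketch for the single-label case. The function name `tomek_links`, the squared Euclidean distance and the toy points below are assumptions made for illustration; the multi-label variant (MLTL) additionally relies on IRLbl, MeanIR and the adjusted Hamming distance, which are not reproduced here.

```python
def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that form Tomek links.

    Two samples form a Tomek link if each is the nearest neighbor
    of the other and their class labels differ.
    """
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def nearest(i):
        others = (j for j in range(len(points)) if j != i)
        return min(others, key=lambda j: dist2(points[i], points[j]))

    nn = [nearest(i) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]

# Hypothetical one-dimensional samples with two classes.
points = [(0.0,), (1.0,), (1.2,), (5.0,)]
labels = [0, 0, 1, 1]
links = tomek_links(points, labels)
```

In this toy data, only the two samples at 1.0 and 1.2 are mutual nearest neighbors with different labels, so they form the single link.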

(ii) The oversampling approaches, conversely, are based on the generation of new data samples. This technique was employed in [38, 39, 40, 41].

Works [39, 40] were focused on single-label datasets. In SMOTE [40], a new data sample from a minority class (minority data sample) is generated by adding a vector to a data sample of this class. The vector is obtained by multiplying a vector between the considered data sample and one of its k-nearest neighbors (in feature space) of the same class by a random number. ADASYN [39] is based on SMOTE and proposes an idea of adaptively generating data samples of the minority classes according to their distributions.
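The interpolation step of SMOTE can be sketched as follows. This is a minimal illustration rather than a reference implementation: the function name `smote_sample` is hypothetical, and the same-class nearest neighbors are assumed to have been found beforehand.

```python
import random

def smote_sample(x, neighbors, rng=random):
    """Generate one synthetic sample between x and a random neighbor.

    x         -- feature vector of a minority sample
    neighbors -- precomputed same-class nearest neighbors of x
    """
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # random number in [0, 1)
    # Move from x toward the chosen neighbor by a random fraction.
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]

# Hypothetical two-dimensional minority sample and its neighbors.
x = [1.0, 2.0]
neighbors = [[2.0, 2.0], [1.0, 4.0]]
synthetic = smote_sample(x, neighbors, random.Random(0))
```

The synthetic sample always lies on the line segment between the original sample and one of its neighbors, which presupposes feature vectors of fixed length.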

Both SMOTE and ADASYN approaches are focused on binary classification only. Both approaches can be easily applied to a single-label classification problem on structured data. These approaches are not applicable to multi-label class imbalance problems, but they are often used as a base for multi-label balancing approaches.

MLSMOTE [41] is an adaptation of the SMOTE approach for the multi-label problem. In order to select minority classes and measure the degree to which the dataset is unbalanced, the authors suggest considering all labels separately and calculating two metrics: Imbalance Ratio per Label (IRLbl) [27] and Mean Imbalance Ratio (MeanIR) [27]. The IRLbl value shows the imbalance level for a specific label. The MeanIR stands for an average level of imbalance in the dataset. In the MLSMOTE approach, a class is considered a minority class if its IRLbl is higher than MeanIR. Special attention is given to synthetic data generation, especially to synthetic labelset production.
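Both metrics can be computed directly from a binary label matrix. The sketch below assumes the usual definitions: the IRLbl of a label is the count of the most frequent label divided by the count of that label (so a larger value indicates a rarer label), and MeanIR is the average IRLbl over all labels. The function names and the toy matrix are hypothetical, and every label is assumed to occur at least once.

```python
def irlbl(label_matrix):
    """Imbalance ratio per label for a binary label matrix (rows = samples)."""
    counts = [sum(column) for column in zip(*label_matrix)]
    most_frequent = max(counts)
    return [most_frequent / c for c in counts]

def mean_ir(label_matrix):
    """Average of the per-label imbalance ratios."""
    ratios = irlbl(label_matrix)
    return sum(ratios) / len(ratios)

# Hypothetical label matrix with three labels and six samples.
Y = [
    (1, 0, 0),
    (0, 1, 0),
    (1, 1, 0),
    (1, 1, 0),
    (0, 0, 1),
    (0, 0, 1),
]
ratios = irlbl(Y)
average = mean_ir(Y)
```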

A further development of this approach was proposed by the same authors in [42]. In this paper, the authors consider the concurrence between imbalanced labels, proposing special metrics, namely SCUMBLE (Score of ConcUrrence among iMBalanced LabEls) and SCUMBLELbl, in order to assess this concurrence along with the resampling algorithm, called REMEDIAL (REsampling MultilabEl datasets by Decoupling highly ImbAlanced Labels).

In summary, it should be noted that, on the one hand, approaches based on the undersampling technique are unsuitable for small datasets, since the removal of any data sample may lead to overfitting. This is especially the case for biomedical datasets, which commonly consist of very few samples. On the other hand, the aforementioned oversampling approaches are not applicable to images. MLSMOTE, ADASYN and REMEDIAL use interpolation between data samples to generate new data, which implies that the number of features is always the same and that the features are independent of each other. Such assumptions, however, do not hold for images. Firstly, images may differ in size, and secondly, images exhibit a spatial structure, meaning that pixels are dependent on their neighbors, which goes against the assumption of feature independence. As a result, interpolation in the case of images would lead to degenerate data samples.

The method proposed in this paper neither focuses on the feature space for new data generation nor uses undersampling. It performs balancing on a data level via oversampling in a way that increases the frequencies of minority classes. In this context, the frequency of a class means the percentage of data samples containing that class. This method allows us to simultaneously increase the frequencies of minority classes, which is especially challenging in the case of multi-label datasets, the reason being that during manual balancing, increasing the frequency of one class may lead to a decrease in the frequency of another class. This makes the proposed method superior in cases where only a few data samples are available, where relying on the feature space is impossible (as in the case of images) or where manual oversampling is difficult (which is the case when there are many classes in the data).

3. Proposed method

In this section, we provide a detailed description of the proposed method and its implementation details, and give a thorough explanation of how to apply it.

In the multi-label setting, data imbalance is expressed as some of the classes being significantly more frequent than the others. An ideal case would be when all labels have frequencies close to 1. The proposed method balances the data by oversampling the minority classes (duplicating data samples containing those classes), making it as close to the ideal case as possible given the constraint on the number of duplicates that can be generated.

It should be noted that the described ideal case (all frequencies close to 1) is true in the context of biomedical image segmentation, where it is desirable to have each class present in every image. This, however, may not be desirable in the context of multi-label classification, as individual labelsets present a form of compositional classes. In this case, making all classes’ frequencies close to one may harm data variety, making certain labelsets abundant compared to others. In this case, other approaches of data balancing may be preferable. Nevertheless, as the experimental results show, the proposed method is able to improve original classification results even in the context of multi-label classification.

The data balancing process, i.e., oversampling, is formulated as an optimization problem. The formulated problem is then solved using numerical methods.

3.1 Detailed description

We make a loose assumption that increasing the frequencies of minority classes will make them better represented in the dataset, hence improving the accuracy of learning algorithms on those classes. To compute frequencies of all classes, we construct a binary representation ${{\lambda}}_{{i}}$ for each data sample in the dataset. ${{\lambda}}_{{i}}$ is a C-dimensional binary vector representing the $i$ -th data sample, where C is the number of classes. The $j$ -th component of ${{\lambda}}_{{i}}$ equals 1 if the $i$ -th sample contains $j$ -th class; otherwise, it equals 0.

Figure 2.

A simple binary representation example. Here, $C=3$, $\Omega=\{\lambda_{1},\lambda_{2},\lambda_{3},\lambda_{4},\lambda_{5},\lambda_{6}\}$, $|\Omega|=6$, $\pi=(0.5,0.5,0.33)$, $\hat{\lambda}_{1}=(1,0,0)$, $\hat{\lambda}_{2}=(0,1,0)$, $\hat{\lambda}_{3}=(1,1,0)$, $\hat{\lambda}_{4}=(0,0,1)$, $\Omega_{u}=\{\hat{\lambda}_{1},\hat{\lambda}_{2},\hat{\lambda}_{3},\hat{\lambda}_{4}\}$, $\alpha=(1,1,2,2)$.

We then construct a set ${\Omega}$ , which contains ${{\lambda}}_{{i}}$ for each data sample in the dataset. A simple example of this representation is given in Fig. 2. The frequencies of all the classes in the dataset can be computed as:

$\displaystyle\pi=\frac{\sum^{|{\Omega}|}_{i=1}{{{\lambda}}_{{i}}}}{|{\Omega}|},$ (1)

where $|{\Omega}|$ is the cardinality of the ${\Omega}$ set and $\pi$ is a vector, the $j$ -th position of which equals the frequency of the $j$ -th class in the dataset.
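Equation (1) amounts to a column-wise mean of the binary vectors. The sketch below evaluates it on the example of Fig. 2; the order of the six vectors in `omega` is an assumption, since only their multiplicities are given in the figure caption, and the function name is hypothetical.

```python
def class_frequencies(omega):
    """Eq. (1): element-wise mean of the binary vectors in omega."""
    n = len(omega)
    return [sum(column) / n for column in zip(*omega)]

# Binary representation of the six samples from Fig. 2 (assumed order).
omega = [
    (1, 0, 0),
    (0, 1, 0),
    (1, 1, 0),
    (1, 1, 0),
    (0, 0, 1),
    (0, 0, 1),
]
pi = class_frequencies(omega)
print([round(p, 2) for p in pi])
```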

Increasing the frequency of the $j$ -th class can be executed by adding ${{\lambda}}_{{i}}$ vectors with $j$ -th entry equal to 1 to the ${\Omega}$ set. The problem can be simplified by using only unique ${{\lambda}}_{{i}}$ vectors and multiplying them by a scalar representing the number of their copies in the ${\Omega}$ set. We define a set of unique binary vectors ${\hat{\lambda}}_{{i}}$ taken from ${\Omega}$ and denote it as ${{\Omega}}_{u}$ ( ${\hat{\lambda}}_{{i}}\in{{\Omega}}_{u}\subseteq{\Omega}$ ). Equation (1) can then be rewritten in the following form:

$\displaystyle\pi=\frac{\sum^{|\Omega_{u}|}_{i=1}{\alpha_{i}{\hat{\lambda}}_{{i}}}}{\sum^{|\Omega_{u}|}_{i=1}{\alpha_{i}}},$ (2)

where ${\hat{\lambda}}_{{i}}\in{{\Omega}}_{u}$ and $\alpha_{i}$ is the number of copies of the ${\hat{\lambda}}_{{i}}$ vector that are present in the ${\Omega}$ set.
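The passage from the full set to the unique vectors and their multiplicities can be sketched with a counter. The function names are hypothetical and the data are again the Fig. 2 example in an assumed order; the resulting frequencies coincide with those obtained from the full set.

```python
from collections import Counter

def unique_with_counts(omega):
    """Return the unique binary vectors and the number of copies of each."""
    counts = Counter(omega)
    omega_u = list(counts)  # unique vectors in order of first appearance
    alpha = [counts[v] for v in omega_u]
    return omega_u, alpha

def frequencies_from_counts(omega_u, alpha):
    """Eq. (2): class frequencies from unique vectors weighted by alpha."""
    total = sum(alpha)
    num_classes = len(omega_u[0])
    return [sum(a * v[j] for v, a in zip(omega_u, alpha)) / total
            for j in range(num_classes)]

omega = [(1, 0, 0), (0, 1, 0), (1, 1, 0), (1, 1, 0), (0, 0, 1), (0, 0, 1)]
omega_u, alpha = unique_with_counts(omega)
pi = frequencies_from_counts(omega_u, alpha)

# Doubling the copies of the last vector raises the third class frequency
# but lowers the first two, because the total number of samples grows.
pi_after = frequencies_from_counts(omega_u, [1, 1, 2, 4])
```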

With such a formulation, oversampling can be performed by increasing $\alpha_{i}$ coefficients for vectors with entries for minority classes equal to 1. However, we seek a way to perform such oversampling automatically. We suggest an approach based on a probabilistic perspective of the problem that allows us to automatically perform balancing through coordinated oversampling.

We omit the fact that there is only a finite set ${{\Omega}}_{u}$ of ${\hat{\lambda}}_{{i}}$ vectors and assume that ${\hat{\lambda}}_{{i}}$ is a C-dimensional random vector with independent entries. More precisely, the $j$-th entry of a ${\hat{\lambda}}_{{i}}$ vector is treated as a Bernoulli random variable $x_{j}$ with empirical probabilities $p_{j}=\pi_{j}$ and $q_{j}=1-\pi_{j}$. The probability of all the classes appearing in a data sample, i.e., all entries of a ${\hat{\lambda}}_{{i}}$ vector being equal to 1, can be calculated as:

$\displaystyle p_{\textit{all}}=p(x_{1}=1,\dots,x_{C}=1)=\prod^{C}_{j=1}{p_{j}}=\prod^{C}_{j=1}{\pi_{j}}$ (3)

We strive for the ideal case of $p_{\textit{all}}=1$, where all the classes are present in a data sample on average. Oversampling is then formulated as finding the $\alpha$ that maximizes $p_{\textit{all}}$. To find the optimal $\alpha$, we perform maximum likelihood estimation by minimizing the average negative log-likelihood (entropy):

$\displaystyle E(\alpha)=-\frac{1}{C}\ln(p_{\textit{all}})=-\frac{1}{C}\sum^{C}_{j=1}\ln(\pi_{j}).$ (4)

However, such a problem is ill-posed for two reasons:

(i) The range of $\alpha_{i}$ values is not constrained, meaning that a degenerate solution with negative $\alpha_{i}$ can be found during numerical optimization. Such a solution is degenerate because $\alpha_{i}$ is the number of data samples that can be represented by a ${\hat{\lambda}}_{{i}}$ vector, which by definition can only be positive.

(ii) In some cases, maximization of $p_{\textit{all}}$ may lead to extremely large $\alpha_{i}$ values, which, intuitively, is not a reasonable solution. If the dataset contains a sample that can be described by a vector that consists of ones, the optimal solution can be easily found by assigning a large enough $\alpha_{i}$ to this particular vector. Since $\alpha_{i}$ has the meaning of the number of data samples that can be represented by the ${\hat{\lambda}}_{{i}}$ vector, we wish to avoid the creation of an unreasonably large number of duplicates to preserve data variety as much as possible.
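The entropy of Eq. (4) and the unbounded growth of $\alpha_{i}$ can be illustrated numerically. The sketch below is only an illustration: the function name `entropy` is hypothetical, the first dataset is the Fig. 2 example, and the second dataset (which contains an all-ones vector) is invented for this purpose. All class frequencies are assumed to be non-zero, since the logarithm is otherwise undefined.

```python
import math

def entropy(omega_u, alpha):
    """Average negative log-likelihood E(alpha) from Eq. (4)."""
    total = sum(alpha)
    num_classes = len(omega_u[0])
    pi = [sum(a * v[j] for a, v in zip(alpha, omega_u)) / total
          for j in range(num_classes)]
    return -sum(math.log(p) for p in pi) / num_classes

# Fig. 2 example: adding copies of the last two vectors lowers the entropy.
omega_u = [(1, 0, 0), (0, 1, 0), (1, 1, 0), (0, 0, 1)]
e_init = entropy(omega_u, [1, 1, 2, 2])
e_more = entropy(omega_u, [1, 1, 4, 4])

# Hypothetical dataset with an all-ones vector: the entropy can be pushed
# arbitrarily close to zero by duplicating that single vector.
omega_ones = [(1, 0, 0), (0, 1, 0), (1, 1, 1)]
e_small = entropy(omega_ones, [5, 5, 1])
e_huge = entropy(omega_ones, [5, 5, 10_000])
```

In the second case, the dataset would consist almost entirely of copies of one sample, which is the loss of variety that the regularization is meant to prevent.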

In order to solve the first issue, we constrain $\alpha_{i}$ values via the following reparameterization:

$\displaystyle\alpha_{i}=\alpha^{\textit{init}}_{i}\sigma_{i}=\alpha^{\textit{init}}_{i}e^{\beta_{i}}$ (5)

This way, we ensure that $\alpha_{i}$ is always positive and thus solve the initial problem for $\beta$ (vector of $\beta_{i}$ values).

In order to solve the second issue, i.e., constrain the number of generated copies, a second constraint on the $\alpha_{i}$ values is added in the form of the following regularization function:

$\displaystyle R=\sum^{N}_{i}\left(\frac{\sigma_{i}}{\min_{k}\sigma_{k}}-1\right)^{2}w_{i},$ (6)

$\displaystyle w_{i}=\frac{\alpha^{\textit{init}}_{i}}{\sum_{k}{\alpha^{\textit{init}}_{k}}}.$ (7)

This regularization function forces the optimal solution to have fewer copies.

The final objective function to be minimized is formulated as:

$\displaystyle L=E+\delta R,$ (8)

where ${E}$ is the entropy function from Eq. (4), ${R}$ is the regularization function from Eq. (6) and $\delta$ is a hyperparameter. It should be noted that if $\beta$ is a solution, then adding a constant vector ${a}$ (all entry values are the same) to it yields another solution, meaning that there is an infinite number of solutions to the formulated problem that correspond to the same class distribution. This issue can be solved by adding an appropriate regularization function. However, adding one more constraint on the space of possible solutions makes the overall approach more difficult to work with. Instead, we use the solution that the numerical method has converged to and standardize it in a separate step that mitigates the issue.
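The full objective can be assembled from Eqs. (4)–(8) in a few lines. The following is a plain-Python sketch for illustration and not the implementation referred to in Section 3.2; the function name `objective`, the value of $\delta$ and the $\beta$ values are hypothetical. It also checks the invariance to adding a constant to all $\beta_{i}$.

```python
import math

def objective(beta, alpha_init, omega_u, delta):
    """L = E + delta * R, following Eqs. (4)-(8)."""
    sigma = [math.exp(b) for b in beta]                   # Eq. (5)
    alpha = [a0 * s for a0, s in zip(alpha_init, sigma)]
    total = sum(alpha)
    num_classes = len(omega_u[0])
    pi = [sum(a * v[j] for a, v in zip(alpha, omega_u)) / total
          for j in range(num_classes)]
    E = -sum(math.log(p) for p in pi) / num_classes       # Eq. (4)
    s_min = min(sigma)
    w_total = sum(alpha_init)
    R = sum((s / s_min - 1.0) ** 2 * (a0 / w_total)       # Eqs. (6)-(7)
            for s, a0 in zip(sigma, alpha_init))
    return E + delta * R                                  # Eq. (8)

omega_u = [(1, 0, 0), (0, 1, 0), (1, 1, 0), (0, 0, 1)]
alpha_init = [1, 1, 2, 2]
delta = 0.1  # hypothetical value

beta = [0.1, -0.2, 0.5, 0.3]
shifted = [b + 0.7 for b in beta]  # add the same constant to every entry

loss = objective(beta, alpha_init, omega_u, delta)
loss_shifted = objective(shifted, alpha_init, omega_u, delta)
loss_at_zero = objective([0.0] * 4, alpha_init, omega_u, delta)
```

Both $E$ and $R$ depend only on ratios of the $\sigma_{i}$ values, so `loss` and `loss_shifted` agree up to floating-point error. At $\beta=0$, the regularization term vanishes and the objective reduces to the entropy of the initial dataset.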

Once the optimal solution $\beta$ is found, it is transformed into the corresponding scaling coefficients $\sigma_{i}=e^{\beta_{i}}$ . It might be the case that some of the $\sigma_{i}$ coefficients are less than 1. This means that some of the data samples that correspond to the ${\hat{\lambda}}_{{i}}$ vector are removed. However, we wish to change class frequencies only by adding (duplicating) data samples. In order to do so, the $\sigma_{i}$ coefficients are standardized by dividing each one by the smallest $\sigma_{i}$ . There also might be a case when all $\sigma_{i}$ are larger than 1. This means that redundant data samples were added. The aforementioned standardization allows us to mitigate this redundancy, as well. After the standardization, new $\alpha_{i}$ are obtained using Eq. (5). The obtained $\alpha_{i}$ are then rounded to make them discrete.
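The standardization and rounding step can be sketched as follows. The function name and the $\beta$ values are hypothetical; note also that Python's built-in `round` rounds exact halves to the nearest even integer, which is one of several possible rounding conventions.

```python
import math

def standardize_and_round(beta, alpha_init):
    """Turn an optimizer output beta into integer sample counts."""
    sigma = [math.exp(b) for b in beta]
    s_min = min(sigma)
    # After division, the smallest coefficient equals 1 and all others
    # are at least 1, so no sample is removed.
    sigma = [s / s_min for s in sigma]
    return [round(a0 * s) for a0, s in zip(alpha_init, sigma)]  # Eq. (5)

alpha_init = [1, 1, 2, 2]
beta = [-0.5, 0.0, 0.3, 0.9]  # hypothetical optimizer output
alpha = standardize_and_round(beta, alpha_init)
print(alpha)
```

Because every standardized coefficient is at least 1, each resulting count is at least as large as its initial value.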

The obtained $\alpha_{i}$ must be used to add the corresponding number of duplicates to the dataset to adjust the class frequencies. Depending on the training procedure, one may augment the added duplicates. In our experiments, we simply copied data samples due to the specifics of our data augmentation pipeline.

3.2 Implementation details

Finding the optimal solution analytically is not feasible because the formulated optimization problem is highly non-linear. Therefore, we employ numerical methods, such as the Newton method, to find the optimal $\beta$. The Newton method uses information about the gradients and the Hessian of the function to perform optimization. However, the $\alpha_{i}$ parameters are discrete by definition, which precludes computation of the gradients. Therefore, during the optimization process, we extend the domain of $\alpha_{i}$ from $\mathbb{N}$ to $\mathbb{R}$, i.e., $\alpha_{i}$ are treated as continuous variables. Once the method has converged, the obtained $\alpha_{i}$ are rounded to make their values discrete again.

A naive implementation of the Newton method turned out to be numerically unstable, as the Hessian of the objective function is often degenerate. To mitigate this issue, an identity matrix multiplied by a positive scalar is added to the Hessian to ensure that the result is positive definite. To make the optimization process less computationally intensive, the Hessian is cached and recomputed only every eighth iteration. In all of our experiments, we perform optimization for 512 iterations, and the identity matrix is multiplied by 2.
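The effect of the damping term can be seen on a toy function whose Hessian is singular. The sketch below is a plain-Python illustration and not the implementation used in the experiments: the function names, the Gaussian-elimination solver and the toy objective $f(x,y)=(x+y-1)^{2}$ are assumptions, while the damping factor, the number of iterations and the refresh interval follow the values stated above.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def damped_newton(grad, hess, x0, damping=2.0, iters=512, refresh=8):
    """Newton iterations with H + damping * I and a cached Hessian."""
    x = list(x0)
    n = len(x)
    H = None
    for it in range(iters):
        if it % refresh == 0:  # recompute the Hessian every `refresh` steps
            raw = hess(x)
            H = [[raw[i][j] + (damping if i == j else 0.0) for j in range(n)]
                 for i in range(n)]
        g = grad(x)
        step = solve(H, [-gi for gi in g])
        x = [xi + si for xi, si in zip(x, step)]
    return x

# Toy objective f(x, y) = (x + y - 1)^2 with a singular Hessian.
grad = lambda x: [2.0 * (x[0] + x[1] - 1.0)] * 2
hess = lambda x: [[2.0, 2.0], [2.0, 2.0]]
x_opt = damped_newton(grad, hess, [0.0, 0.0])
```

Without the damping term, the linear system in the first iteration would be singular; with it, the iterates converge to a point on the line $x+y=1$, where the toy function attains its minimum.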

We compared optimization methods available in the SciPy [43] library to our implementation of the Newton method. Our study has shown that our solution has better convergence.

Gradients and the Hessian with respect to the $\beta_{i}$ parameters are computed using the PyTorch [44] library, which allows for automatic differentiation. The implementation of the balancing method, including the custom Newton method, is available on GitHub.1

3.3 Applying the proposed method

When applying the proposed method, one first needs to map the dataset into its binary representation. Then, unique binary vectors ${\hat{\lambda}}_{{i}}\in{{\Omega}}_{u}$ must be identified, as well as the number of their copies $\alpha_{i}$ . The computed vector $\alpha$ is then used as the initial solution $\alpha^{\textit{init}}$ of the problem and is also used to compute the value of the regularization function $R$ . The $\beta_{i}$ values are all set to zero.

Once preliminaries are finished, one must find an optimal value for the $\delta$ parameter. The $\delta$ parameter controls the number of duplicates generated in the resulting balanced dataset. The lower the $\delta$ parameter value, the larger the number of duplicates. Intuitively, it is desirable to increase the frequencies of the minority classes. However, it is not clear how much one should increase the frequencies of the minority classes. In practice, obtaining high values for the frequencies of minority classes leads to the generation of a large number of duplicates, affecting the dataset variability, which may cause overfitting in the learning algorithm. Instead of a plain increase in frequencies, we constrain the relative number of duplicates that can be generated and perform balancing within that constraint. We refer to this constraint as oversampling budget (OB). The oversampling budget can be computed as:

$\displaystyle OB=\sum^{N}_{i}\left(\frac{\alpha_{i}}{\alpha^{\textit{init}}_{i}}-1\right)w_{i}.$ (9)

If OB equals 1, the number of generated duplicates equals the number of data samples in the initial dataset. In this case, the resulting balanced dataset will be twice as large as the initial dataset.
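Substituting the weights from Eq. (7) into Eq. (9) shows that OB equals the number of added duplicates divided by the initial dataset size. The sketch below checks this on hypothetical counts; the function name is an assumption.

```python
def oversampling_budget(alpha, alpha_init):
    """Eq. (9) with the weights w_i from Eq. (7)."""
    total_init = sum(alpha_init)
    return sum((a / a0 - 1.0) * (a0 / total_init)
               for a, a0 in zip(alpha, alpha_init))

alpha_init = [1, 1, 2, 2]   # 6 samples in the initial dataset
alpha = [1, 2, 4, 5]        # hypothetical balanced counts, 12 samples

ob = oversampling_budget(alpha, alpha_init)
relative_growth = (sum(alpha) - sum(alpha_init)) / sum(alpha_init)
```

Here six duplicates are added to six initial samples, so both `ob` and `relative_growth` equal 1 up to floating-point error, and the balanced dataset is twice as large as the initial one.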

The optimal OB value depends on the specifics of the dataset and the learning algorithm one uses. We recommend generating several versions of the balanced dataset with OB set to 1, 2 and 3. By comparing which option results in the best performance of the learning algorithm, one can find the optimal OB value. The number of generated duplicates is a monotonic function of $\delta$ : as the $\delta$ value decreases, the number of duplicates increases. Therefore, the optimal $\delta$ value for a certain OB can be found using binary search.
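The search for $\delta$ can be sketched as a bisection over a monotonically decreasing function. In the sketch below, `ob_of_delta` stands for a routine that runs the optimization for a given $\delta$ and returns the resulting OB; here it is replaced by a hypothetical stand-in. The search bounds, the tolerance and the use of a geometric midpoint are assumptions. Since the $\alpha_{i}$ values are rounded, the achievable OB values are discrete in practice, which is why the target is matched only up to a tolerance.

```python
import math

def find_delta(ob_of_delta, target_ob, lo=1e-6, hi=1e3, tol=1e-3, max_iter=100):
    """Bisection on delta, assuming OB decreases as delta grows."""
    for _ in range(max_iter):
        mid = math.sqrt(lo * hi)  # geometric midpoint, as delta spans decades
        ob = ob_of_delta(mid)
        if abs(ob - target_ob) <= tol:
            return mid
        if ob > target_ob:
            lo = mid  # too many duplicates: regularize more strongly
        else:
            hi = mid  # too few duplicates: regularize less strongly
    return math.sqrt(lo * hi)

# Hypothetical stand-in for "optimize with this delta and measure OB".
stand_in_ob = lambda delta: 1.0 / delta

delta = find_delta(stand_in_ob, target_ob=2.0)
```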

Once OB is set and the corresponding $\delta$ is found, one must perform optimization and obtain $\alpha_{i}$ values. The following frequencies of ${\hat{\lambda}}_{{i}}$ must be computed using the obtained $\alpha_{i}$ :

$\displaystyle{\hat{\pi}}_{i}=\frac{\alpha_{i}}{\sum^{|\Omega_{u}|}_{j=1}{\alpha_{j}}}.$ (10)

The computed ${\hat{\pi}}_{i}$ is the probability of selecting a data sample whose binary representation $\lambda_{{j}}$ equals ${\hat{\lambda}}_{{i}}$. In our implementation, ${\hat{\pi}}_{i}$ is interpreted as the probability of choosing the index $i$ of the corresponding ${\hat{\lambda}}_{{i}}$. Our implementation of this sampling procedure is as follows:

• Group data samples by ${\hat{\lambda}}_{{i}}$.

• Choose an index $i$ of a unique binary vector ${\hat{\lambda}}_{{i}}$ in accordance with the distribution $\hat{\pi}$.

• Choose randomly (using a uniform distribution) a data sample that corresponds to the unique binary vector ${\hat{\lambda}}_{{i}}$ with the sampled index $i$.

To ensure that all the data samples are being used, we choose them in a loop instead of sampling uniformly.
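The sampling procedure can be sketched as follows. This is an illustration under stated assumptions rather than the published implementation: the function name `make_sampler`, the dictionary layout of `alpha` and the toy data are hypothetical. Group members are visited in a loop, as described above.

```python
import random
from collections import defaultdict
from itertools import cycle

def make_sampler(labelsets, alpha, seed=0):
    """Build a function that returns the index of the next training sample.

    labelsets -- binary tuple for every sample in the dataset
    alpha     -- dict mapping each unique tuple to its balanced count
    """
    groups = defaultdict(list)
    for index, lam in enumerate(labelsets):
        groups[lam].append(index)          # group samples by binary vector
    keys = list(groups)
    weights = [alpha[k] for k in keys]     # proportional to Eq. (10)
    loops = {k: cycle(groups[k]) for k in keys}
    rng = random.Random(seed)

    def sample():
        key = rng.choices(keys, weights=weights, k=1)[0]
        return next(loops[key])            # loop over the group members
    return sample

# Fig. 2 example (assumed order) with hypothetical balanced counts.
labelsets = [(1, 0, 0), (0, 1, 0), (1, 1, 0), (1, 1, 0), (0, 0, 1), (0, 0, 1)]
alpha = {(1, 0, 0): 1, (0, 1, 0): 1, (1, 1, 0): 2, (0, 0, 1): 4}

sample = make_sampler(labelsets, alpha)
draws = [sample() for _ in range(8000)]
share = sum(1 for i in draws if labelsets[i] == (0, 0, 1)) / len(draws)
```

With these counts, the vector $(0,0,1)$ has a selection probability of $4/8$, so about half of the draws come from its two samples, although no explicit copies are stored.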

With such an implementation, it may not seem obvious when actual oversampling happens, since no explicit copies are created. In this case, oversampling is carried out during training of the learning algorithm by choosing some of the data samples more often than others in accordance with the distribution $\hat{\pi}$.

One may use the obtained $\alpha_{i}$ values to explicitly create copies of data samples that correspond to the ${\hat{\lambda}}_{{i}}$ vector. However, this data preparation process can sometimes be tedious and time consuming. We experimented with both versions of the data sampling process, and they gave similar results. Therefore, we recommend our implementation of the data sampling process because of its simplicity.

3.4 Comparison to the gradient descent-based method

In our previous work [45], a similar method was presented. Though the general framework remains the same, the new method is much better developed in several regards:

• The new method is much better theoretically grounded. The objective function is derived using probabilistic principles, whereas the old method uses the L2-norm without any theoretical backing for such a choice.
• The old method gave no clear instruction on how to perform data balancing. In contrast, in this paper, we detail the way a dataset must be balanced via the introduction of the oversampling budget (OB) concept, which has an intuitive interpretation as a relative number of samples.
• The new method presents a much better parametrization of its parameters, whereas the parametrization in the old method may lead to a degenerate solution with negative $\alpha_{i}$. Due to this, in the case of the old method, special care must be taken when setting a suitable initial $\alpha^{{\textit{init}}}$ and choosing an appropriate number of iterations for the optimization algorithm to avoid negative $\alpha_{i}$. In contrast, the new method reparametrizes $\alpha_{i}$ through an exponent, which alleviates this issue.
• The old method had many more hyperparameters, such as target class frequencies, regularization scale and gradient descent optimizer, as well as more optimization hyperparameters, such as the number of iterations and the step size. In contrast, the new method requires one to specify only the OB value, as the rest is already specified (optimizer) or automated (finding the optimal regularization scale). The OB value has a natural interpretation as the relative number of duplicates. It is easy to find an optimal value for it, as the range of values is not large and concrete recommendations are given regarding the initial choice of OB.

4. Experiments

In this section, a thorough experimental study is conducted in order to assess the effectiveness of the proposed method. It consists of three subsections: in the first subsection, the proposed method is applied to seven common multi-label datasets, and a direct comparison to existing data balancing approaches is provided; in subsections two and three, the proposed method is applied to two highly unbalanced biomedical image datasets which other approaches cannot be applied to. The results indicate that the proposed method is able to improve the original classification results and presents a competitive alternative to existing data balancing approaches.

4.1 Comparison to existing approaches

In this subsection, a series of experiments is conducted to compare the proposed method to existing data balancing approaches. We compare our method to MLSMOTE [41], MLTL [37] and REMEDIAL [42]. An open-source implementation of the listed data balancing approaches, the Mulan library [47], was used during the experiments.

All of the approaches, including the proposed one, were applied to seven multi-label datasets that are publicly available [47]: corel5k, emotions, flags, scene, bibtex, medical and yeast. Table 1 shows key characteristics of these datasets.

Table 1
Description of the datasets

Dataset name	Number of data samples	Number of classes	Number of features
Corel5k	5000	374	499
Emotions	593	6	72
Flags	194	7	19
Scene	2407	6	294
Bibtex	7395	159	1836
Medical	978	45	1449
Yeast	2417	14	103

Table 2

Class frequencies for the flags dataset

Class	w $\backslash$ o	MLS	MLTL	REM	Our (1)	Our (2)
#1	0.81	0.67	0.82	0.60	0.91	0.96
#2	0.50	0.52	0.49	0.37	0.70	0.89
#3	0.51	0.59	0.52	0.38	0.66	0.89
#4	0.47	0.44	0.48	0.35	0.73	0.89
#5	0.74	0.75	0.75	0.55	0.79	0.94
#6	0.24	0.35	0.24	0.18	0.53	0.77
#7	0.16	0.25	0.16	0.12	0.54	0.81

4.1.1 Balancing the datasets

In order to balance the data using other approaches, we use the default parameters provided in the Mulan library for the corresponding algorithms.

In the case of the proposed method, we created two balanced versions of the initial datasets: the first one with OB $=$ 1 and the second one with OB $=$ 2.

After balanced versions of the datasets had been created, we computed class frequencies for some of them to compare them to the original class frequencies. Tables 2–4 contain class frequencies for flags, emotions and scene datasets, respectively. Class frequencies obtained with our method are denoted as Our (1) and Our (2) for OB $=$ 1 and OB $=$ 2, respectively. MLSMOTE and REMEDIAL algorithms are referred to as MLS and REM, respectively, in the tables. We do not provide class frequencies for the other datasets since they have too many classes (up to 374 classes). Nevertheless, the given tables do provide a general picture regarding the influence of the algorithms on class frequencies.
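As a minimal sketch of how such class frequencies can be computed, assume that the labels of each sample are stored as a set of class indices. The function name and the toy data below are illustrative assumptions and are not taken from the project code.

```python
from collections import Counter


def class_frequencies(label_sets, num_classes):
    """Fraction of samples that contain each class (classes are 0-indexed here)."""
    counts = Counter()
    for labels in label_sets:
        counts.update(set(labels))  # count each class at most once per sample
    n = len(label_sets)
    return [counts[c] / n for c in range(num_classes)]


# Toy multi-label dataset (assumption): four samples, three classes.
toy_labels = [{0, 1}, {0}, {0, 2}, {1}]
print(class_frequencies(toy_labels, num_classes=3))  # [0.75, 0.5, 0.25]
```

Applying the same function before and after balancing gives the kind of comparison reported in the tables.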

Table 3
Class frequencies for the emotions dataset

Class	w $\backslash$ o	MLS	MLTL	REM	Our (1)	Our (2)
#1	0.30	0.44	0.30	0.22	0.36	0.45
#2	0.27	0.34	0.27	0.20	0.36	0.41
#3	0.43	0.36	0.43	0.31	0.54	0.49
#4	0.23	0.31	0.23	0.17	0.40	0.38
#5	0.24	0.20	0.24	0.18	0.38	0.36
#6	0.34	0.33	0.33	0.25	0.31	0.39

Table 4

Class frequencies for the scene dataset

Class	w $\backslash$ o	MLS	MLTL	REM	Our (1)	Our (2)
#1	0.19	0.15	0.18	0.18	0.29	0.29
#2	0.14	0.11	0.14	0.13	0.21	0.24
#3	0.16	0.22	0.16	0.16	0.23	0.23
#4	0.16	0.13	0.16	0.15	0.23	0.32
#5	0.23	0.19	0.23	0.22	0.26	0.25
#6	0.18	0.25	0.18	0.18	0.25	0.24

It should be noted that some of the datasets, such as emotions and scene, do not seem to have a pronounced imbalance. Still, those datasets are often used in the related literature, so we include them in our experiments to provide a more detailed comparison to other methods.

4.1.2 Learning algorithms

Two learning algorithms were trained on the datasets: a decision tree and a multilayer perceptron. We used the implementations provided by the scikit-learn library [50]. During the experiments, the default values of the hyperparameters were used.

4.1.3 Results

In this section, we present the results obtained after training a decision tree and a multilayer perceptron on both balanced and unbalanced versions of the datasets. The results were averaged across three runs of training the learning algorithms.

Table 5.1
Precision

	Multilayer perceptron						Decision tree
Dataset	W $\backslash$ o	MLS	MLTL	REM	Our (1)	Our (2)	W $\backslash$ o	MLS	MLTL	REM	Our (1)	Our (2)
Corel5k	0.263	0.254	0.273	0.267	0.233	0.198	0.152	0.144	0.153	0.137	0.156	0.159
Emotions	0.702	0.658	0.685	0.664	0.648	0.635	0.540	0.406	0.543	0.651	0.546	0.592
Flags	0.601	0.662	0.604	0.603	0.628	0.629	0.654	0.653	0.652	0.619	0.671	0.652
Scene	0.791	0.750	0.783	0.787	0.753	0.755	0.592	0.458	0.576	0.593	0.538	0.569
Bibtex	0.502	0.465	0.329	0.539	0.479	0.461	0.346	0.321	0.297	0.322	0.346	0.351
Medical	0.754	0.750	0.694	0.761	0.777	0.767	0.758	0.734	0.666	0.713	0.708	0.708
Yeast	0.615	0.605	0.616	0.656	0.596	0.573	0.507	0.520	0.521	0.532	0.510	0.507

Table 5.2

Recall

	Multilayer perceptron						Decision tree
Dataset	W $\backslash$ o	MLS	MLTL	REM	Our (1)	Our (2)	W $\backslash$ o	MLS	MLTL	REM	Our (1)	Our (2)
Corel5k	0.152	0.169	0.139	0.108	0.175	0.182	0.144	0.126	0.174	0.058	0.152	0.157
Emotions	0.455	0.383	0.454	0.226	0.664	0.697	0.521	0.321	0.520	0.306	0.520	0.571
Flags	0.584	0.624	0.581	0.376	0.780	0.754	0.654	0.618	0.657	0.332	0.674	0.659
Scene	0.681	0.594	0.683	0.650	0.707	0.704	0.579	0.444	0.558	0.541	0.518	0.537
Bibtex	0.318	0.294	0.131	0.216	0.324	0.327	0.328	0.291	0.218	0.166	0.332	0.330
Medical	0.345	0.326	0.229	0.262	0.422	0.436	0.724	0.688	0.664	0.655	0.701	0.703
Yeast	0.560	0.542	0.562	0.364	0.556	0.574	0.502	0.498	0.520	0.343	0.507	0.512

Table 5.3

Weighted F-score

	Multilayer perceptron						Decision tree
Dataset	W $\backslash$ o	MLS	MLTL	REM	Our (1)	Our (2)	W $\backslash$ o	MLS	MLTL	REM	Our (1)	Our (2)
Corel5k	0.171	0.181	0.167	0.141	0.182	0.174	0.127	0.113	0.150	0.075	0.145	0.154
Emotions	0.531	0.422	0.522	0.319	0.651	0.661	0.528	0.351	0.528	0.373	0.529	0.579
Flags	0.567	0.610	0.568	0.437	0.684	0.677	0.653	0.633	0.653	0.426	0.669	0.654
Scene	0.729	0.637	0.727	0.709	0.727	0.726	0.585	0.449	0.565	0.565	0.526	0.548
Bibtex	0.371	0.345	0.160	0.283	0.372	0.370	0.333	0.301	0.229	0.196	0.335	0.336
Medical	0.447	0.427	0.321	0.365	0.529	0.539	0.729	0.700	0.659	0.668	0.696	0.699
Yeast	0.558	0.547	0.558	0.427	0.568	0.570	0.504	0.508	0.520	0.409	0.508	0.509

Precision, recall and weighted F-score were used to assess the classification results. A special emphasis must be put on the recall metric, as it is a direct measure of how sensitive a learning algorithm is to a particular class. In other words, higher recall for a certain class means more samples of this class are recognized, which is of high importance in the case of minority classes. The measurements for precision, recall and weighted F-score are presented in Tables 5.1–5.3, respectively. The best values are highlighted in bold. We also use italics to denote values that are better than those obtained with original (unbalanced) versions of the datasets. Similar to Tables 2–4, our method is referred to as Our (1) and Our (2) for OB $=$ 1 and OB $=$ 2, respectively. MLSMOTE and REMEDIAL algorithms are referred to as MLS and REM, respectively.
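The following sketch illustrates one way to compute these three metrics for multi-label predictions given as binary indicator rows. The text does not state how precision and recall were averaged across classes, so support-weighted averaging (the same scheme as the weighted F-score) is an assumption here, as are the function name and the toy data.

```python
def weighted_scores(y_true, y_pred):
    """Support-weighted precision, recall and F-score for binary indicator rows."""
    num_classes = len(y_true[0])
    supports, precisions, recalls, f_scores = [], [], [], []
    for c in range(num_classes):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t[c] == 1 and p[c] == 1)
        fp = sum(1 for t, p in pairs if t[c] == 0 and p[c] == 1)
        fn = sum(1 for t, p in pairs if t[c] == 1 and p[c] == 0)
        # Undefined ratios are set to zero (a common convention, assumed here).
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f_score = 2 * precision * recall / denom if denom else 0.0
        supports.append(tp + fn)  # number of samples that truly contain class c
        precisions.append(precision)
        recalls.append(recall)
        f_scores.append(f_score)

    total = sum(supports)
    if total == 0:
        return 0.0, 0.0, 0.0

    def average(values):
        return sum(s * v for s, v in zip(supports, values)) / total

    return average(precisions), average(recalls), average(f_scores)


# Toy example (assumption): four samples, two classes.
y_true = [[1, 0], [1, 1], [0, 1], [1, 0]]
y_pred = [[1, 0], [0, 1], [0, 1], [1, 1]]
precision, recall, f_score = weighted_scores(y_true, y_pred)
```

Per-class recall is the quantity most directly affected by oversampling, which is why it is singled out above.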

Observing the results in Tables 5.2 and 5.3, we can conclude that, compared to the results obtained with the original datasets, the proposed method yields a consistent gain in recall and weighted F-score in most cases, the scene dataset being an exception. In some cases, however, this gain in recall is traded off for precision. An increase in the number of data samples containing minority classes seems to increase the sensitivity of the learning algorithms to those classes. Nevertheless, no new data (new information about the classes) are added, meaning that the classes are still poorly represented. This may prevent the learning algorithm from improving its ability to distinguish minority classes from other classes, which explains the occasional drop in precision. Still, the observed drop in precision, when it occurs, does not seem to be dramatic and can be tolerated considering the gains in recall.

Compared to other approaches, the proposed method is consistently better in the case of a multilayer perceptron. In the case of a decision tree, MLTL performs similarly to the proposed method, whereas the other algorithms give considerably worse results.

Overall, the experimental results indicate that the proposed method not only improves the original classification results but also performs better than the other algorithms in most cases.

4.2 Biomedical image segmentation: fundus dataset

In this section, we describe an experiment in which the proposed method was applied to the problem of biomedical image segmentation. The dataset was balanced using the proposed method and then used to train a convolutional neural network (CNN) to perform image segmentation. The results obtained with the proposed method are compared to those of manually balanced data and the original (unbalanced) data. No results were obtained for the other data balancing approaches since they cannot be applied to this kind of data due to the specifics of their data balancing process. We address the question of application of the other methods to image datasets in Section 4.2.3.

4.2.1 Fundus images dataset

A dataset of fundus images was used for the experiment. The dataset consisted of 115 images, each labeled with a segmentation mask. There were eight classes in total, excluding background. The dataset, which is not publicly available, was provided by the Samara Regional Clinical Ophthalmological Hospital named after Eroshevsky [44]. The following eight classes were used: optic disk (1), macula (2), blood vessels (3), retinal hemorrhage (4), soft exudate (5), new coagulates (6), pigmented coagulates (7) and solid exudate (8). This dataset was used for training a CNN, the input of which is a fundus image and the output of which is the corresponding segmentation mask.

4.2.2 Cross-validation sets

The presented dataset was severely imbalanced, since classes 5, 6 and 7 were roughly 10 times less frequent than classes 1–3. Only a few images contained minority classes, meaning that the results of accuracy evaluation may not be representative. Figure 3 shows the class distribution among all samples; minority classes are marked with dashed lines. In order to tackle this problem, k-fold cross-validation was used. Three folds with nonoverlapping validation sets were created to ensure independence of the validation results. Each validation set consisted of 16 images and each training set consisted of 99 images. Images for the validation sets were chosen in such a way that the class frequencies of the validation sets were roughly the same. The final evaluation results were obtained by averaging over the three folds.
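The fold construction can be sketched as follows. This is a simplified illustration under stated assumptions: validation images are drawn at random, whereas in the experiment they were chosen so that the class frequencies of the validation sets were roughly equal. The function name and the seed are hypothetical.

```python
import random


def make_folds(num_samples, num_folds, val_size, seed=0):
    """Build folds whose validation sets are pairwise disjoint."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    folds = []
    for k in range(num_folds):
        # Consecutive slices of the shuffled indices never overlap.
        val = indices[k * val_size:(k + 1) * val_size]
        val_set = set(val)
        train = [i for i in range(num_samples) if i not in val_set]
        folds.append((train, val))
    return folds


# 115 images, three folds, 16 validation images each, hence 99 training images.
folds = make_folds(num_samples=115, num_folds=3, val_size=16)
for train, val in folds:
    print(len(train), len(val))  # 99 16
```

Note that the training sets of different folds overlap; only the validation sets are disjoint.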

Figure 3.

Numbers of images that contain the i-th class.

4.2.3 Balancing the dataset

Each of the folds was balanced using the proposed method and via manual oversampling. In both cases, OB was set to 2.

Due to the fact that each sample contained several classes, it was difficult to perform balancing manually. Applying the oversampling technique was challenging, since copying samples containing the desirable minority classes affected the distribution of all the other classes. In particular, copying samples containing class 6 resulted in lowering the frequency of class 7, which may lead to worse performance on this class. Undersampling was not an option since the dataset was too small and any loss of data was intolerable.
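The side effect described above can be reproduced on a toy example. The data below are purely illustrative (an assumption, not the fundus dataset): class 6 and class 7 each occur in a single, different sample, and only the class 6 sample is duplicated.

```python
def frequency(label_sets, cls):
    """Fraction of samples whose label set contains the given class."""
    return sum(cls in labels for labels in label_sets) / len(label_sets)


# Toy dataset (assumption): 20 samples, all containing class 1.
data = [{1, 6}, {1, 7}] + [{1} for _ in range(18)]

# Manual oversampling: add five copies of the only sample containing class 6.
oversampled = data + [data[0]] * 5

print(frequency(data, 6), frequency(oversampled, 6))  # 0.05 0.24
print(frequency(data, 7), frequency(oversampled, 7))  # 0.05 0.04
```

The frequency of class 6 rises from 1/20 to 6/25, while the frequency of class 7 falls from 1/20 to 1/25, because the total number of samples grows and the number of samples containing class 7 does not.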

The proposed method allowed us to easily increase the class frequencies, so that minority classes were no longer as rare. In our experience, balancing using our method took no longer than several minutes. Although the optimization process is computationally intensive due to the computation of an inverse Hessian, it takes an insignificant amount of time compared to manual oversampling. It should be noted that the overall data balancing process is extremely simple, as only the OB value must be specified. Once the OB value is set, the optimal value for the $\delta$ parameter is found automatically using binary search. In contrast, balancing with manual oversampling was challenging and took about an hour per data fold. We expect it to take even longer for datasets with a larger number of classes.

Methods such as MLSMOTE, REMEDIAL, ADASYN and MLTL are not applicable in the case of medical imaging, since those methods perform balancing via new data generation. New data samples are generated via interpolation of existing data samples. Such a data generation process assumes that interpolation between neighboring data samples creates a new valid data sample, whereas in the image context, this is not true: interpolation between images does not yield a valid image. Furthermore, such interpolation may not be possible at all when images are of different sizes. In contrast, our method does not make any assumptions about the underlying structure of the data and works solely by duplicating some of the data samples, which makes it a general approach to balancing multi-label data or any data that can be described in the multi-label framework from Section 3.1.

Table 6
Average class frequencies for different cross-validation folds (training set)

Class	Without balancing (original)	Balanced (oversampling)	Balanced (our method)
#1	100%	100%	100%
#2	97%	99%	96%
#3	98%	99%	99%
#4	90%	95%	93%
#5	37%	51%	46%
#6	6%	34%	14%
#7	7%	32%	19%
#8	95%	98%	98%

The results of data balancing are shown in Table 6. The table contains average class frequencies for training sets of each fold obtained with our method, manual oversampling, as well as class frequencies of unbalanced folds.

4.2.4 Metrics

Conventional metrics such as recall or precision are not well suited to assessing the performance of the trained neural network, since some of the classes occupy very little area in the images. The percentage of area occupied by each class is shown in Table 7. Therefore, apart from precision, recall and F-score, we report per-class Dice [42] values. Dice is a common metric in biomedical applications and is well suited to cases in which some classes occupy little area in images.
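For reference, the Dice coefficient of a predicted binary mask $A$ and a ground-truth binary mask $B$ is $2|A \cap B| / (|A| + |B|)$. The sketch below computes it for masks stored as lists of rows. The function name, the toy masks and the convention of returning 1 when both masks are empty are assumptions made for illustration.

```python
def dice(pred_mask, true_mask):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) of two binary masks."""
    intersection = pred_area = true_area = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            intersection += 1 if (p and t) else 0
            pred_area += 1 if p else 0
            true_area += 1 if t else 0
    if pred_area + true_area == 0:
        return 1.0  # both masks empty; this convention is an assumption
    return 2 * intersection / (pred_area + true_area)


# Toy 10 x 10 image (assumption) in which the class occupies only two pixels.
true_mask = [[0] * 10 for _ in range(10)]
true_mask[4][4] = true_mask[4][5] = 1

empty_pred = [[0] * 10 for _ in range(10)]  # the class is missed entirely
half_pred = [[0] * 10 for _ in range(10)]
half_pred[4][4] = 1                         # one of the two pixels is found

print(dice(empty_pred, true_mask))  # 0.0
print(dice(half_pred, true_mask))
```

In this toy case, the empty prediction is correct on 98 of the 100 pixels, yet its Dice value is 0, which shows why Dice is informative for classes that occupy little area.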

Table 7
Percentage of image area occupied by each class in the dataset

Class	Occupied area percentage
#1	83.1%
#2	2.1%
#3	1.0%
#4	10.8%
#5	1.1%
#6	0.2%
#7	0.1%
#8	1.3%

It is worth noting that the issue with little area adds a new layer of complexity to the problem in the form of “intra-class” imbalance, when some classes are not only underrepresented in the dataset as a whole, but are also underrepresented within a single data sample (occupy only a few pixels of an image). Though it is not explicit, the proposed algorithm works with an assumption that every class is equally well represented within individual data samples and that each class is equally difficult to learn. Addressing this limitation will be the subject of our future research. Nevertheless, the experimental results show that the proposed method is able to bring improvements even in such a complex setting.

4.2.5 Data augmentation

The specifics of the biomedical domain make it difficult to obtain large amounts of data. Hence, datasets are often small and unbalanced. The dataset used in our experiment consisted of only 115 images, and aggressive data augmentation was required to prevent the neural network from overfitting to the data.

The data were augmented dynamically during training using random flips and rotations [51] in the range of $[-30, 30]$ degrees. These data augmentation techniques have proved to be effective in computer vision problems, as they reduce overfitting.

Biological tissue is prone to deformations. In order to make the network invariant to such deformations, elastic image deformation was used, as it is an effective technique commonly employed in biomedical applications. An example of applying elastic augmentation to an image is shown in Figs. 4 and 5.

It should be noted that the data augmentation techniques used here may affect classes within the data samples. For example, elastic image deformation may change the area that is occupied by some classes within the image. However, our method does not take this into consideration and assumes that within a data sample, each class is equally well represented.

Table 8
Results of the experiment on fundus data. First half of the table contains Dice values for the experiment with no data augmentation. The second half is analogous but for the experiment with data augmentation. The best values are highlighted in bold

Class	No augmentation			With augmentation
	Without balancing	Manual oversampling	Our method	Without balancing	Manual oversampling	Our method
#1	0.950	0.947	0.952	0.957	0.956	0.957
#2	0.814	0.778	0.797	0.827	0.806	0.830
#3	0.586	0.588	0.614	0.623	0.620	0.657
#4	0.706	0.725	0.732	0.742	0.723	0.746
#5	0.504	0.501	0.527	0.542	0.522	0.553
#6	0.186	0.246	0.276	0.230	0.288	0.229
#7	0.000	0.010	0.012	0.045	0.027	0.077
#8	0.036	0.022	0.004	0.128	0.071	0.120
Mean	0.473	0.484	0.494	0.504	0.511	0.528

Table 9

Results of the experiment on fundus data. Contains average values for recall, precision and F-score. The best values are highlighted in bold

Metric	No augmentation			With augmentation
	Without balancing	Manual oversampling	Our method	Without balancing	Manual oversampling	Our method
Recall	0.442	0.473	0.446	0.463	0.466	0.481
Precision	0.552	0.521	0.554	0.561	0.568	0.557
F-score	0.469	0.483	0.473	0.490	0.498	0.499

To analyze the effects of augmentation on final results, we conducted a second experiment in which the CNN was trained with no augmentation.

Figure 4.

Original fundus image.

Figure 5.

Elastic augmented fundus image.

4.2.6 Semantic segmentation using neural network

In order to perform image segmentation, a CNN was built and trained. U-Net [52] was chosen as the base architecture of the neural network, as it was designed specifically for biomedical segmentation. Xception-65 [53], pre-trained on the ImageNet dataset [54], was used as a feature extractor. The network was trained using the Adam optimizer [46] with a learning rate of 0.0045. Focal loss [29] was used as the loss function, as it has been shown to be effective in class imbalance settings. The network was trained for 2000 iterations with a batch size of eight images.
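To make the role of the loss function concrete, the sketch below evaluates the binary focal loss $-\alpha_t (1 - p_t)^{\gamma} \log p_t$ for a single pixel and class. The values $\gamma = 2$ and $\alpha = 0.25$ are the defaults suggested in the focal loss paper; the values actually used in this experiment are not stated, so they are assumptions here, as is the function name.

```python
import math


def focal_loss(prob, target, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction.

    prob   -- predicted probability of the positive class
    target -- ground-truth label, 0 or 1
    """
    p_t = prob if target == 1 else 1.0 - prob
    alpha_t = alpha if target == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-7), 1.0)  # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.9, 1)  # confident and correct: strongly down-weighted
hard = focal_loss(0.1, 1)  # confident and wrong: keeps most of its weight
print(easy < hard)  # True
```

With `gamma=0.0` and `alpha=1.0` the expression reduces to the ordinary cross-entropy for positive targets, which is a quick way to sanity-check an implementation.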

4.2.7 Results

Tables 8 and 9 show that training on the data balanced by the proposed method generally yielded better results than training on the original data and on the manually balanced data. The best values are indicated in bold. Similar to Tables 5.1–5.3, we use italics to denote results that are better than those obtained with the original (unbalanced) data.

Figure 6.

Predicted binary mask for 4th class.

An example of the CNN’s predictions is shown in Fig. 6. Figure 7 shows the corresponding ground-truth segmentation mask. Together, they illustrate how the proposed method increases the sensitivity of a learning algorithm, which is consistent with the experimental results presented in Section 4.1.3.

The proposed method outperforms manual oversampling for every minority class (5, 6, 7) in the experiment where no augmentation was used. However, compared to the unbalanced data, its Dice is worse for classes 2 and 8. The reason for this drop seems to be that the network is prone to memorizing those particular classes in the presence of a large number of duplicates. It can be seen from the results of the experiment with data augmentation that it is crucial to perform augmentation to prevent the CNN from overfitting on those particular classes. The proposed method does not take into account such specifics of individual classes and assumes that every class is equally well represented in a data sample and is equally difficult for a learning algorithm to learn. We will address this particular issue in our future research.

In the case of using data augmentation, balancing using the proposed method still outperforms the other two in terms of mean Dice. Dice, for most of the classes, is higher compared to both manual oversampling and the original (unbalanced) dataset.

An important factor to note is that data balancing via oversampling was challenging and time consuming, yet it did not yield considerably better results compared to the unbalanced version. On the contrary, it was easy to balance data using the proposed method, which also yielded better results compared to both manual oversampling and the original dataset.

The overall results suggest that the proposed method allows us to reduce the influence of class imbalance in an image dataset on the learning algorithm. Although this setting is much more complex since labels are segmentation masks instead of simple binary vectors and classes are not represented equally well among data samples, the proposed method still allows us to improve upon original segmentation results.

4.3 Biomedical image segmentation: BCSS dataset

In this section, we describe a similar experiment performed on the Breast Cancer Semantic Segmentation (BCSS) dataset, which was first introduced in [55]. The dataset was balanced using the proposed method and then used to train a convolutional neural network (CNN) to perform image segmentation. Similar to the previous experiment, the results obtained with the proposed method are compared to those of manually balanced data and the original unbalanced data. No results were obtained for the other data balancing approaches since they cannot be applied to this kind of data due to the specifics of their data balancing process.

Figure 7.

Ground-truth binary mask for 4th class.

In this subsection, we detail only the cross-validation setup, the data preprocessing and the balancing of the dataset. The rest of the experimental settings were similar to those described in Sections 4.2.4, 4.2.5 and 4.2.6.

4.3.1 Breast Cancer Semantic Segmentation (BCSS) dataset

The BCSS dataset consists of 151 images of cancer tissue and contains 21 classes in total. In our experiments, only eight classes were used; the rest of the classes were labeled as background. The list of the classes is as follows: background (0), tumor (1), stroma (2), lymphocytic infiltrate (3), necrosis or debris (4), blood (5), exclude (6), fat (7) and plasma cells (8). Similar to the fundus dataset, this dataset was used for training a CNN, the input of which is a cancer tissue image and the output is a corresponding segmentation mask.

4.3.2 Cross-validation sets

Similar issues are encountered in the case of the BCSS dataset: some of the classes are extremely underrepresented and are contained only within a few samples. Figure 8 shows class distribution among all samples in the dataset; minority classes are marked with dashed lines.

To ensure representativeness of the CNN’s evaluation, k-fold cross-validation was used. Similarly, three folds with nonoverlapping validation sets were created to ensure independence of the validation results. Each validation set consisted of 16 images and each training set consisted of 135 images. The final evaluation results were obtained by averaging among the three folds.

4.3.3 Data preprocessing

In contrast to the fundus dataset, images in the BCSS dataset vary in size, ranging from 1000 pixels to 9000 pixels in width and height. To solve this issue, each image was sliced into nonoverlapping pieces of 900 $\times$ 900 pixels. Pieces that did not fit the aforementioned size were padded with zeros.
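The slicing step can be sketched as follows for a single-channel image stored as a list of rows. The function name is hypothetical, and the placement of the padding on the right and bottom edges is an assumption; the same procedure would be applied to the segmentation mask.

```python
def tile_image(image, tile=900):
    """Cut a 2D image into nonoverlapping tile x tile pieces, zero-padding edge pieces."""
    height, width = len(image), len(image[0])
    pieces = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            piece = []
            for y in range(top, top + tile):
                if y < height:
                    row = image[y][left:left + tile]
                    row = row + [0] * (tile - len(row))  # pad on the right
                else:
                    row = [0] * tile                     # pad at the bottom
                piece.append(row)
            pieces.append(piece)
    return pieces


# Small demonstration with 2 x 2 tiles on a 3 x 5 image of ones.
demo = [[1] * 5 for _ in range(3)]
pieces = tile_image(demo, tile=2)
print(len(pieces))  # 6
print(pieces[-1])   # [[1, 0], [0, 0]]
```

With `tile=900`, an image of 1000 $\times$ 1000 pixels would yield four pieces, three of which consist mostly of padding.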

This procedure was applied to both training and validation sets. After preprocessing, each fold contained roughly 2150 training images and 220 validation images.

Figure 8.

Numbers of images that contain the i-th class.

4.3.4 Balancing the dataset

The BCSS dataset was balanced in a similar manner to the fundus dataset. Each of the folds was balanced using the proposed method and via manual oversampling. In both cases, OB was set to 2.

In contrast to the fundus dataset, manual oversampling of the BCSS dataset was much harder, since its training sets contained roughly 2150 images, whereas those of the fundus dataset contained only 99. During balancing, the same issue was encountered as in the case of the fundus dataset: increasing the frequency of one minority class led to a decrease in the frequencies of other minority classes. In our experience, manual balancing of a single data fold took up to several hours. It should be noted that careful manual data balancing on the scale of thousands of images becomes infeasible and, as the results will show, may even harm the performance of the learning algorithm in some cases.

As in the case of the fundus dataset, applying the proposed algorithm was easy and allowed us to balance the BCSS dataset in less than 10 minutes.

Table 10
Average class frequencies for different cross-validation folds (training set)

Class	Without balancing (original)	Balanced (oversampling)	Balanced (our method)
#0	6%	16%	19%
#1	81%	68%	72%
#2	96%	95%	97%
#3	43%	53%	53%
#4	24%	27%	23%
#5	10%	24%	21%
#6	31%	37%	33%
#7	15%	41%	22%
#8	9%	20%	19%

Table 11

Percentage of image area occupied by each class in the dataset

Class	Occupied area percentage
#0	0.34%
#1	35.1%
#2	31.1%
#3	9.2%
#4	5.7%
#5	0.2%
#6	2.4%
#7	1.6%
#8	1.6%

Table 12

Results of the experiment on BCSS data. The best values are highlighted in bold

Class	Without balancing	Manual oversampling	Our method
#0	0.000	0.000	0.001
#1	0.647	0.631	0.661
#2	0.578	0.556	0.619
#3	0.323	0.284	0.288
#4	0.171	0.131	0.152
#5	0.020	0.049	0.043
#6	0.113	0.098	0.112
#7	0.145	0.154	0.169
#8	0.017	0.023	0.026
Mean	0.224	0.214	0.230

Table 13

Results of the experiment on BCSS data. Average values for recall, precision and F-score

Metric	Without balancing	Manual oversampling	Our method
Recall	0.173	0.154	0.192
Precision	0.206	0.210	0.223
F-score	0.159	0.149	0.185

Since the BCSS dataset consists of images, the application of methods such as MLSMOTE, REMEDIAL, ADASYN and MLTL was not an option. These algorithms rely on data interpolation in order to generate new data samples, and in the case of images, this leads to the generation of degenerate data.

The results of data balancing are shown in Table 10. It contains average class frequencies for training sets of each fold obtained with our method, manual oversampling as well as class frequencies of unbalanced folds. We also provide information about the area each class occupies in an image in Table 11.

4.3.5 Results

Tables 12 and 13 show that the proposed method performs better on average compared to the results obtained with manual oversampling and original (unbalanced) data. The best values are indicated in bold; italics are used to denote results that are better than those obtained with original (unbalanced) data. Although gains are not very pronounced, the proposed method does bring an improvement in most of the classes. It should be noted that with the proposed method, the CNN is able to recognize the background class (class 0). However, a drop in Dice values for some classes can be observed. From this, we can conclude that increasing the frequency of a class may not necessarily lead to improvements in this class. As in the case of the fundus dataset, the proposed method may trade off the classification quality of some classes for other classes. We relate such behavior to the classes’ dynamics, which our method does not consider during balancing. Our future research will be devoted to addressing this issue.

As for manual oversampling, it performs worse on average even compared to the unbalanced data. Though careful balancing was attempted, no considerable improvement was attained. This shows the importance of automatic balancing in cases with large numbers of data samples. The proposed method not only allows us to perform data balancing quickly, but also performs it more effectively than manual balancing, without hurting the algorithm’s classification performance on average.

The overall results suggest that the proposed method does bring an improvement to the original classification results, yet it may not be the case for some classes. The method may trade off performance in some classes for other classes, but this happens in a minority of classes, and in some cases, may be tolerated considering gains.

5. Conclusion

In this paper, we proposed a method for biomedical multi-label data balancing. The method can be applied in the case of semantic segmentation problems for balancing the corresponding image data. The other key feature of the proposed method is that it does not make any assumptions regarding the underlying structure of the data to be balanced. In other words, the proposed method can be applied to multi-label data or other data that can be described using a multi-label framework, making it general and widely applicable.

A thorough experimental study of the proposed method was conducted. Firstly, to support our claim about the wide applicability of the proposed method, we applied our method to seven common multi-label datasets and compared the results to those of other multi-label data balancing approaches, such as MLTL, MLSMOTE and REMEDIAL. The results of the experiments show that the proposed method allows us to bring a consistent improvement in classification results in almost all cases, making it a competitive alternative to existing approaches. Secondly, we applied the proposed method to two highly unbalanced biomedical image datasets, to which other data balancing approaches could not be applied. The initial biomedical image segmentation problem was reformulated using a multi-label framework in order to perform balancing of the image data. The results obtained with the proposed method were compared to those obtained for a manually balanced version of the datasets as well as their original (unbalanced) versions. The experimental results show that the method allows us to reduce the influence of class imbalance on the learning algorithm, improving its original classification results.

However, it should be noted that improvements may not be achieved for all the classes. This is related to how classes behave within a data sample and within the dataset as a whole. The proposed method is not sensitive to such dynamics, as it assumes that each class is equally well represented within a data sample and is equally hard to learn, which can be considered a drawback. Future research will be devoted to addressing this issue. One possible research direction is to incorporate a form of weighting of individual data samples that would indicate how well a certain data sample represents a particular class. Another research direction is to use the proposed method in conjunction with more sophisticated learning algorithms, such as [56, 57, 58, 59]. Nevertheless, drops in performance in some classes can be tolerated considering the gains in others. It should also be noted that balancing using the proposed method is fast, especially in comparison to manual balancing, which may take up to several hours.

The overall results indicate that the proposed method can be successfully used to balance multi-label data. It allows us to perform data balancing quickly and still attain comparable or better results compared to manual oversampling, which may be difficult and time consuming. It does not rely on feature space to perform oversampling, unlike MLSMOTE, MLTL and REMEDIAL, which makes it superior in cases when such reliance is impossible (for example, image data).

Footnotes

The project is available via the following link: https://github.com/MakiResearchTeam/MultiClassLabelBalancing.

Acknowledgments

This work was funded by the Russian Foundation for Basic Research under RFBR grant # 19-29-01135 and the Ministry of Science and Higher Education of the Russian Federation within a government project of Samara University and FSRC “Crystallography and Photonics” RAS.

