Abstract
In this paper, we propose a data balancing method for multi-label biomedical data. The method can be applied in the case of semantic segmentation problems for balancing the corresponding image data. The proposed method performs oversampling of instances of minority classes in a way that increases the frequencies of appearance (a ratio of number of samples, containing this class, over the total number of samples in the dataset) of minority classes in the data, thereby reducing the class imbalance. The effectiveness of the proposed method is shown experimentally by applying it to two highly unbalanced biomedical image datasets. A convolutional neural network (CNN) was trained on several versions of those datasets: one balanced with the proposed method, another balanced with manual oversampling and an unbalanced version. The results of the experiments validate the effectiveness of the proposed method, proving that it allows the influence of class imbalance on the learning algorithm to be reduced, thus improving its original classification results for most of the classes. Apart from biomedical image data, the proposed method was applied to several common multi-label datasets. Inherently, the proposed method does not make any assumptions about the underlying structure of the data to be balanced; therefore, it can be applied to all types of data (vectors, images, etc.) that can be described in a multi-label framework. It also can be used in conjunction with any learning algorithm that is suitable for multi-label data. To illustrate its wider applicability, a series of experiments was conducted using seven common multi-label datasets. An experimental comparison to existing multi-label data balancing approaches is provided, as well. The experimental results show that the proposed method presents a competitive alternative to existing approaches.
Keywords

Introduction
Recently, deep convolutional models have had success in many applications related to image understanding [1, 2, 3, 4]. Apart from common industrial problems (object detection, person identification, etc.), convolutional neural networks were successfully applied to biomedical challenges, such as electroencephalogram analysis [5, 6, 7, 8, 9, 10], gait recognition [11] and biomedical image segmentation [12, 13, 14]. In this paper, we mainly consider the latter. Due to medical specificity, biomedical image segmentation comes with a number of challenges, such as small amounts of data and severe class imbalance.
Class frequencies (a) of a balanced dataset; (b) of a dataset where instances of only the 4th class are less frequent; (c) of a dataset where instances of class #3 and #4 appear more frequently than the instances of classes #1 and #2.
A class imbalance [15] is a situation when samples of some classes appear more frequently in the dataset than the samples of other classes (see Fig. 1). Class imbalance influences the classification results of many algorithms [16, 17, 18], especially machine learning algorithms [19, 20]. Due to this problem, the classification rate on samples of the rare (minority) classes is low [21, 22]. This problem is particularly relevant to biomedical challenges [23, 24], as it is common that the information about a disease is contained in the samples of the less-populated classes, causing them to be underrepresented. Underrepresentation leads to poor model performance on those classes; consequently, a disease-detecting test may give incorrect results. Another reason biomedical data are often imbalanced is that different diseases have different (often low) rates of appearance (e.g., kidney failure). This results in the fact that images of one type (healthy subjects) are easier to collect because they are more common than the others (subjects that have a disease). Therefore, obtaining a well-balanced dataset is often not feasible.
In order to tackle the class imbalance problem, various kinds of methods have been proposed. Existing data balancing methods operate either at an algorithm level (no data modification) or at data level (resampling, modifying data). The latter are based on two common techniques [25]: undersampling and oversampling. With undersampling, data samples that belong to the predominant (major) classes are removed. With oversampling, data samples that belong to the rare (minority) classes are generated (or copied).
These techniques can be easily applied when each data sample belongs to a single class; however, data balancing becomes challenging when each data sample belongs to multiple classes. In this case, each data sample is associated with a set of labels (labelset) instead of only a single class label. Such datasets are called multi-label datasets. Applying oversampling or undersampling to such datasets is challenging, because copying or deleting any datum points causes changes in the distribution of several classes. An example of such a change in the classes’ distribution could be oversampling of instances of one of the minority classes – this particular minority class will become more frequent, yet frequencies of other minority classes may decrease, making the overall class imbalance worse. This problem is called a multi-label data imbalance [26, 27].
A number of multi-label data balancing methods have been proposed to solve the multi-label class imbalance problem. However, to the best of our knowledge, none of the existing data balancing approaches, with the exception of manual balancing, can be applied to image data in the context of image segmentation. This is mainly because the existing data balancing approaches do not perform copying or deleting of samples of the dataset, but perform the generation of new data samples using heuristics, such as data interpolation, which is not applicable in the case of images.
Here, we propose a new method for multi-label data balancing in order to address multi-label data imbalance in biomedical image data. The effectiveness of the proposed method is validated on two highly imbalanced biomedical image datasets. The results of the experiment show that the proposed method improves the original classification results. Apart from image data, the proposed method has been applied to seven common multi-label datasets to illustrate its wider applicability beyond image data. A comparison to existing multi-label data balancing approaches is provided, as well.
The paper is structured as follows:
Section 2 presents a survey of existing data balancing approaches. Section 3 provides a detailed description of the proposed method, presents its implementation details and gives recommendations regarding its application. Section 4 presents a thorough experimental study that validates the effectiveness of the proposed method.
This section provides an overview of existing data balancing methods and algorithms.
To overcome data imbalance, different methods and algorithms have been proposed. These can be divided into three main groups:
(I) Algorithm-level approaches, which do not modify the distribution of the data but evaluate (in terms of, for example, misclassification cost) instances from different classes differently in correspondence with the number of instances of a certain class in the data. This type of approach is the most common – examples may include:
(II) Data-level approaches, which modify the distribution of the instances so that the minority class is adequately represented in the dataset. A more detailed analysis of approaches of this type is given below.
(III) Mixed approaches (combining both algorithm-level and data-level approaches). It should be noted that in this area of research, there is a lack of comparative studies that can provide conclusions on which approach should be used to obtain the best result in a given scenario. As some of such studies show [35], overall, the hybrid strategies are more effective. One of the examples of mixed approaches is proposed in [36], where a dynamic GAN is used to generate realistic-looking data samples of a minority class (data-level approach), thus balancing the dataset, and the parameters of the subsequently used classifier depend on the evaluations of the changes in the imbalance ratio and the samples’ distributions (algorithm-level approach).
However, due to the fact that the proposed method is a data-level approach, in this paper, we focus only on this type of approach (namely, the resampling algorithms). These approaches can be differentiated into two subgroups:
undersampling algorithms and oversampling algorithms.
(i) Undersampling approaches, in essence, are based on the removal of data samples. One of the data balancing approaches of this type, Multi-Label Tomek Link (MLTL) [37], is based on Tomek link [38] undersampling. Two instances define a Tomek link if, firstly, one is the nearest neighbor of another and vice versa, and secondly, if they belong to different classes. Originally, the Tomek link undersampling algorithm addressed a single-label classification problem. MLTL is a generalization of the Tomek link algorithm for multi-labeled datasets. The MLTL approach uses IRLbl and MeanIR metrics to determine the minority classes as well as the adjusted Hamming distance to calculate differences between the labelsets. The essence of Tomek link undersampling is in removing the closest samples (in feature space) with different labelsets from the dataset.
(ii) The oversampling approaches, conversely, are based on the generation of new data samples. This technique was employed in [38, 39, 40, 41].
Works [39, 40] were focused on single-label datasets. In SMOTE [40], a new data sample from a minority class (minority data sample) is generated by adding a vector to a data sample of this class. The vector is obtained by multiplying a vector between the considered data sample and one of its k-nearest neighbors (in feature space) of the same class by a random number. ADASYN [39] is based on SMOTE and proposes an idea of adaptively generating data samples of the minority classes according to their distributions.
Both SMOTE and ADASYN approaches are focused on binary classification only. Both approaches can be easily applied to a single-label classification problem on structured data. These approaches are not applicable to multi-label class imbalance problems, but they are often used as a base for multi-label balancing approaches.
MLSMOTE [41] is an adaptation of the SMOTE approach for the multi-label problem. In order to select minority classes and measure the degree to which the dataset is unbalanced, the authors suggest considering all labels separately and calculating two metrics: Imbalance Ratio per Label (IRLbl) [27] and Mean Imbalance Ratio (MeanIR) [27]. The IRLbl value shows the imbalance level for a specific label. The MeanIR stands for an average level of imbalance in the dataset. In the MLSMOTE approach, a class is considered minor if its IRLbl is lower than MeanIR. Special attention is given to synthetic data generation, especially to synthetic labelset production.
A further development of this approach was proposed by the same authors in [42]. In this paper, the authors consider the concurrence between imbalanced labels, proposing special metrics, namely SCUMBLE (Score of ConcUrrence among iMBalanced LabEls) and SCUMBLELbl, in order to assess this concurrence along with the resampling algorithm, called REMEDIAL (REsampling MultilabEl datasets by Decoupling highly ImbAlanced Labels).
In summary, it should be noted that, on one hand, approaches based on the undersampling technique are unsuitable for small datasets, since the removal of any of the data sample may lead to overfitting. This is especially the case for biomedical datasets, which commonly consist of very few data. On the other hand, the aforementioned oversampling approaches are not applicable to images. MLSMOTE, ADASYN and REMEDIAL use interpolation between data samples to generate new data, which implies that the number of features is always the same, and that each of the features are independent of each other. Such assumptions, however, do not hold for images. Firstly, images may differ in size, and secondly, images exhibit a spatial structure, meaning that pixels are dependent on their neighbors, which goes against the assumption about features’ independence. As a result, interpolation in the case of images would lead to degenerate data samples.
The method proposed in this paper neither focuses on the feature space for new data generation nor uses undersampling. It performs balancing on a data level via oversampling in a way that increases the frequencies of minority classes. In this context, the frequency of a class means the percentage of data samples containing that class. This method allows us to simultaneously increase the frequencies of minority classes, which is especially challenging in the case of multi-label datasets, the reason being that during manual balancing, increasing the frequency of one class may lead to a decrease in the frequency of another class. This makes the proposed method superior in cases where there are only few data available or if relying on the feature space is impossible (as in the case of images) or when manual oversampling is difficult (which is the case when there are a lot of classes in the data).
In this section, we provide a detailed description of the proposed method and its implementation details, and give a thorough explanation of how to apply it.
In the multi-label setting, data imbalance is expressed as some of the classes being significantly more frequent than the others. An ideal case would be when all labels have frequencies close to 1. The proposed method balances the data by oversampling the minority classes (duplicating data samples containing those classes), making it as close to the ideal case as possible given the constraint on the number of duplicates that can be generated.
It should be noted that the described ideal case (all frequencies close to 1) is true in the context of biomedical image segmentation, where it is desirable to have each class present in every image. This, however, may not be desirable in the context of multi-label classification, as individual labelsets present a form of compositional classes. In this case, making all classes’ frequencies close to one may harm data variety, making certain labelsets abundant compared to others. In this case, other approaches of data balancing may be preferable. Nevertheless, as the experimental results show, the proposed method is able to improve original classification results even in the context of multi-label classification.
The data balancing process, i.e., oversampling, is formulated as an optimization problem. The formulated problem is then solved using numerical methods.
Detailed description
We make a loose assumption that increasing the frequencies of minority classes will make them better represented in the dataset, hence improving the accuracy of learning algorithms on those classes. To compute frequencies of all classes, we construct a binary representation
A simple binary representation example. Here, 
We then construct a set
where
Increasing the frequency of the
where
With such a formulation, oversampling can be performed by increasing
We omit the fact that there is only a finite set
We strive for the ideal case of
However, such a problem is ill posed for two reasons:
The range of In some cases, maximization of
In order to solve the first issue, we constrain
This way, we ensure that
In order to solve the second issue, i.e., constrain the number of generated copies, a second constraint on the
This regularization function forces the optimal solution to have fewer copies.
The final objective function to be minimized is formulated as:
where
Once the optimal solution
The obtained
Finding the optimal solution analytically is not feasible because the formulated optimization problem is highly non-linear. Therefore, we employ numerical methods, such as the Newton method, to find the optimal
A naive implementation of the Newton method turned out to be numerically unstable, as the hessian of the objective function is often degenerate. In order to improve the issue, an identity matrix multiplied with a positive scalar is added to ensure that the hessian is a positive definite. To make the optimization process less computationally intensive, the hessian is cached and recomputed only every eighth iteration. In all of our experiments, we perform optimization for 512 iterations, and the identity matrix is multiplied by 2.
We compared optimization methods available in the SciPy [43] library to our implementation of the Newton method. Our study has shown that our solution has better convergence.
Gradients and the hessian with respect to
Applying the proposed method
When applying the proposed method, one first needs to map the dataset into its binary representation. Then, unique binary vectors
Once preliminaries are finished, one must find an optimal value for the
If OB equals 1, the number of generated duplicates equals the number of data samples in the initial dataset. In this case, the resulting balanced dataset will be twice as large as the initial dataset.
The optimal OB value depends on the specifics of the dataset and the learning algorithm one uses. We recommend generating several versions of the balanced dataset with OB set to 1, 2 and 3. By comparing which option results in the best performance of the learning algorithm, one can find the optimal OB value. The number of generated duplicates is a monotonic function of
Once OB is set and the corresponding
The computed
Group data samples by Choose an index Choose randomly (using uniform distribution) a data sample that corresponds to the unique binary vector
To ensure that all the data samples are being used, we choose them in a loop instead of sampling uniformly.
With such implementation, it may not seem obvious when actual oversampling happens, since no explicit copies are created. In this case, oversampling is carried out during training of the learning algorithm by choosing some of the data samples more often than others in accordance with the distribution
One may use the obtained
In our previous work [45], a similar method was presented. Though the general framework remains the same, the new method is much better developed in several regards:
The new method is much better theoretically grounded. The objective function is derived using probabilistic principles, whereas the old method uses L2-norm without any theoretical backing for such choice. There was no clear instruction on how to perform data balancing. In contrast, in this paper, we detail the way a dataset must be balanced via the introduction of the oversampling budget (OB) concept, which has an intuitive interpretation of a relative number of samples. The new method presents a much better parame-trization for its parameters, whereas parametrization in the old method may lead to a degenerate solution with negative The old method had much more hyperparameters, such as target class frequencies, regularization scale and gradient descent optimizer, as well as more optimization hyperparameters, such as number of iterations and the step size. Contrarily, the new method requires one to specify only the OB value, as the rest is already specified (optimizer) or automated (finding optimal regularization scale). The OB value has a natural interpretation of the relative number of duplicates. It is easy to find an optimal value for it, as the range of values is not large and concrete recommendations are given regarding the initial choice of OB.
In this section, a thorough experimental study is conducted in order to assess the effectiveness of the proposed method. It consists of three subsections: in the first subsection, the proposed method is applied to seven common multi-label datasets, and a direct comparison to existing data balancing approaches is provided; in subsections two and three, the proposed method is applied to two highly unbalanced biomedical image datasets which other approaches cannot be applied to. The results indicate that the proposed method is able to improve the original classification results and presents a competitive alternative to existing data balancing approaches.
Comparison to existing approaches
In this subsection, a series of experiments is conducted to compare the proposed method to existing data balancing approaches. We compare our method to MLSMOTE [41], MLTL [37] and REMEDIAL [42]. An open-source implementation of the listed data balancing approaches, the Mulan library [47], was used during the experiments.
All of the approaches, including the proposed one, were applied to seven multi-label datasets that are publicly available [47]: corel5k, emotions, flags, scene, bibtex, medical and yeast. Table 1 shows key characteristics of these datasets.
Description of the datasets
Description of the datasets
Class frequencies for the flags dataset
In order to balance the data using other approaches, we use the default parameters provided in the Mulan library for the corresponding algorithms.
In the case of the proposed method, we created two balanced versions of the initial datasets: the first one with OB
After balanced versions of the datasets had been created, we computed class frequencies for some of them to compare them to the original class frequencies. Tables 2–4 contain class frequencies for flags, emotions and scene datasets, respectively. Class frequencies obtained with our method are denoted as Our (1) and Our (2) for OB
Class frequencies for the emotions dataset
Class frequencies for the emotions dataset
Class frequencies for the scene dataset
It should be noted that some of the datasets, such as emotions and scene, do not seem to have a pronounced imbalance. Still, those datasets are often used in the related literature, so we include them in our experiments to provide a more detailed comparison to other methods.
Two learning algorithms were trained on the datasets: decision tree and multilayer perceptron. We used implementation provided by the Sci-Kit Learn library [50]. During the experiments, the default values of the hyperparameters were used.
Results
In this section, we present the results obtained after training a decision tree and a multilayer perceptron on both balanced and unbalanced versions of the datasets. The results were averaged across three runs of training the learning algorithms.
Precision
Precision
Recall
Weighted F-score
Precision, recall and weighted F-score were used to assess the classification results. A special emphasis must be put on the recall metric, as it is a direct measure of how sensitive a learning algorithm is to a particular class. In other words, higher recall for a certain class means more samples of this class are recognized, which is of high importance in the case of minority classes. The measurements for precision, recall and weighted F-score are presented in Tables 5.1–5.3, respectively. The best values are highlighted in bold. We also use italics to denote values that are better than those obtained with original (unbalanced) versions of the datasets. Similar to Tables 2–4, our method is referred to as Our (1) and Our (2) for OB
Observing the results in Tables 5.2 and 5.3, we can conclude that the proposed method yields a consistent gain in recall and F1-measure in most of the cases compared to the results obtained with original datasets, except for the scene dataset, although in some cases, this gain in recall is traded off for precision. An increase in the number of data samples containing minority classes seem to increase the sensitivity of learning algorithms on those classes. Nevertheless, no new data (new information about classes) are being added, meaning that the classes are still purely represented, which may disallow the learning algorithm from improving its ability to distinguish minority classes from other classes, which explains the occasional drop in precision. Still, the observed drop in precision, if it happens, does not seem to be dramatic and can be tolerated considering gains in recall.
Compared to other approaches, the proposed method is consistently better in the case of a multilayer perceptron. In the case of a decision tree, MLTL performs similarly to the proposed method, whereas the other algorithms give considerably worse results.
Overall, experimental results indicate that the proposed method allows us not only to improve original classification results, but also performs comparatively better than the other algorithms in most of the cases.
In this section, we describe an experiment in which the proposed method was applied to the problem of biomedical image segmentation. The dataset was balanced using the proposed method and then used to train a convolutional neural network (CNN) to perform image segmentation. The results obtained with the proposed method are compared to those of manually balanced data and the original (unbalanced) data. No results were obtained for the other data balancing approaches since they cannot be applied to this kind of data due to the specifics of their data balancing process. We address the question of application of the other methods to image datasets in Section 4.2.3.
Fundus images dataset
A dataset of fundus images was used for the experiment. The dataset consisted of 115 images, each labeled with a segmentation mask. There were eight classes in total, excluding background. The dataset was provided by Samara Regional Clinical Ophthalmological Hospital, named after Eroshevsky [44], which is not publicly available. The following eight classes were used: optic disk (1), macula (2), blood vessels (3), retinal hemorrhage (4), soft exudate (5), new coagulates (6), pigmented coagulates (7) and solid exudate (8). This dataset was used for training a CNN, the input of which is a fundus image and the output is a corresponding segmentation mask.
Cross-validation sets
The presented dataset was severely imbalanced, since classes 5, 6 and 7 were roughly 10 times less frequent compared to classes 1–3. Only a few images contained minority classes, meaning that the results of accuracy evaluation may not be representative. Figure 3 shows the class distribution among all samples; minority classes are marked with dashed lines. In order to tackle this problem, k-fold cross-validation was used. Three folds with nonoverlapping validation sets were created to ensure independence of the validation results. Each validation set consisted of 16 images and each training set consisted of 99 images. Images for the validation sets were chosen in a such way that the validation sets’ class frequencies were roughly the same. The final evaluation results were obtained by averaging among the three folds.
Numbers of images that contain the i-th class.
Each of the folds were balanced using the proposed method and via manual oversampling. In both cases, OB was set to 2.
Due to the fact that each sample contained several classes, it was difficult to perform balancing manually. Applying the oversampling technique was challenging, since copying samples containing the desirable minority classes affected the distribution of all the other classes. In particular, copying samples containing class 6 resulted in lowering the frequency of class 7, which may lead to worse performance on this class. Undersampling was not an option since the dataset was too small and any loss of data was intolerable.
The proposed method allowed us to easily increase class frequencies, where minority classes were not as rare. In our experience, balancing using our method took no longer than several minutes. Although the optimization process is computationally intensive due to computation of an inverse hessian, it takes an insignificant amount of time compared to manual oversampling. It should be noted that the overall data balancing process is extremely simple, as only the OB value must be specified. Once the OB is set, the optimal value for the
Methods such as MLSMOTE, REMEDIAL, ADASYN and MLTL are not applicable in the case of medical imaging, since those methods perform balancing via new data generation. New data samples are generated via interpolation of existing data samples. Such a data generation process makes an assumption that interpolation between neighboring data samples creates a new valid data sample, whereas in the image context, this is not true. Interpolation between images does not yield a valid image. Furthermore, such interpolation may not be possible at all when images are of different sizes. On the contrary, our method does not make any assumptions about the underlying structure of the data and works by solely duplicating some of the data samples, which makes it a general approach to balancing multi-label data or any data that can be described in a multi-label framework from Section 3.1.
Average class frequencies for different cross-validation folds (training set)
Average class frequencies for different cross-validation folds (training set)
The results of data balancing are shown in Table 6. The table contains average class frequencies for training sets of each fold obtained with our method, manual oversampling, as well as class frequencies of unbalanced folds.
In order to assess the performance of the trained neural network, the conventional metrics such as recall or precision are not well suited, since some of the classes occupy very little area in images. The occupied area percentage for each class is shown in Table 7. Therefore, apart from precision, recall and F-score, we report per class Dice [42] values, which is a common metric employed in biomedical applications. It is well suited for cases when some classes occupy little area in images.
Percentage of occupied image area by every class in the dataset
Percentage of occupied image area by every class in the dataset
It is worth noting that the issue with little area adds a new layer of complexity to the problem in the form of “intra-class” imbalance, when some classes are not only underrepresented in the dataset as a whole, but are also underrepresented within a single data sample (occupy only a few pixels of an image). Though it is not explicit, the proposed algorithm works with an assumption that every class is equally well represented within individual data samples and that each class is equally difficult to learn. Addressing this limitation will be the subject of our future research. Nevertheless, the experimental results show that the proposed method is able to bring improvements even in such a complex setting.
Biomedical specificity does not allow one to obtain large amounts of data. Hence, datasets are often small and unbalanced. The dataset used in our experiment consisted of only 115 images, and aggressive data augmentation was required to prevent the neural network from overfitting to the data.
The data were augmented dynamically during training using random flips and rotations [51] in the range of [
Biological tissue is prone to deformations. In order to make the network invariant to such corruptions, elastic image deformation was used, as it is an effective technique commonly employed in biomedical applications. An example of applying elastic augmentation to an image is shown in Figs 4 and 5.
It should be noted that the data augmentation techniques used here may affect classes within the data samples. For example, elastic image deformation may change the area that is occupied by some classes within the image. However, our method does not take this into consideration and assumes that within a data sample, each class is equally well represented.
Results of the experiment on fundus data. First half of the table contains Dice values for the experiment with no data augmentation. The second half is analogous but for the experiment with data augmentation. The best values are highlighted in bold
Results of the experiment on fundus data. First half of the table contains Dice values for the experiment with no data augmentation. The second half is analogous but for the experiment with data augmentation. The best values are highlighted in bold
Results of the experiment on fundus data. Contains average values for recall, precision and F-score. The best values are highlighted in bold
To analyze the effects of augmentation on final results, we conducted a second experiment in which the CNN was trained with no augmentation.
Original fundus image.
Elastic augmented fundus image.
In order to perform image segmentation, a CNN was built and trained. U-Net [52] was chosen as the base architecture of the neural network, as it was built specifically for biomedical segmentation purposes. Xception-65 [53] pre-trained on the ImageNet dataset [54] was used as a feature extractor. The network was trained using the Adam optimizer [46] and the learning rate was set to 0.0045. FocalLoss [29] was used as a loss function as it was proved to be effective in class imbalance settings. The network was trained for 2000 iterations, and the batch size was set to eight images.
Results
Tables 8 and 9 show that training on the data balanced by the proposed method data yielded better results compared to training on the original data and manually balanced data. Better values are indicated with bold. Similar to Tables 5.1–5.3, we use italics to denote results that are better than those obtained with original (unbalanced) data.
Predicted binary mask for 4th class.
An example of the CNN’s predictions is shown in Fig. 6. Figure 7 shows the corresponding ground-truth segmentation mask, giving an effective illustration of how the proposed method increases the sensitivity of a learning algorithm, corresponding with the experimental results presented in Section 4.1.3.
The proposed method outperforms manual oversampling for every minority class (5, 6, 7) in the experiment where no augmentation was used. However, the Dice is worse for classes 2 and 8. The reason for this sudden drop seems to be that the network is prone to memorize those particular classes in the presence of a large number of duplicates. It can be seen from the results of the experiment with data augmentation that it is crucial to perform augmentation to prevent the CNN from overfitting on those particular classes. The proposed method does not take into account such specifics of individual classes and assumes that every class is equally well represented in a data sample and is equally difficult to learn by a learning algorithm. We will be addressing this particular issue in our future research.
In the case of using data augmentation, balancing using the proposed method still outperforms the other two in terms of mean Dice. Dice, for most of the classes, is higher compared to both manual oversampling and the original (unbalanced) dataset.
An important factor to note is that data balancing via oversampling was challenging and time consuming, yet it did not yield considerably better results compared to the unbalanced version. On the contrary, it was easy to balance data using the proposed method, which also yielded better results compared to both manual oversampling and the original dataset.
The overall results suggest that the proposed method allows us to reduce the influence of class imbalance in an image dataset on the learning algorithm. Although this setting is much more complex since labels are segmentation masks instead of simple binary vectors and classes are not represented equally well among data samples, the proposed method still allows us to improve upon original segmentation results.
In this section, we describe a similar experiment performed on the Breast Cancer Semantic Segmentation (BCSS) dataset, which was first introduced in [55]. The dataset was balanced using the proposed method and then used to train a convolutional neural network (CNN) to perform image segmentation. Similar to the previous experiment, the results obtained with the proposed method are compared to those of manually balanced data and the original unbalanced data. No results were obtained for the other data balancing approaches since they cannot be applied to this kind of data due to the specifics of their data balancing process.
Ground-truth binary mask for 4th class.
In this subsection, we detail only cross-validation info, data preprocessing and balancing of the dataset. The rest of the experimental settings were similar to those described in Sections 4.2.4, 4.2.5 and 4.2.6.
The BCSS dataset consists of 151 images of cancer tissue and contains 21 classes in total. In our experiments, only eight classes were used; the rest of the classes were labeled as background. The list of the classes is as follows: background (0), tumor (1), stroma (2), lymphocytic infiltrate (3), necrosis or debris (4), blood (5), exclude (6), fat (7) and plasma cells (8). Similar to the fundus dataset, this dataset was used for training a CNN, the input of which is a cancer tissue image and the output is a corresponding segmentation mask.
Cross-validation sets
Similar issues are encountered in the case of the BCSS dataset: some of the classes are extremely underrepresented and are contained only within a few samples. Figure 8 shows class distribution among all samples in the dataset; minority classes are marked with dashed lines.
To ensure representativeness of the CNN’s evaluation, k-fold cross-validation was used. Similarly, three folds with nonoverlapping validation sets were created to ensure independence of the validation results. Each validation set consisted of 16 images and each training set consisted of 135 images. The final evaluation results were obtained by averaging among the three folds.
Data preprocessing
In contrast to the fundus dataset, images in the BCSS dataset vary in sizes, ranging from 1000 pixels to 9000 pixels in width and height. To solve the issue, each image was sliced into nonoverlapping pieces of 900
This procedure was applied to both training and validation sets. After preprocessing, each fold contained roughly 2150 training images and 220 validation images.
Numbers of images that contain the i-th class.
The BCSS dataset was balanced in a similar manner to that of the fundus dataset. Each of the folds were balanced using the proposed method and via manual oversampling. In both cases, OB was set to 2.
In contrary to the fundus dataset, manual oversampling of the BCSS dataset was much harder since its training sets contained 2150 images, whereas in the fundus dataset, there was only 99. During balancing, a similar issue was encountered as in the case of the fundus dataset – increasing the frequency of one minority class led to a decrease in frequencies of other minority classes. In our experience, manual balancing of a single data fold took up to several hours. It should be noted that careful manual data balancing on the scale of thousands of images becomes unfeasible and, as the results will show, may even harm the performance of the learning algorithm in some cases.
As in the case of the fundus dataset, applying the proposed algorithm was easy and allowed us to balance the BCSS dataset in less than 10 minutes.
Average class frequencies for different cross-validation folds (training set)
Average class frequencies for different cross-validation folds (training set)
Percentage of occupied image area by every class in the dataset
Results of the experiment on BCSS data. The best values are highlighted in bold
Results of the experiment on BCSS data. Average values for recall, precision, F-score and AUC
Since the BCSS dataset consists of images, the application of methods such as MLSMOTE, REMEDIAL, ADASYN and MLTL was not an option. These algorithms rely on data interpolation in order to generate new data samples, and in the case of images, this leads to the generation of degenerate data.
The results of data balancing are shown in Table 10. It contains average class frequencies for training sets of each fold obtained with our method, manual oversampling as well as class frequencies of unbalanced folds. We also provide information about the area each class occupies in an image in Table 11.
Tables 12 and 13 show that the proposed method performs better on average compared to the results obtained with manual oversampling and original (unbalanced) data. The best values are indicated in bold; italics are used to denote results that are better than those obtained with original (unbalanced) data. Although gains are not very pronounced, the proposed method does bring an improvement in most of the classes. It should be noted that with the proposed method, the CNN is able to recognize the background class (class 0). However, a drop in Dice values for some classes can be observed. From this, we can conclude that increasing the frequency of a class may not necessarily lead to improvements in this class. As in the case of the fundus dataset, the proposed method may trade off the classification quality of some classes for other classes. We relate such behavior to the classes’ dynamics, which our method does not consider during balancing. Our future research will be devoted to addressing this issue.
As for manual oversampling, it performs worse on average even compared to unbalanced data. Though careful balancing was attempted, no considerable improvement was attained. This shows the importance of automatic balancing in cases with large numbers of data samples. The proposed method not only allows us to perform data balancing quickly, but also executes it more optimally compared to manual balancing, without hurting the algorithm’s classification performance on average.
The overall results suggest that the proposed method does bring an improvement to the original classification results, yet it may not be the case for some classes. The method may trade off performance in some classes for other classes, but this happens in a minority of classes, and in some cases, may be tolerated considering gains.
Conclusion
In this paper, we proposed a method for biomedical multi-label data balancing. The method can be applied in the case of semantic segmentation problems for balancing the corresponding image data. The other key feature of the proposed method is that it does not make any assumptions regarding the underlying structure of the data to be balanced. In other words, the proposed method can be applied to multi-label data or other data that can be described using a multi-label framework, making it general and widely applicable.
A thorough experimental study of the proposed method was conducted. Firstly, to support our claim about the wide applicability of the proposed method, we applied our method to seven common multi-label datasets and compared the results to those of other multi-label data balancing approaches, such as MLTL, MLSMOTE and REMEDIAL. The results of the experiments show that the proposed method allows us to bring a consistent improvement in classification results in almost all cases, making it a competitive alternative to existing approaches. Secondly, we applied the proposed method to two highly unbalanced biomedical image datasets, to which other data balancing approaches could not be applied. The initial biomedical image segmentation problem was reformulated using a multi-label framework in order to perform balancing of the image data. The results obtained with the proposed method were compared to those obtained for a manually balanced version of the datasets as well as their original (unbalanced) versions. The experimental results show that the method allows us to reduce the influence of class imbalance on the learning algorithm, improving its original classification results.
However, it should be noted that the improvements may not be achieved for all the classes. This is related to how classes behave within a data sample and the dataset as a whole. The proposed method is not sensitive to such dynamics, assuming that each class is equally well represented within a data sample and is equally hard to learn, which can be considered a drawback. Future research will be devoted to addressing this issue. One possible research direction is to incorporate a form of weighing individual data samples that would indicate how well a certain data sample represents a particular class. Another research direction is to use the proposed method in conjunction with more sophisticated and advanced learning algorithms, such as [56, 57, 58, 59]. Nevertheless, drops in performance in other classes can be tolerated considering the gains. It also should be noted that balancing using the proposed method is fast, especially in comparison to manual balancing, which may take up to several hours.
The overall results indicate that the proposed method can be successfully used to balance multi-label data. It allows us to perform data balancing quickly and still attain comparable or better results compared to manual oversampling, which may be difficult and time consuming. It does not rely on feature space to perform oversampling, unlike MLSMOTE, MLTL and REMEDIAL, which makes it superior in cases when such reliance is impossible (for example, image data).
Footnotes
Project is available via the following link:
Acknowledgments
This work was funded by the Russian Foundation for Basic Research under RFBR grant # 19-29-01135 and the Ministry of Science and Higher Education of the Russian Federation within a government project of Samara University and FSRC “Crystallography and Photonics” RAS.
