Multiple independent losses scheduling: A simple training method for deep neural networks

Abstract

In recent years, various loss functions have been proposed to boost the performance of deep neural networks. Every loss function has its own specific theoretical motivation, and can easily learn its preference features of training data compared with other loss functions. Thus, combining multiple loss functions to capture more data features becomes an attractive idea for model performance improvement. In this paper, instead of using a single loss function or a linear weighted sum of multiple loss functions, we present the method named Multiple Independent Losses Scheduling (MILS), which allows multiple loss functions to independently participate in the training process according to their performance. Specifically, for all candidate loss functions, one loss function will be predefined as the primary loss function before training, and the other loss functions will play auxiliary roles for possible contributions to improve the model performance. In order to avoid auxiliary loss functions bringing a negative effect on the model performance in the training process, we developed a simple but effective performance-based scheduling algorithm to prevent auxiliary loss functions from dragging down the model performance. Extensive experiments using various deep architectures on various recognition benchmarks demonstrate our scheme is simple, robust, lightweight, and effective for typical classification tasks.

Keywords

Multiple independent losses scheduling, loss function, preference features, deep neural network models

1. Introduction

Deep convolutional neural networks [1, 2] have been widely used to achieve state-of-the-art performance in speech recognition [3, 4], visual object recognition [2, 5, 6], and object detection [7, 8, 9]. They allow computational models composed of multiple processing layers to learn feature representations of data at multiple levels. Since many practical applications can be regarded as parametric function approximation problems, deep learning approaches often define a mapping $y=f(x;\theta)$ and try to learn the values of the weight parameters $\theta$ that result in the best function approximation. More specifically, a deep learning model with its loss function can be considered as learning a mapping function that maps training data into a feature space where the task can be accomplished. To achieve this aim, however, the deep model is supposed to approximate the correct output even for inputs that were not shown during training. For many deep learning tasks, the training data are not homogeneous: they follow complex distributions, which often makes a deep model incapable of capturing all the desired features. Consequently, a perfect function approximation based on all desired features is almost impossible.

An important aspect of the design of a deep model lies in the choice of the loss function, which can affect the learning dynamics and final performance of the deep model. The loss function is like a ruler for the training process: it is used to measure the model and guide its training procedure. Therefore, the design of the loss function is as important as the design of the model architecture. Although early loss functions are very simple, they still achieve excellent performance and are widely used in various deep learning tasks. A typical example is the softmax cross-entropy loss (softmax loss for short) for image classification tasks, which measures the error between model outputs and data labels. In recent years, to further improve the performance of deep models, researchers have developed specially designed loss functions [10, 11, 12, 13, 14, 15] for targeted tasks, and these loss functions are often formalized as a linear weighted sum of multiple loss functions. Essentially, such a joint loss function can still be considered a single loss function, which is used throughout the whole training process for weight parameter estimation.

Figure 1.

Feature space of training data. (a) The features in the left circle can be learned by loss A, and the features in the right circle can be learned by loss B. (b) In an ideal situation, all features can be learned by combining loss A and loss B.

Obviously, every loss function has its own specific theoretical motivation [16] and learns its preferred features of the training data more easily than other loss functions do. Suppose we do not have any intuition or special knowledge about a specific application or dataset: how can we find an adaptive and effective approach for learning more meaningful features in a general way? As shown in Fig. 1, a simple and straightforward idea is to use multiple independent loss functions, so that two different loss functions together obtain more meaningful features than a single one. Thus, combining multiple independent loss functions to learn more data features becomes an attractive idea for improving deep network performance. In this paper, instead of using a single loss function or a linear weighted sum of multiple loss functions (joint loss), we present an approach named Multiple Independent Losses Scheduling (MILS). MILS allows multiple loss functions to participate in the training process alternately and independently. Compared with continuous training using a single loss function, MILS introduces only a little overhead but brings a clear performance improvement. Extensive experiments using various deep network architectures on various recognition benchmarks demonstrate that our proposal is simple, robust, lightweight, and effective for typical classification tasks. The original contributions of this paper are highlighted as follows:

• To the best of our knowledge, we are the first to consider the independent use of multiple loss functions in one training process, which aims to further utilize training datasets without modifying the model architecture.

• We develop a simple but effective training approach with multiple independent loss functions. The key insight is to allow the multiple loss functions to introduce their preference features of the training data, so that the deep model can learn more meaningful features.

• Various deep network architectures trained with the proposed approach show a clear performance boost on different object recognition benchmarks, and the source code is publicly available.

The remainder of this paper is structured as follows. We give a brief overview of related work in Section 2 and illustrate the motivation of MILS in Section 3. Then, we describe the details of MILS in Section 4. Section 5 evaluates our proposal through extensive experiments. Finally, we summarize the paper in Section 6.

2. Related works

The loss function is a critical part of any machine learning or deep learning model: it defines an objective against which the performance of the model is measured, and the weight parameters learned by the model are determined by minimizing the chosen loss function. Some common loss functions have been widely used in various tasks, such as hinge loss, softmax cross-entropy loss, and KL divergence loss. Although these common loss functions achieve acceptable performance in various tasks, researchers still want to develop specially designed loss functions to improve model performance further.

Joint loss functions

Large-margin Gaussian Mixture loss [15] is proposed for classification and makes the intuitive assumption that the deep features of the training dataset follow a Gaussian Mixture distribution. It has been shown to outperform the softmax cross-entropy loss function in classification performance. Similarly, focal loss [13] addresses the extreme foreground-background class imbalance in object detectors. With such knowledge, it reshapes the standard softmax cross-entropy loss to down-weight the loss assigned to well-classified examples, and trains a high-accuracy detector that significantly outperforms others. The approach of agreement learning [17] defines an objective function that incorporates both data likelihood and a measure of agreement between two models. Besides the large-margin Gaussian Mixture loss and focal loss discussed above, center loss [11] was first proposed to learn a center for the deep features of each class and to penalize the distances between the deep features and their corresponding class centers. Center loss achieves state-of-the-art accuracy on several important face recognition benchmarks. Similarly, angular softmax loss [14] and other studies, including large-margin softmax loss [10], quadruplet loss [12], large margin cosine loss [18], and additive angular margin loss [19], also aim at obtaining deep features that satisfy two key learning objectives, inter-class dispersion and intra-class compactness, as far as possible. Zhen et al. [20] investigated a nonlinear combination of multiple loss functions, which allows adaptively tuning the weights of different loss functions to guide the optimization. However, it does not seem easy to tune for every specific task or dataset.

Regularization methods

There has also been much work dealing with loss functions in the form of $L1$ and $L2$ regularization schemes. Efforts with $L1$ penalties have specifically been applied to train very sparse networks [21] or even binary neural networks [22, 23] with the goal of faster computation. Recently, a mode seeking regularization term [24] was proposed to address the mode collapse issue for conditional generative adversarial networks. The studies with $L2$ penalties, also called weight decay, often alleviate model over-fitting or improve adversarial examples [25]. Elastic-net regularization, a linear combination of $L1$ and $L2$ penalties, is used to solve high-dimensional feature selection problems or to attack deep neural networks via adversarial examples [26]. A multi-loss regularization framework [16] is proposed for the generalization of deep neural networks in two stages: the model with the single-core-loss function is pretrained first, and the outputs with different loss functions are fused with average pooling to produce the final prediction in the testing stage.

Generally, multiple loss functions have already been used for various tasks and applications. Still, previous studies often combine multiple loss functions or penalties into a new loss function (a weighted sum of multiple loss functions), which behaves like a single loss function and can therefore only learn a single feature preference. Unlike the existing usage of multiple loss functions, we propose MILS, which uses multiple loss functions independently and alternately in the training process. MILS lets the deep model learn different feature preferences according to the model performance achieved with the different loss functions.

3. Motivation

Suppose the deep network architecture, training dataset, and hyperparameters (such as learning rate and weight decay) are defined in advance; in that case, the choice of loss function will decide the final performance of the model. In general, based on knowledge of the training dataset and past experience in related work, researchers choose an appropriate existing loss function or develop a new loss function to enhance the model performance as much as possible. However, we believe that although a single loss function may significantly outperform other loss functions in the scenario mentioned above, deep models trained with this loss function cannot correctly classify all test samples that can be correctly classified by deep models trained with other loss functions. The basis of MILS lies in the assumption that various loss functions have various feature preferences, and that these feature preferences can be stacked within one training process.

Figure 2.

Experimental results for motivation. (a) Test error on the CIFAR-100 validation dataset (softmax loss vs. multi-margin loss). (b) We independently train ResNet-20 with softmax loss for 10 runs; 959 samples in the CIFAR-100 validation dataset are misclassified by every one of these models. (c) 92 of the samples in (b) are correctly classified by the ResNet-20 model trained with multi-margin loss.

In order to verify this assumption, we empirically conduct a classification experiment on the CIFAR-100 dataset [27]. To evaluate the performance of softmax loss and multi-margin loss, we use each of them separately to train ResNet-20 [6]; except for the choice of loss function, all other hyperparameters are the same. Figure 2a shows that the model trained with softmax loss significantly outperforms the model trained with multi-margin loss and achieves the lowest error rate of 31.3%. We make two interesting and important observations in Fig. 2b and c. First, we trained ResNet-20 (ResNet with a depth of 20) for 10 independent runs with the softmax loss and therefore obtained 10 ResNet-20 models with different parameters. We found 959 fixed samples (9.59% of the CIFAR-100 validation dataset) that were misclassified by all of these ResNet-20 models. On the other hand, there are 4116 fixed samples (41.16% of the CIFAR-100 validation dataset) that are always correctly classified by every one of these ResNet-20 models. We have grounds to believe that the features related to the 959 fixed samples mentioned above are challenging to learn for a model trained with the softmax loss. Second, although the softmax loss significantly outperforms multi-margin loss in the classification accuracy of ResNet-20, 92 of the 959 fixed samples mentioned previously are correctly classified by the ResNet-20 model trained with multi-margin loss. The phenomenon discussed above reflects that a specific loss function is sensitive to some features in the feature space of the training data but not sensitive to other features.
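The counting behind Fig. 2b and c can be sketched with plain set operations. The snippet below is only an illustration under stated assumptions: the function names and the toy index sets are hypothetical, and the real experiment would use the misclassified validation indices of the 10 softmax runs and of the multi-margin model.

```python
from functools import reduce


def always_misclassified(runs):
    """Indices misclassified in every run (intersection over all runs)."""
    return reduce(set.intersection, (set(r) for r in runs))


def recovered_by(other_misclassified, hard_set):
    """Hard samples that a differently trained model classifies correctly."""
    return hard_set - set(other_misclassified)


# Toy data (hypothetical): misclassified validation indices of three softmax runs
# and of one model trained with a different loss.
softmax_runs = [{1, 2, 3, 7}, {2, 3, 7, 9}, {0, 2, 3, 7}]
multi_margin_errors = {3, 4, 5, 9}

hard = always_misclassified(softmax_runs)
recovered = recovered_by(multi_margin_errors, hard)
print(sorted(hard), sorted(recovered))
```

In the paper's experiment, the first set would contain the 959 fixed samples and the second set the 92 samples recovered by multi-margin loss.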

Based on the above observations, we try to find a general way for multiple loss functions to join the training process independently so that the model learns more meaningful features. However, in a training process using multiple independent loss functions, some loss functions may inevitably degrade network performance. As a result, deploying multiple loss functions in training to efficiently optimize deep models is a challenging task in practice. How do different loss functions take effect separately in training? What changes need to be made to the training process? How can the effects of these loss functions be measured? How should they be scheduled to achieve better performance? In Section 4, we present a detailed design based on the MILS strategy.

4. MILS: Multiple independent losses scheduling

4.1 Intuition

Commonly, only a single loss function is involved in the training process for typical classification tasks. Sometimes, this single loss function is formalized as a linear weighted sum of multiple loss functions. Either way, the same loss function is used for training the deep neural network in all training iterations. Why not use different loss functions to train the deep model in different training iterations?

Inspired by this question, we search for a general scheme to combine multiple independent loss functions during the training process. The fundamental issue is how to ensure that multiple loss functions do not degrade the model performance during training. Thus, we need an evaluation criterion to measure the impact of one loss function on the model performance in a training iteration. To identify this impact, we introduce the concept of an observation dataset and evaluate the impact of the current loss function based on the model’s classification accuracy on the observation dataset. In general, creating a new dataset is very difficult, so randomly copying some samples from the training dataset is a reasonable way to build an observation dataset. Suppose one loss function has a negative effect on the model performance, as measured by the classification accuracy on the observation dataset of the model updated by this loss function. In that case, we need to reload the previous model parameters and use another loss function to train the model again. Consequently, if one loss function fails to improve the model performance, it causes two parameter updates and one parameter reload in this training iteration. That means extra overhead is incurred to reduce the negative effect of this loss.

As mentioned above, training deep models using multiple independent loss functions may inevitably bring high computational costs. Thus, we need a scalable framework that schedules the participation of multiple loss functions in one training process and alleviates the related computation costs to an acceptable level.

4.2 Framework

Figure 3.

Illustration of MILS. If $p\geqslant\mathcal{P}$ , the model parameters will be updated using the primary loss function. Otherwise, the model may be updated using an auxiliary loss function.

As shown in Fig. 3, we develop a framework that can combine multiple different loss functions for training deep neural networks. Specifically, among all candidate loss functions, one loss function is chosen as the primary loss function before training starts (for example, we select softmax loss as the primary loss function), and the other loss functions (such as multi-margin loss, center loss, focal loss, and L-GM loss) play auxiliary roles for possible contributions to the model performance. For example, if we use softmax loss and center loss as the primary loss and auxiliary loss, respectively, this MILS version loss can be formulated as:

$\displaystyle\mathcal{L}=\left\{\begin{array}{ll}-\sum_{i=1}^{m}\log\frac{e^{W_{y_{i}}^{T}x_{i}+b_{y_{i}}}}{\sum_{j=1}^{n}e^{W_{j}^{T}x_{i}+b_{j}}},&\text{with probability }1-\mathcal{P}\\ \frac{1}{2}\sum_{i=1}^{m}\|x_{i}-c_{y_{i}}\|_{2}^{2},&\text{with probability }\mathcal{P}\end{array}\right.$ (1)
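As a rough numerical illustration of the two branches of Eq. (1), the following sketch evaluates a summed softmax loss and a center loss on a toy batch. It is a minimal sketch, not the authors' implementation: the function names are hypothetical, and the logits $W_{j}^{T}x_{i}+b_{j}$ are assumed to be precomputed and passed in directly.

```python
import math


def softmax_loss(logits, labels):
    """Sum over the batch of -log softmax(logits_i)[y_i] (first branch of Eq. (1))."""
    total = 0.0
    for z, y in zip(logits, labels):
        m = max(z)  # subtract the maximum for numerical stability
        log_norm = m + math.log(sum(math.exp(v - m) for v in z))
        total += log_norm - z[y]
    return total


def center_loss(features, labels, centers):
    """0.5 * sum of squared distances to the class centers (second branch of Eq. (1))."""
    return 0.5 * sum(
        sum((a - b) ** 2 for a, b in zip(x, centers[y]))
        for x, y in zip(features, labels)
    )


# Toy batch with one sample and two classes (hypothetical values).
toy_softmax = softmax_loss([[0.0, 0.0]], [0])
toy_center = center_loss([[1.0, 2.0]], [0], [[0.0, 0.0], [5.0, 5.0]])
print(toy_softmax, toy_center)
```

With equal logits over two classes the softmax term equals $\log 2$, and the center term is half the squared distance between the feature and the center of its class.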

For the MILS strategy, we define a participation rate $\mathcal{P}\in[0,1]$ to indicate the probability that an auxiliary loss may update the weight parameters of the deep model. For a model with parameters $\textit{para}_{i-1}$ after $(i-1)$ training iterations, a random number $p\in[0,1]$ is drawn to decide whether an auxiliary loss is deployed at the $i$ -th training iteration. If $p\geqslant\mathcal{P}$ , as shown in the top part of Fig. 3, the model parameters are updated at the $i$ -th iteration by the primary loss function (usually softmax loss), and new model parameters $\textit{para}_{i}$ are obtained. Otherwise, if $p<\mathcal{P}$ , as shown in the bottom part of Fig. 3, samples from the observation dataset are fed to the model with parameters $\textit{para}_{i-1}$ before the start of the $i$ -th iteration, and its performance is scored as $\textit{score}_{1}$ . Then, the auxiliary loss is deployed to train this model. After the model’s parameters are updated by the auxiliary loss, we evaluate the model with the new parameters $\textit{para}_{i}$ on the observation dataset and obtain the score $\textit{score}_{2}$ . If $\textit{score}_{2}>\textit{score}_{1}$ , the auxiliary loss has a positive effect on the model performance, and the model parameters $\textit{para}_{i}$ are kept for the next training iteration. Otherwise, we reload the model parameters $\textit{para}_{i-1}$ , and the network is trained with the primary loss function instead, yielding $\textit{para}_{i}$ . In essence, training with multiple independent losses can be regarded as a heuristic scheduling algorithm, and we discuss the details of this strategy in the next subsection.

4.3 Scheduling algorithm

MILS can be regarded as a scheduling algorithm that schedules the use of loss functions in the training process based on their performance on the observation dataset. The pseudocode of MILS is shown in Algorithm 4.3. MILS has two main hyperparameters, i.e., the participation rate $\mathcal{P}$ and the observation parameter $k$ . The observation dataset consists of $k\times\textit{training}\_\textit{batchsize}$ samples randomly picked from the training dataset.

Algorithm 4.3: The pseudocode of the MILS scheduling algorithm

Input: The model and optimizer parameters after $(i-1)$ iterations of training: $\textit{Para}_{i-1}$ and $\textit{Optim}_{i-1}$ ; the participation rate $\mathcal{P}$ and the observation parameter $k$ .
Initialize the probability for iteration $i$ as $p$ .
if $p<\mathcal{P}$ then
    save ( $\textit{Para}_{i-1}$ ); save ( $\textit{Optim}_{i-1}$ );
    Observation Dataset (OD): randomly pick $k$ training batch size samples from the training dataset;
    $\textit{score}_{1}=$ Accuracy (Model with $\textit{Para}_{i-1}$ test on the OD);
    Update model parameters as $\textit{Para}_{i}$ by Auxiliary Loss;
    $\textit{score}_{2}=$ Accuracy (Model with $\textit{Para}_{i}$ test on the OD);
    if $\textit{score}_{2}<\textit{score}_{1}$ then
        Reload ( $\textit{Para}_{i-1}$ ); Reload ( $\textit{Optim}_{i-1}$ );
        Update model parameters as $\textit{Para}_{i}$ by Primary Loss;
    end if
else
    Update model parameters as $\textit{Para}_{i}$ by Primary Loss;
end if
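The scheduling logic can be sketched in a framework-agnostic way as follows. This is a minimal sketch under stated assumptions rather than the released implementation: `mils_step`, the callables it receives, and the toy one-weight "model" are all hypothetical, the `score` callable is assumed to hide the sampling of the observation dataset, and ties between the two scores keep the auxiliary update (the pseudocode reloads only when the score drops).

```python
import copy
import random


def mils_step(state, primary_update, auxiliary_update, score, rate, rng=random):
    """Run one MILS iteration; return (new_state, label of the update that was kept)."""
    p = rng.random()                      # random number p for this iteration
    if p >= rate:                         # primary loss only, no scoring needed
        return primary_update(state), "primary"

    saved = copy.deepcopy(state)          # save Para_{i-1} and Optim_{i-1}
    score_1 = score(saved)                # accuracy on the observation dataset
    candidate = auxiliary_update(copy.deepcopy(saved))
    score_2 = score(candidate)
    if score_2 < score_1:                 # auxiliary update hurt: reload, use primary
        return primary_update(saved), "primary_after_reload"
    return candidate, "auxiliary"


# Toy stand-ins (hypothetical): the "model" is one weight whose best value is 1.
def toy_score(state):
    return -abs(state["w"] - 1.0)


def toy_primary(state):
    state["w"] += 0.1
    return state


def toy_harmful_aux(state):
    state["w"] -= 0.5
    return state


def toy_helpful_aux(state):
    state["w"] += 0.5
    return state


# rate=1.0 forces the auxiliary branch so that both outcomes can be observed.
rejected, rejected_kind = mils_step({"w": 0.0}, toy_primary, toy_harmful_aux, toy_score, rate=1.0)
accepted, accepted_kind = mils_step({"w": 0.0}, toy_primary, toy_helpful_aux, toy_score, rate=1.0)
print(rejected_kind, accepted_kind)
```

In a real training loop, `state` would bundle the model and optimizer parameters, and the two update callables would each run one optimization step with the corresponding loss.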

In MILS, we initialize a probability $p$ for each new training iteration and use $\mathcal{P}$ to control the probability of the auxiliary loss function participating in the training process. We use the value of $k$ to adjust the scoring criterion for the model performance. The scoring criterion used in MILS is accuracy, i.e., the fraction of samples whose predicted label matches the true label. Intuitively, a larger $\mathcal{P}$ means that the model can learn more features that are sensitive to the auxiliary loss function. On the other hand, the different gradient directions generated by different loss functions may degrade the model performance; it is hard to guarantee that all gradients yielded by the loss functions are favorable to the model.

As shown in Algorithm 4.3, in every training iteration in which an auxiliary loss is deployed, we randomly pick $k\times\textit{training}\_\textit{batchsize}$ samples from the training dataset to score the model performance. Obviously, a larger $k$ increases the computational cost of the training process. In our opinion, a $k$ that is too small may lead to an inaccurate evaluation of model performance, and a $k$ that is too large may lead to an insufficient update of the model parameters by the auxiliary loss function. Notably, MILS is only involved in the training process, and the softmax loss is the only loss used in the testing process.
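Building the observation dataset can be sketched as drawing $k\times\textit{training}\_\textit{batchsize}$ indices from the training set. The helper below is a hypothetical illustration; sampling without replacement is an assumption, since the text only states that the samples are picked randomly.

```python
import random


def sample_observation_indices(n_train, batch_size, k, rng=random):
    """Pick k * batch_size distinct training indices for the observation dataset."""
    # Assumption: sampling without replacement; the paper only says "randomly".
    return rng.sample(range(n_train), k * batch_size)


# Example: 50,000 training images, batch size 128, k = 1.
indices = sample_observation_indices(50_000, 128, 1, random.Random(0))
print(len(indices))
```

Since the model is scored on these samples both before and after the auxiliary update, each auxiliary attempt costs two evaluation passes over $k$ batches.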

5. Experiments

We conduct extensive experiments to test the effectiveness of our proposal on the prevalent SVHN [28], CIFAR-10/100 [27], and Tiny-ImageNet (http://tiny-imagenet.herokuapp.com/) datasets by using different popular deep network architectures such as AlexNet [2], VGGNet [29], ResNet [6], PreResNet [30], and DenseNet [31]. We use five different loss functions as the candidates for our proposal, i.e., softmax loss, multi-margin loss, focal loss, center loss, and L-GM loss. For simplicity, in most cases, we use the softmax loss as the primary loss, and the other loss functions play auxiliary roles for possible contributions to the model performance. In order to alleviate the computational cost, we do not evaluate the model performance in every iteration trained by the primary loss function.

In the training process, only one loss will be chosen as an auxiliary loss. All experiments are carried out using the PyTorch framework (https://pytorch.org/) on 4 $\times$ NVIDIA P100 or 4 $\times$ M40 GPUs. Since baselines are usually overfitted for the long training scheme and have lower validation accuracy at the end of the training, we record the highest validation accuracy over the full training course for a fair comparison. In this paper, if not specified, all experimental results that have been reported are averaged over 4 runs.
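The reporting protocol (best value over the training course, then averaged over runs) can be sketched as below. The helper name and the error curves are hypothetical; the highest validation accuracy is expressed here as the lowest validation error.

```python
def best_then_mean(error_curves):
    """Take the lowest error of each run, then average over the runs."""
    best_per_run = [min(curve) for curve in error_curves]
    return sum(best_per_run) / len(best_per_run)


# Hypothetical validation-error curves (%) of two runs, one value per epoch.
toy_curves = [[5.0, 4.0, 4.5], [6.0, 4.2, 4.4]]
reported = best_then_mean(toy_curves)
print(reported)
```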

5.1 Effectiveness of MILS on image classification

SVHN

SVHN is a real-world image dataset for developing machine learning and object recognition algorithms with minimal requirement on data pre-processing and formatting. The SVHN dataset contains 73,257 training samples and 26,032 testing samples.

We compare the softmax loss, the center loss, and their MILS version (softmax loss as the primary loss and center loss as the auxiliary loss, with a participation rate of $\mathcal{P}=0.4$ and an observation parameter of $k=1$ ) by visualizing their learned 2D feature spaces for the SVHN dataset on ResNet-20 (ResNet with a depth of 20). The feature embeddings on the training dataset with the different loss functions are illustrated in Fig. 4. Compared with the original center loss, the MILS version of center loss brings stronger intra-class compactness and larger inter-class separability.

Table 1
The baselines and experimental results of the ResNet-20 trained with various loss functions and MILS loss on the SVHN dataset

Baseline
Softmax	Multi-margin	Focal	Center	L-GM
4.29	5.25	5.24	3.80	4.19
MILS
Primary loss	Auxiliary loss	P-Rate ( $\mathcal{P}$ )	O-Param ( $k$ )	Test error (%)
Softmax	Multi-margin	0.01	1	4.13
	Focal	0.01	1	4.15
	Center	0.4	1	3.57
	L-GM	0.01	1	4.03

Figure 4.

Two-dimensional feature embedding on the SVHN training set. (a) Softmax loss. (b) The joint supervision of softmax loss and center loss [11]. (c) MILS schedule with center loss as auxiliary loss function with participation rate of $\mathcal{P}=0.4$ and observation parameter of $k=1$ .

We select the softmax loss as the primary loss and choose one of multi-margin loss, center loss, and L-GM loss as the auxiliary loss to train ResNet-20. We set the participation rate to $\mathcal{P}=0.01$ and randomly select $k$ training batches from the training dataset as the observation dataset, with $k=1$ . We trained the networks for 160 epochs; the learning rate is initially set to 0.1 and then divided by 10 at the 60th epoch and the 120th epoch. The weight decay is set to $1\times 10^{-4}$ . For a fair comparison, all training parameters are identical, including the learning rate, weight decay, etc.
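The step schedule described above can be written as a small function. This is a sketch under an assumption about indexing: epochs are counted from 1 and each division takes effect at the start of the milestone epoch; the function name and its parameters are hypothetical.

```python
def learning_rate(epoch, base_lr=0.1, milestones=(60, 120), factor=10.0):
    """Learning rate at a 1-indexed epoch for a step schedule."""
    # Assumption: the rate is divided at the start of each milestone epoch.
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr / (factor ** drops)


for epoch in (1, 59, 60, 119, 120, 160):
    print(epoch, learning_rate(epoch))
```

The same function covers the other schedules in this section by changing `milestones`, for example `(150, 225)` for CIFAR-10 or `(81, 122)` for Tiny-ImageNet.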

The test errors on the SVHN test set are presented in Table 1. We can see that the multi-margin loss is a little weaker than the four other loss functions. As shown in Table 1, if we use MILS to schedule softmax loss and multi-margin loss, the model achieves a test error of 4.13%. Compared with 4.29% and 5.25%, the baselines of softmax loss and multi-margin loss, MILS improves the model performance with very little overhead. Center loss (the joint supervision of softmax loss and center loss to train the deep models) performs very well on the SVHN dataset, but if we use MILS to combine softmax loss and center loss (only center loss involved) ( $\mathcal{P}=0.4$ , $k=1$ ), it achieves a better result (3.57%) than the center loss baseline (3.80%). It should be noticed that center loss significantly outperforms softmax loss on the SVHN dataset. Therefore, a participation rate of $\mathcal{P}=0.01$ cannot provide enough contribution to the model performance, and MILS (center loss with $\mathcal{P}=0.01$ and $k=1$ ) fails to outperform the baseline of center loss. However, if we raise the participation rate $\mathcal{P}$ , the MILS version of center loss performs better than the original center loss. In fact, by increasing the size of the observation dataset, we can further improve the model performance, but this brings a high computational overhead.

CIFAR-10 and CIFAR-100

CIFAR-10 and CIFAR-100 each consist of 32 $\times$ 32 pixel colored images, with 50K training images and 10K testing images. We adopt the standard data augmentation scheme, including mirroring with a probability of 0.5 and 32 $\times$ 32 random cropping after 4-pixel zero-padding on each side.
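The augmentation scheme can be illustrated on a single-channel image stored as a list of rows. This is a minimal pure-Python sketch with a hypothetical function name; an actual pipeline would typically rely on the image transforms of the training framework.

```python
import random


def augment(image, pad=4, rng=random):
    """Random horizontal mirror (p = 0.5), then a random crop after zero-padding.

    `image` is a list of H rows with W pixel values each (one channel);
    the output has the same H x W size as the input.
    """
    h, w = len(image), len(image[0])
    if rng.random() < 0.5:                          # mirror with probability 0.5
        image = [row[::-1] for row in image]
    zero_row = [0] * (w + 2 * pad)
    padded = (
        [zero_row[:] for _ in range(pad)]
        + [[0] * pad + list(row) + [0] * pad for row in image]
        + [zero_row[:] for _ in range(pad)]
    )
    top = rng.randint(0, 2 * pad)                   # inclusive crop offsets
    left = rng.randint(0, 2 * pad)
    return [row[left:left + w] for row in padded[top:top + h]]


toy_image = [[1] * 32 for _ in range(32)]
augmented = augment(toy_image, rng=random.Random(0))
print(len(augmented), len(augmented[0]))
```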

Table 2

The baselines and experimental results of the PreResNet-20 and VGG11-bn models trained by various loss functions and MILS losses on the CIFAR-10 dataset

VGG11-bn	Baseline
	Softmax	Multi-margin	Focal	Center	L-GM
	8.26	8.19	8.32	7.75	8.02
	MILS
	Primary loss	Auxiliary loss	P-Rate ( $\mathcal{P}$ )	O-Param ( $k$ )	Test error (%)
	Softmax	Multi-margin	0.01	1	8.01
		Focal	0.01	1	7.98
		Center	0.4	1	7.51
		L-GM	0.05	1	7.76
PreResNet-20	Baseline
	Softmax	Multi-margin	Focal	Center	L-GM
	7.89	7.83	7.63	7.35	7.55
	MILS
	Primary loss	Auxiliary loss	P-Rate ( $\mathcal{P}$ )	O-Param ( $k$ )	Test error (%)
	Softmax	Multi-margin	0.01	1	7.76
		Focal	0.01	1	7.42
		Center	0.4	1	7.16
		L-GM	0.05	1	7.43

For the CIFAR-10 dataset, we trained VGG11-bn and PreResNet-20 (PreResNet with a depth of 20). If not specified, the networks are trained with a batch size of 128 for 300 epochs. The learning rate is initially set to 0.1 and then divided by 10 at the 150th and 225th epochs. The weight decay is set to $1\times 10^{-4}$ . The softmax loss is predefined as the primary loss function, and one of the multi-margin loss, focal loss, center loss, and L-GM loss is chosen as the auxiliary loss function. Both loss functions participate in the training process according to the participation rate $\mathcal{P}$ . The test errors are shown in Table 2. To obtain the baselines, we trained the models ourselves, since the baselines reported in the original papers are worse than our results or the original papers do not report results on the CIFAR-10 dataset. The proposed MILS version loss functions outperform the baselines of softmax loss, multi-margin loss, focal loss, center loss, and L-GM loss on both VGG11-bn and PreResNet-20.

Table 3

The baselines and experimental results of deep models trained by various loss functions and MILS losses on the CIFAR-100 dataset

AlexNet	Baseline
	Softmax	Multi-margin	Focal
	55.85	58.90	60.07
	MILS
	Primary loss	Auxiliary loss	P-Rate ( $\mathcal{P}$ )	O-Param ( $k$ )	Test error (%)
	Softmax	Multi-margin	0.02	5	55.64
		Focal	0.02	5	55.37
VGG19-bn	Baseline
	Softmax	Multi-margin	Focal
	27.19	28.45	27.28
	MILS
	Primary loss	Auxiliary loss	P-Rate ( $\mathcal{P}$ )	O-Param ( $k$ )	Test error (%)
	Softmax	Multi-margin	0.01	20	26.85
		Focal	0.1	20	26.83
ResNet-50	Baseline
	Softmax	Multi-margin	Focal
	26.33	29.60	26.11
	MILS
	Primary loss	Auxiliary loss	P-Rate ( $\mathcal{P}$ )	O-Param ( $k$ )	Test error (%)
	Softmax	Multi-margin	0.01	1	25.72
		Focal	0.01	1	25.78
DenseNet-100	Baseline
	Softmax	Multi-margin	Focal
	22.71	24.33	22.68
	MILS
	Primary loss	Auxiliary loss	P-Rate ( $\mathcal{P}$ )	O-Param ( $k$ )	Test error (%)
	Softmax	Multi-margin	0.1	1	22.13
		Focal	0.1	1	22.25

For the CIFAR-100 dataset, we trained AlexNet, VGG19-bn, ResNet-50 (ResNet with a depth of 50), and DenseNet-100 (DenseNet with a depth of 100; DenseNet-100 is trained with a batch size of 64 for 300 epochs). The softmax loss is predefined as the primary loss function, and one of the multi-margin loss, focal loss, center loss, and L-GM loss is used as the auxiliary loss function. The auxiliary loss participates in the training process according to the participation rate $\mathcal{P}$ . The test errors are shown in Table 3. All MILS version loss functions outperform the baselines of the above models trained with the individual loss functions. This indicates that MILS is an effective framework for combining different loss functions to improve model performance with good generalization.

Tiny-ImageNet

The Tiny-ImageNet dataset is a subset of the ImageNet dataset with 200 classes. It contains images of 64 $\times$ 64 resolution, with 100k and 10k samples in the training and validation datasets, respectively. We use the validation dataset to evaluate the model performance because the labels of the test dataset are not publicly available. We trained ResNet-50 with a batch size of 128 for 164 epochs; the learning rate is initially set to 0.1 and then divided by 10 at the 81st epoch and the 122nd epoch. The weight decay is set to $1\times 10^{-4}$ .

Figure 5.

Pale lines denote the total number of parameter updates made by the auxiliary loss in an epoch, and dark lines denote the number of successful parameter updates made by the auxiliary loss in an epoch. The line with point markers denotes the test error (%) of the deep models trained with the MILS version of the loss (softmax loss as primary loss, multi-margin loss as auxiliary loss).

Table 4

The baselines and experimental results of the ResNet-50 model trained with various loss functions and MILS losses on the Tiny-ImageNet dataset

Baseline
Softmax	Multi-margin	Focal
48.92	99.76	48.67
MILS
Primary loss	Auxiliary loss	P-Rate ( $\mathcal{P}$ )	O-Param ( $k$ )	Test error (%)
Softmax	Multi-margin	0.01	1	47.23
	Focal	0.01	1	47.60

We train ResNet-50 to evaluate the effectiveness of MILS on the Tiny-ImageNet dataset. The input images (both training and test samples) are resized from 64 $\times$ 64 to 32 $\times$ 32. The softmax loss serves as the primary loss function, and one of the multi-margin loss, focal loss, and center loss serves as the auxiliary loss function. The baselines of softmax loss, multi-margin loss, and focal loss achieve 48.92%, 99.76%, and 48.67% test error on the validation set, respectively, whereas their MILS versions achieve lower test error than all three baselines, as shown in Table 4.

We also investigate the ratio of successful updates made by the auxiliary loss function to better understand the MILS training process. When we use only the multi-margin loss to train ResNet-50 on the Tiny-ImageNet dataset, training fails: the best training and test errors are 99.44% and 99.76%, respectively. However, if we combine the softmax loss with the multi-margin loss ( $\mathcal{P}=0.01$ , $k=1$ ) to train the same network, the test error reaches 46.93% in the best case. This indicates that the multi-margin loss makes a significant contribution, reducing the test error from 48.81% (the best test error obtained with softmax loss) to 46.93%. Figure 5 shows the number of parameter updates made by the multi-margin loss. The best test error of 46.93% is achieved at the 82nd training epoch. Notably, before the 82nd training epoch, the multi-margin loss made 313 parameter updates in total, of which only 28 were successful. This suggests that MILS has the potential to improve model performance at very little extra cost.
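The figures quoted above imply a small success ratio for the auxiliary updates. The following lines simply redo that arithmetic with the numbers reported in this section; the variable names are ours, and reading the two counts as "attempted" and "kept" updates is our interpretation.

```python
# Numbers reported in the text for ResNet-50 on Tiny-ImageNet (before epoch 82).
total_aux_updates = 313       # updates made by the multi-margin loss
successful_aux_updates = 28   # updates counted as successful

success_ratio = successful_aux_updates / total_aux_updates
error_reduction = 48.81 - 46.93  # softmax-only best error vs. MILS best error

print(f"successful update ratio: {success_ratio:.1%}")
print(f"test error reduction: {error_reduction:.2f} percentage points")
```

In other words, roughly 9% of the auxiliary updates were counted as successful, and they coincide with a reduction of 1.88 percentage points in test error.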

5.2 Effect of participation rate $\mathcal{P}$ and observation parameter $k$

Figure 6.

Batch training time as a function of $\mathcal{P}$ and $k$ .

Figure 7.

The dark line with point markers denotes the test error of the ResNet-50 model trained with multi-margin loss, and the pale line with point markers denotes the test error of the ResNet-50 model trained with MILS loss (softmax loss $+$ multi-margin loss). The short lines indicate that the multi-margin loss made an update in that epoch.

This section explores the extra computational cost associated with MILS. In every iteration, $\mathcal{P}$ denotes the probability that the model is trained by the auxiliary loss function. Using the CIFAR-100 dataset, we vary the participation rate $\mathcal{P}$ and the observation parameter $k$ and observe how the batch training time of AlexNet changes. A large $k$ may incur a high computational cost because the model performance must be checked twice in every iteration trained by the auxiliary loss function. Consequently, $k$ determines the computational cost of every iteration trained by the auxiliary loss function, while $\mathcal{P}$ determines the probability that an iteration is trained by the auxiliary loss function: the higher $\mathcal{P}$ is, the more cost may be incurred. To quantitatively evaluate the effect of $\mathcal{P}$ and $k$ on batch training time, we use MILS with softmax loss and multi-margin loss to train the AlexNet model. As shown in Fig. 6, the batch training time increases sharply from 11 ms to 260 ms when $\mathcal{P}$ and $k$ increase together. Fortunately, in most cases, MILS brings a clear performance improvement at very little cost when $\mathcal{P}=0.01$ and $k=1$ . For example, $\mathcal{P}=0$ and $k=0$ correspond to training AlexNet with softmax loss only, and the resulting batch training time (11 ms) is the same as that of the model trained with the MILS version of multi-margin loss with $\mathcal{P}=0.01$ and $k=1$ .
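The way $\mathcal{P}$ and $k$ jointly drive the overhead can be made concrete with a rough cost model. The sketch below is only an illustration under stated assumptions: it assumes that $k$ counts the observation batches evaluated in each check, that each auxiliary iteration runs two such checks (before and after the update), and that the cost grows linearly. The function names and the `eval_ms` parameter are hypothetical and do not come from the paper.

```python
def expected_extra_evaluations(p_rate: float, k: int) -> float:
    """Expected number of extra evaluation passes per training iteration.

    Assumption: an iteration is trained by the auxiliary loss with
    probability p_rate, and each such iteration evaluates the model twice
    on k observation batches.
    """
    if not 0.0 <= p_rate <= 1.0:
        raise ValueError("p_rate must lie in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    return p_rate * 2 * k


def expected_batch_time_ms(base_ms: float, eval_ms: float,
                           p_rate: float, k: int) -> float:
    """Rough expected time per iteration under a linear cost model.

    base_ms is the time of a plain training iteration; eval_ms is a
    hypothetical cost of one evaluation pass over one observation batch.
    """
    return base_ms + eval_ms * expected_extra_evaluations(p_rate, k)
```

With $\mathcal{P}=0.01$ and $k=1$ this model predicts only 0.02 extra evaluation passes per iteration on average, which is in line with the observation that the measured batch time is indistinguishable from softmax-only training, whereas large values of both parameters multiply the overhead.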

In fact, a high computational or cache cost does not mean that the model performance will always improve. We make two simple observations. First, if the high computational cost is mainly caused by a high $\mathcal{P}$ , too many gradients come from the auxiliary loss function, which may lead to harmful gradient variation and thus degrade the model performance. Second, if the high computational cost is mainly caused by a high $k$ , the number of parameter updates made by the auxiliary loss function may decrease, and the auxiliary loss function cannot introduce sufficient knowledge about its preferred features of the training data into the model.

5.3 Effect of scheduling strategy

We use a scheduling algorithm to prevent the auxiliary loss function from having a negative effect on model performance. To verify whether the scheduling mechanism plays a key role in MILS, we conducted the following experiments. Figure 7 shows that multi-margin loss alone fails to train ResNet-50 on the Tiny-ImageNet dataset. We also combine softmax and multi-margin loss to train this network without the evaluation operation: in each epoch, there is a 20% chance that a randomly selected batch is used to update the model parameters with the multi-margin loss. The results reveal that, without a scheduling mechanism, the model performance is seriously affected by the auxiliary loss function, even if the model is updated by multi-margin loss only a few times. We also trained ResNet-50 with MILS (softmax and multi-margin loss with the scheduling strategy, $\mathcal{P}=0.01$ and $k=1$ ) in 10 independent runs, and no collapse occurred during the whole training process.
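The difference between the two settings compared above can be sketched as a single training iteration with and without the performance check. This is a simplified illustration rather than the authors' algorithm: the function `train_iteration`, its arguments, the per-iteration sampling with `p_rate` in both settings, the strict-improvement acceptance rule, and the rollback on rejection are all assumptions made for the sketch.

```python
import random


def train_iteration(params, batch, observation, primary_step, auxiliary_step,
                    error, p_rate=0.01, scheduled=True, rng=random):
    """Run one iteration and return (new_params, tag).

    primary_step and auxiliary_step must return new parameters without
    modifying their input; error(params, observation) returns a value
    where lower is better.
    """
    # Most iterations are trained by the primary loss.
    if rng.random() >= p_rate:
        return primary_step(params, batch), "primary"

    candidate = auxiliary_step(params, batch)
    if not scheduled:
        # Ablation setting: accept the auxiliary update without any check.
        return candidate, "auxiliary_unchecked"

    # Scheduling: compare the error on the observation data before and after.
    if error(candidate, observation) < error(params, observation):
        return candidate, "auxiliary_accepted"
    # Assumed rollback: discard an auxiliary update that does not help.
    return params, "auxiliary_rejected"


# Toy usage with a scalar "model" whose ideal value is 0.0.
primary = lambda w, b: w - 0.5 * (w - b)   # moves halfway toward the target
harmful = lambda w, b: w + 5.0             # an auxiliary update that hurts
distance = lambda w, obs: abs(w - obs)

kept, tag = train_iteration(4.0, 0.0, 0.0, primary, harmful, distance, p_rate=1.0)
print(kept, tag)
```

In the toy run the harmful auxiliary update is rejected and the parameters stay unchanged, whereas calling the same function with `scheduled=False` would let the update through. This mirrors, in miniature, why a handful of unchecked auxiliary updates can damage training.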

6. Conclusion and discussion

In this work, we introduce MILS to combine multiple loss functions in order to improve model performance. As far as we know, we are the first to schedule multiple loss functions independently and alternately during training according to their performance. We demonstrate on various deep models and datasets that MILS is a new and more effective way of using multiple loss functions than joint loss functions. A traditional multiple loss (joint loss) can be regarded as forming a new single loss through fixed linear weighting in space, whereas MILS lets multiple independent losses interact over the training time series. Compared with traditional joint loss functions, their MILS versions offer better flexibility and performance with very little extra overhead.

Our experiments show that, with only a 1% participation rate for the auxiliary loss function and a single training batch of images as the observation dataset, MILS can still improve accuracy. Extensive experiments using various deep models on various recognition benchmarks demonstrate that our scheme is simple, robust, lightweight, and effective for typical classification tasks.

However, there are still some limitations in this work. This paper only evaluates the practicality of MILS on typical object recognition benchmarks. In the future, we will test the effectiveness of MILS on object recognition benchmarks with factors such as noise and imbalance.

Footnotes

Acknowledgments

This work was supported by the Medico-Engineering Cooperation Funds from University of Electronic Science and Technology of China (No. ZYGX2021YGLH213, No. ZYGX2022YGRH016), the Municipal Government of Quzhou (Grant 2021D007, Grant 2021D008, Grant 2021D015, Grant 2021D018, Grant 2022D018, Grant 2022D029), as well as the Zhejiang Provincial Natural Science Foundation of China under Grant No. LGF22G010009.



1 http://tiny-imagenet.herokuapp.com/.
