Abstract
Despite the popularity of deep convolutional neural networks (CNNs), training network models efficiently remains challenging. In this paper, we present a layer-wise learning based stochastic gradient descent method (LLb-SGD) for gradient-based optimization of objective functions in deep learning, which is simple and computationally efficient. By simulating the cross-media propagation mechanism of light in the natural environment, we set an adaptive learning rate for each layer of a neural network. To find a proper local optimum quickly, the dynamic learning sequence spanning different layers adaptively adjusts the descent speed of the objective function in a multi-scale and multi-dimensional environment. To the best of our knowledge, this is the first attempt to introduce an adaptive layer-wise learning schedule with a certain degree of convergence guarantee. Due to its generality and robustness, the method is insensitive to hyper-parameters and can therefore be applied to various network architectures and datasets. Finally, we show promising results compared to other optimization methods on two image classification benchmarks using six standard networks.
Introduction
Artificial neural networks [1, 62] have been successfully applied in various fields, such as financial analysis [3], intelligent traffic navigation [4], computer-aided diagnosis [5, 6], and production automation [7]. Deep CNNs [8, 9] are a variant that introduces convolution operations and have achieved remarkable success in computer vision applications such as object classification [10] and instance segmentation [11], even surpassing human performance [12, 70]. At present, a very deep CNN can even be made up of more than 1 000 layers through special designs [13].
Despite their popularity, the efficient training of deep CNNs remains challenging due to several problems. These include overfitting [14, 64], vanishing and exploding gradients [15], saddle points [16], and slow convergence [17, 66]. Many approaches to these problems have been proposed. Different activation functions such as Leaky-ReLU [18] and Swish [19] are designed to help the propagation of gradient flow; some complex layers [20–22] are added to enhance the network structure; better initialization positions [23, 24] in the parameter space have also been explored. However, there are only a few studies on the stochastic learning of over-parameterized and highly non-convex CNNs.
So far, stochastic gradient descent (SGD) based optimization is still the most commonly used technique for training deep CNNs. Many problems in deep learning can be viewed as the maximization or minimization of a scalar parameterized objective function (e.g., a loss function) with respect to the network parameters. For a convex objective function, SGD can ensure that the network converges to the global minimum, whereas for a non-convex one it usually converges to a local minimum.
Specifically, let
In particular, if the batch size M is large, then the gradients in a mini-batch set have a relatively large variance which slows down the network convergence. Consider the case that each loss
In fact, SGD has been shown empirically to be an efficient optimization method that plays a central role in many successful practical applications [33, 69].
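As a minimal illustration of the mini-batch SGD update discussed above, the following sketch minimizes a least-squares objective with NumPy. The toy problem, the function name `minibatch_sgd`, and all constants are assumptions chosen for illustration and are not part of the experiments in this paper.

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.1, batch_size=16, epochs=50, seed=0):
    """Minimize 0.5 * mean((X @ w - y) ** 2) with mini-batch SGD."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)                  # reshuffle the samples every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            grad = X[idx].T @ residual / len(idx)   # average gradient over the mini-batch
            w -= lr * grad                          # one fixed, global learning rate
    return w

# Toy data (illustrative only): noiseless linear targets.
rng = np.random.default_rng(1)
X = rng.normal(size=(256, 3))
w_true = np.array([1.0, -2.0, 0.5])
w_hat = minibatch_sgd(X, X @ w_true)
```

Here a single learning rate `lr` is shared by every parameter, which is the setting that the layer-wise scheme proposed in this paper relaxes.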
SGD based optimization methods [26, 27], while increasingly popular, are usually used as black-box optimizers, as practical explanations of their working process and properties are hard to come by. Moreover, naive SGD cannot guarantee convergence to a good position, and it poses several challenges that need to be solved. First, choosing a suitable learning rate can be difficult. A learning rate that is too small leads to slow convergence, while one that is too large hinders convergence, causing the loss to fluctuate or even diverge from the minimum. Second, it may be harmful to adopt a single fixed learning rate for updating all the parameters. Training data in high-dimensional space are usually sparse, so features occur with different frequencies, and a fixed rate cannot adapt to the representation of features at different scales or dimensions. Third, some optimization strategies with dynamic learning rates have been applied to the whole training process, e.g., annealing [28]. However, these schedules or thresholds have to be configured in advance and therefore cannot adapt to different networks or datasets.
Contributions
In this paper, we propose the layer-wise learning based stochastic gradient descent (LLb-SGD) method for accelerating and improving the optimization of deep CNNs. By simulating the cross-media propagation mechanism of light in the natural environment, i.e., light has different propagation speeds in different media, we set an adaptive learning rate for each layer of a neural network. To find a proper local optimum quickly, the dynamic learning sequence spanning the layers adaptively adjusts the descent speed of the objective function in a multi-scale and multi-dimensional environment. It is calculated on a per-layer basis using first-order information and requires only a trivial amount of extra computation per iteration over gradient descent. Intuitively, LLb-SGD can escape from saddle points [31] and sharp local minima [32], which generalize poorly to unseen data. The descent method based on the layer-wise learning rate sequence results in a large variance, which enhances the impact of batch noise in these cases and helps the model escape from sharp local minima and saddle points. Moreover, our approach can dynamically incorporate distribution information observed in earlier training iterations to perform more informative gradient-based learning. To the best of our knowledge, this is the first work to introduce an adaptive layer-wise learning schedule with a convergence guarantee and to verify it experimentally on two image classification benchmarks and six standard CNN architectures (ResNet-34/50 [13] for CIFAR-10 [29]; AlexNet [20], VGG-16 [8], GoogLeNet [9] and ResNet-152 for ImageNet [30]).
In summary, the contributions and benefits of this approach are as follows. LLb-SGD introduces adaptive and dynamic learning rates for each layer in deep CNNs. It is suitable for sparse gradients and naturally performs step annealing. The sequence of learning rates is insensitive to hyper-parameters (e.g., parameter initialization, batch size and kernel size) during training iterations. Due to its generality, our method can be applied to various network architectures and datasets. We analyze the theoretical convergence properties of LLb-SGD and provide a regret bound on the convergence rate. We show empirically that it is not necessary to set a separate learning rate for each parameter in order to ensure network convergence. Experimental results demonstrate that LLb-SGD accelerates and improves network convergence and outperforms state-of-the-art methods.
Organization
The remainder of the paper is organized as follows. Section 2 gives a brief review of related work on optimizing deep neural networks. In Section 3, we introduce the proposed LLb-SGD method and the update rule of the network parameters. Section 4 provides a theoretical analysis of the convergence conditions of LLb-SGD in non-convex optimization. Experimental results and comparisons are presented in Section 5. Finally, we conclude our work and discuss future directions in Section 6.
Related work
In view of the theoretical incompleteness and known problems of SGD, many analytical modifications and improvements [55–60] have been proposed.
To address the difficulty SGD has in escaping from steep local minima, an additional momentum term [36] was added to help the network cross bad positions. The Nesterov-accelerated gradient (NAG) descent method [37] further introduced an approximate gradient of the next position to predict the future direction, rather than blindly descending along the slope. This predictive update prevents the objective from falling too quickly and enhances the responsiveness of the algorithm. Adagrad [38] proposed adapting the learning rates to the parameters, i.e., adopting a larger learning rate for features with lower frequency and a smaller learning rate for more frequent features. Therefore, it is suitable for sparse data and greatly improves the robustness of SGD, and it has been successfully applied to the training of GoogLeNet by Dean et al. [39]. However, the accumulated squared gradient in Adagrad leads to a vanishingly small learning rate, and ultimately no additional information can be extracted. Adadelta [40] alleviated the monotonically decreasing learning rate of Adagrad by limiting the window over which historical gradients are accumulated to a fixed size. The learning rate no longer needs to be set manually during the iteration, since it has been removed from the parameter update rule. In practical applications, however, the objective function is prone to diverging during training iterations. RMSProp [41] divided the learning rate for a weight by a running average of recent gradient magnitudes for that weight.
Similarly, Adam [42] adaptively calculated the learning rate of each parameter according to the first and second moments of the respective gradients, which is appropriate for non-stationary objectives and for problems with noisy or sparse gradients. Kingma et al. [21] pointed out that as gradients become increasingly sparse in the later stage of training, bias correction helps Adam slightly outperform RMSProp. The authors also discussed Adamax, a variant of Adam that extends the L2-norm in the parameter update rule to the L∞-norm, which provides a simpler bound on the magnitude of the parameter updates. Nesterov-accelerated Adam (Nadam) [43] modified the momentum term in Adam by drawing on the insights from NAG. Dauphin et al. put forward another adaptive learning method, ESGD [44], based on the equilibration preconditioner, which explored how negative eigenvalues of the Hessian help design more appropriate learning rate schemes.
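To make the differences between the methods reviewed above concrete, the sketch below writes one update step of momentum SGD, Adagrad, RMSProp and Adam in their commonly used textbook forms. The function names, the `state` dictionaries and the default constants are illustrative assumptions, not code from the cited works.

```python
import numpy as np

def momentum_step(w, g, state, lr=0.01, mu=0.9):
    """SGD with momentum: accumulate a velocity and move along it."""
    state["v"] = mu * state.get("v", np.zeros_like(w)) - lr * g
    return w + state["v"]

def adagrad_step(w, g, state, lr=0.01, eps=1e-8):
    """Adagrad: scale by the root of the accumulated squared gradients."""
    state["G"] = state.get("G", np.zeros_like(w)) + g ** 2
    return w - lr * g / (np.sqrt(state["G"]) + eps)

def rmsprop_step(w, g, state, lr=0.01, rho=0.9, eps=1e-8):
    """RMSProp: running average of squared gradients instead of the full sum."""
    state["E"] = rho * state.get("E", np.zeros_like(w)) + (1 - rho) * g ** 2
    return w - lr * g / (np.sqrt(state["E"]) + eps)

def adam_step(w, g, state, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: bias-corrected estimates of the first and second moments."""
    t = state.get("t", 0) + 1
    m = b1 * state.get("m", np.zeros_like(w)) + (1 - b1) * g
    v = b2 * state.get("v", np.zeros_like(w)) + (1 - b2) * g ** 2
    state.update(t=t, m=m, v=v)
    m_hat = m / (1 - b1 ** t)      # bias correction of the first moment
    v_hat = v / (1 - b2 ** t)      # bias correction of the second moment
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)

# One step from the same starting point with each rule (toy values).
w0 = np.array([1.0, -1.0])
g0 = np.array([0.5, -2.0])
w_momentum = momentum_step(w0, g0, {})
w_adagrad = adagrad_step(w0, g0, {})
w_rmsprop = rmsprop_step(w0, g0, {})
w_adam = adam_step(w0, g0, {})
```

All four rules keep per-parameter state of the same size as the weights, whereas a layer-wise scheme only needs one scalar rate per layer.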
There are also computationally expensive higher-order algorithms, such as L-BFGS [45] and the Quasi-Newton method [46], which are not suitable for high-dimensional data in practice and are therefore not discussed here. On the other hand, when large amounts of data and cheap computing power are available, distributed computation is a feasible way to accelerate network training. To this end, many asynchronous or parallel SGD variants have been developed, such as Downpour SGD [47] and Elastic Averaging SGD [48]. Although asynchronous execution is fast, non-ideal communication between clients can lead to poor convergence.
Many other strategies have also been developed to help network training, including curriculum learning (i.e., adjusting the input order of training samples) [49], model structure improvements (e.g., skip connections [13] and batch normalization [20]), and noisy learning (random labeling [51] and corrupted data [52]). These methods do not conflict with our optimization method and can be combined with it to achieve better results.
In fact, the efficient training of neural networks is highly dependent on network architecture, optimizer selection, initialization positions and a variety of other considerations. Unfortunately, the impact of these choices on the loss landscape of the underlying objective function is unclear. Researchers hope to analyze and understand the convergence process of a network by observing the descent trajectory of the objective function, so many methods for visualizing the iterative process have been studied [32, 54]. Because the high-dimensional parameter space of an over-parameterized deep learning model is too complex to depict directly, researchers generally move along a random direction from the convergent position to depict the loss landscape. Therefore, how to find a meaningful direction and accurately describe the gradient descent process is an active research topic, which makes it possible to relate the non-convex structure of the objective function to the trainability of deep neural networks.
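The random-direction visualization mentioned above can be sketched as a one-dimensional slice of the loss around a converged parameter vector. The helper `loss_along_direction` and the quadratic toy loss are hypothetical stand-ins used only to show the idea.

```python
import numpy as np

def loss_along_direction(loss_fn, theta_star, alphas, seed=0):
    """Evaluate loss_fn on the line theta_star + alpha * d for a random unit direction d."""
    rng = np.random.default_rng(seed)
    d = rng.normal(size=theta_star.shape)
    d /= np.linalg.norm(d)                      # unit length, so alpha is a distance
    return np.array([loss_fn(theta_star + a * d) for a in alphas])

# Hypothetical toy loss with its minimum at the origin (illustrative only).
def toy_loss(theta):
    return float(np.sum(theta ** 2))

alphas = np.linspace(-1.0, 1.0, 5)
profile = loss_along_direction(toy_loss, np.zeros(10), alphas)
```

For this toy loss the profile is simply `alphas ** 2`; for a real network, `loss_fn` would evaluate the training loss at the perturbed weights.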
LLb-SGD method
Motivation and inspiration
A fixed learning rate is not conducive to training the network in later iterations, and it is expensive to maintain a separate learning rate for each parameter. In nature, light propagates at different speeds in media of different densities due to frequency variations. Similarly, each layer in the neural network can be viewed as a medium of a different scale that transfers gradients in the back propagation phase, as shown in Fig. 1. According to many visualization results [43, 54], the features extracted by convolutional kernels in the same layer are at the same semantic level. The earlier layers extract low-level features such as colors and edges, while the features of the later layers become more advanced and abstract. Therefore, inspired by the propagation mechanism of light in various media, we consider setting a separate learning rate for each layer of the neural network instead of each parameter, and the learning rate for each layer can be determined by all the parameters of that layer in a deep CNN.

Fig. 1. Each layer in the neural network can be viewed as a medium of different scales to transfer gradients in the back propagation phase.
Given a deep CNN model

Fig. 2. The entire training and testing diagram of deep CNN.
The specific learning rate update rule is described in this part. Because the convolutional (conv) layer and the fully connected (fc) layer have different connection patterns, their respective learning rates are calculated separately, each in a simple form. During the whole network training process, only first-order information, i.e., the L1-norm of the parameter variation, needs to be additionally calculated.
Convolutional layer
In the convolutional layers of a CNN, the weights of each convolution kernel are shared across all the input positions to which its neurons connect, which means that each kernel focuses on only one type of feature. During back propagation in the t-th training iteration, the average Frobenius-norm of the variation of all convolution kernels in the k-th layer can be calculated by
On the other hand, we decide the descent direction by considering the current value of the training loss and its change direction, as given by
Finally, the learning rate for the k-th layer at the t-th training iteration is reduced in a negative exponential form, which can be formulated by
So far, the specific decay strategy of learning rate sequence for convolutional layers in deep CNN has been fully introduced.
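The quantities described above can be sketched in NumPy. Only the first function follows the text directly (the average Frobenius norm of the kernel variation in one layer). The second is an assumed, generic negative exponential decay and is not the paper's Equation (10); how the norm and the descent direction δ enter the exponent is defined by the paper's equations, which this sketch does not reproduce. Kernels are simplified to shape (num_kernels, height, width).

```python
import numpy as np

def mean_kernel_change(prev_kernels, new_kernels):
    """Average Frobenius norm of the per-kernel change in one conv layer.

    Both arrays have shape (num_kernels, height, width)."""
    diff = new_kernels - prev_kernels
    per_kernel = np.sqrt((diff ** 2).sum(axis=(1, 2)))  # Frobenius norm of each kernel's change
    return float(per_kernel.mean())

def decayed_rate(prev_rate, decay_term):
    """ASSUMPTION: generic negative exponential decay, NOT the paper's Equation (10)."""
    return prev_rate * float(np.exp(-decay_term))

# Hypothetical layer with two 2x2 kernels whose entries each changed by 0.1.
prev = np.zeros((2, 2, 2))
new = np.full((2, 2, 2), 0.1)
change = mean_kernel_change(prev, new)           # sqrt(4 * 0.01) = 0.2 for each kernel
rate = decayed_rate(0.1, decay_term=change)      # stand-in: the norm is used as the decay term
```

The point of the sketch is the cost profile: one norm per layer and one scalar update per layer, which is much cheaper than maintaining per-parameter statistics.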
For the fully connected layers in a deep CNN, each neuron is connected by a single weight and therefore the calculation of the average Frobenius-norm in the iterative process can be simplified to
It can be clearly seen that the overall gradient of the previous iteration is also taken into account here, which helps the optimization while using only first-order information. Therefore, only a small amount of extra computation is introduced.
The decay term ξ of the learning rate and the descent direction δ of fully connected layers are consistent with those of convolutional layers. Then the learning rate for fully connected layers can be updated according to Equation (10). Finally, we can obtain the learning rate sequence
Many advanced networks have replaced the fully connected layer with a global average pooling operation [9, 13], because the excessive parameters introduced by the fully connected layer impair network generalization on unseen data. However, Zhang et al. [50] found that the fully connected layer can act as a “firewall” in the transfer of network representation capacity by providing a large capacity of network architecture, especially if the source domain and the target domain differ greatly. In fact, it is still meaningful to discuss the influence of the learning rate decay strategy on the optimization of fully connected layers, because the sufficient optimization of an over-parameterized non-convex model is the basis of good generalization.
Each update of the model parameters in the training process requires M forward propagation steps and one back propagation step.
In the forward propagation process, sample images
Then the final output (prediction result) of deep CNN is
In the end, the training loss of a sample image
In the back propagation process, parameters (i.e., weights and biases) in the k-th layer at the t-th training iteration can be updated by
Through continuous alternate iterations of the above two steps, the convergent network
So far, the update rule of parameters in the training process of the deep CNN model has been fully described.
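A per-layer parameter update of the kind described above can be sketched as follows, assuming a plain gradient step in which every layer uses its own learning rate. The layer names, shapes and rates are hypothetical, and the gradients are taken as given rather than computed by back propagation.

```python
import numpy as np

def layerwise_sgd_step(params, grads, rates):
    """Apply theta_k <- theta_k - alpha_k * grad_k with one learning rate per layer.

    params, grads: dict mapping layer name -> ndarray
    rates: dict mapping layer name -> float"""
    return {name: params[name] - rates[name] * grads[name] for name in params}

# Hypothetical two-layer model (illustrative shapes and values).
params = {"conv1": np.ones((2, 3, 3)), "fc": np.ones(4)}
grads = {"conv1": np.full((2, 3, 3), 0.5), "fc": np.full(4, 2.0)}
rates = {"conv1": 0.1, "fc": 0.01}           # assumed per-layer learning rates
updated = layerwise_sgd_step(params, grads, rates)
```

In a full training loop, `rates` would be refreshed every iteration by the layer-wise rule before this step is applied.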
The pseudo-code of LLb-SGD, shown in Algorithm 1, is quite straightforward.
In the evaluation phase, the final training loss is used to observe the effectiveness of LLb-SGD, and the convergent CNN model is tested on the test set to evaluate its actual classification performance.
We discuss the convergence of deep CNNs under LLb-SGD using the online learning framework [27].
Given a sequence of convex objective functions $\{\Gamma_1(\theta_1), \ldots, \Gamma_T(\theta_T)\}$ in the training process of deep CNN, our goal is to update network parameters to minimize the objective function $\Gamma_t(\theta_t)$ as much as possible at the t-th iteration. The regret $R_T$ in the network optimization is defined as the sum of the differences between all the previous objectives $\Gamma_t(\theta_t)$ and the convergent one $\Gamma_T(\theta_T)$, and is calculated by
Then we can obtain the regret bound of LLb-SGD, as given by
Proof. First, by using Lipschitz assumption in [31], we can easily get (
So,
Then we attempt to bound the regret without playing action θT at t-th iteration. By induction and recursion,
From the learning rate update rules given in Equation (10) and the classic Robbins-Monro condition [67, 68] during the training process as follows
We can further obtain the following inequality:
By plugging this into Equation (25), we can obtain the regret bound shown in Equation (22) with a complexity of O (log N / T).
Here we only discuss Euclidean geometry without considering the gradient descent on a non-Euclidean geometry. Moreover, the loss function is usually in a non-Markovian state in practice, depending not only on the nearest vector, but also on the previous vectors.
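Reading the verbal definition of the regret in this section literally, it can be computed from a recorded loss sequence as in the sketch below. The loss values are hypothetical, and the function is an illustration of the definition rather than part of the proof.

```python
def regret(losses):
    """Sum of the differences between each iterate's loss and the final (convergent) loss."""
    final = losses[-1]
    return sum(value - final for value in losses)

losses = [2.0, 1.0, 0.5, 0.25]       # hypothetical values of Gamma_t(theta_t)
total = regret(losses)               # (2 - 0.25) + (1 - 0.25) + (0.5 - 0.25) + 0
average = total / len(losses)        # average regret R_T / T
```

A bound on the average regret that shrinks as T grows is what the convergence statement above expresses.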

Fig. 3. Details of network architecture for image classification.

Fig. 4. Some examples of images in CIFAR-10 (first row) and ImageNet (second row).
Experimental setup
Baseline CNNs and benchmark datasets
We apply the LLb-SGD method to six classical CNN structures: ResNet-34/50 for CIFAR-10, and AlexNet, VGG-16, GoogLeNet and ResNet-152 for ImageNet. AlexNet and VGG-16 are typical deep CNN models composed of convolutional layers, down-sampling layers and fully connected layers. ResNet and GoogLeNet add residual blocks and Inception modules, respectively, and replace the fully connected layers with global average pooling layers. Details of the above deep CNN architectures are shown in Fig. 3.
The two benchmark datasets, CIFAR-10 and ImageNet, represent coarse-grained and fine-grained classification tasks respectively, and are used to evaluate the effectiveness of optimization methods at different levels of difficulty. ImageNet is a large-scale dataset, which contains 1 000 classes of 1.2 million natural images for training and 50 000 images for testing. CIFAR-10 is a popular benchmark for small-scale classification, which contains 10 classes of 50 000 images for training and 10 000 images for testing. Some examples of images in the two datasets are shown in Fig. 4.
Table 1. Hyper-parameters setting in the training of deep CNN on two datasets
Table 2. Final training and test loss of six deep CNNs on CIFAR-10 and ImageNet

Fig. 5. Training curves of various deep CNNs on CIFAR-10 and ImageNet.
Before training, the network hyper-parameters, including the initial learning rate, batch size, dropout rate, weight decay, and threshold, are set separately for the two datasets, as shown in Table 1. All training and testing of the deep CNNs is carried out under the TensorFlow deep learning framework [53] on a workstation with an Intel Core i9-7980XE CPU, two NVIDIA GeForce GTX Titan XP GPUs, 2×16 gigabytes of memory, and 2 terabytes of storage. To make a fair comparison with state-of-the-art methods, no data augmentation techniques are used during the training process.
Performance comparison
To confirm the theoretical results and insights, we compare the LLb-SGD method with naive SGD, path-SGD, NAG, Adagrad, Adam, Adadelta, RMSProp, and ESGD. Five-fold cross-validation results are used as the evaluation criteria for the various methods. The final training and test losses are summarized in Table 2. Training curves of the six deep CNNs using the above optimization methods are shown in Fig. 5.
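The five-fold protocol mentioned above can be sketched with a small index splitter. How the folds were actually drawn in our experiments is not specified here, so the shuffling, the seed and the helper name `k_fold_indices` are assumptions for illustration.

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)   # k nearly equal folds
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, folds[i]

# Example: 103 samples split into 5 folds; each sample is validated exactly once.
splits = list(k_fold_indices(103, k=5))
fold_sizes = [len(val) for _, val in splits]
```

Each optimizer would then be trained on `train` and scored on `val` for every fold, and the losses averaged over the five folds.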
Table 3. Training and test loss of six deep CNNs on CIFAR-10 and ImageNet under different batch sizes
Table 4. Training and test loss of six deep CNNs on CIFAR-10 and ImageNet under different initializations
Experimental results of the two CNNs for CIFAR-10 show that the proposed LLb-SGD method converges approximately four times faster than naive SGD. Furthermore, it significantly outperforms NAG, path-SGD, and ESGD. The difference in the descent of the training loss becomes particularly obvious after 20 epochs. It is worth noting that Adam and Adadelta eventually converge considerably faster and better than Adagrad, which shows that annealing of the learning rate contributes to the convergence of the network in the later iterations.
LLb-SGD also performs best on the ImageNet dataset, although the difference in the rate of decline of the training loss relative to other methods is not as significant as for CIFAR-10. It achieves better optimization results and improves the quality of the learned networks. Compared with the state-of-the-art training and test loss, our method achieves a reduction of 0.027 and 0.011, respectively. A possible explanation is that optimization by LLb-SGD introduces some implicit regularization. A gradually annealed learning rate helps deep CNNs to find simpler architectures in parameter space. On the other hand, the advantages of layer-wise learning are most evident in deep CNNs with more layers, as shown empirically by the larger improvements obtained for ResNet than for AlexNet and VGG-16.
Impact of batch size
We also empirically evaluate the impact of batch size on the optimization of deep CNNs. The performance of the six deep CNNs under different batch size settings is shown in Table 3. We find that a reasonable batch size lies between 16 and 256 (both inclusive). The training loss does not change significantly with the batch size, which means that network optimization through LLb-SGD is robust to the choice of batch size. On the other hand, the larger the batch size is, the better the network performs on the test set. We believe that the small gradient noise and large variance caused by a large batch size play an implicit regularization role, which helps the network generalize well to the test data.
Impact of network initialization
Many studies [23, 24] have shown that network initialization has an important impact on optimization. A good initial position can avoid bad minima and saddle points. Therefore, many computer vision tasks use networks pre-trained on ImageNet. Here, we illustrate the ability of LLb-SGD to alleviate ill-conditioning problems by comparing the performance of different initialization methods, including uniform distribution, normal distribution, Xavier [23], and MSRA [24]. Experimental results are shown in Table 4.
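The four initialization schemes compared above can be sketched in NumPy. The Xavier and MSRA branches use their standard published scaling rules; the range of the plain uniform initializer and the standard deviation of the plain normal initializer are assumed values, since the settings used in our experiments are not listed here.

```python
import numpy as np

def init_weights(fan_in, fan_out, method, rng):
    """Draw a (fan_in, fan_out) weight matrix with one of four initialization schemes."""
    shape = (fan_in, fan_out)
    if method == "uniform":
        return rng.uniform(-0.05, 0.05, size=shape)       # assumed range
    if method == "normal":
        return rng.normal(0.0, 0.01, size=shape)          # assumed standard deviation
    if method == "xavier":
        limit = np.sqrt(6.0 / (fan_in + fan_out))         # Xavier (Glorot) uniform bound
        return rng.uniform(-limit, limit, size=shape)
    if method == "msra":
        std = np.sqrt(2.0 / fan_in)                       # MSRA (He) normal scale
        return rng.normal(0.0, std, size=shape)
    raise ValueError(f"unknown method: {method}")

rng = np.random.default_rng(0)
w_xavier = init_weights(256, 128, "xavier", rng)
w_msra = init_weights(256, 128, "msra", rng)
```

Xavier and MSRA scale the weights by the layer width, whereas the plain uniform and normal schemes use the same spread for every layer.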
The training and test losses of Xavier and MSRA are not always better than those of the uniform and normal distributions, which indicates that LLb-SGD based optimization does not depend significantly on the initial position, and the network tends to converge to a desirable local minimum. The separate learning rate at each layer ensures that the bottom layers of the network can be fully optimized, even if the gradient flow almost vanishes (especially in AlexNet, VGG-16 and GoogLeNet). Moreover, layer-wise learning reduces the dependence between parameters, which promotes sparsity and makes deep CNNs easier to optimize.
Visualization of training iterations
Adjustment of learning rate
We compare the variation of the learning rate in LLb-SGD with two baseline schedules during training: 1/t decay ($\alpha_t = \alpha_0 / (1 + kt)$) and exponential decay ($\alpha_t = \alpha_0 e^{-kt}$), in which k is set to 0.9 in both cases. The mean of the learning rates of all layers in different deep CNNs with LLb-SGD is used for comparison and the results are shown in Fig. 6. It can be seen that the learning rates of LLb-SGD initially drop rapidly and then decrease more slowly. They are larger than those of the two baseline methods when the number of iterations on CIFAR-10 and ImageNet is greater than $10^4$ and $10^5$, respectively. These are typically high learning rates that would lead to divergence in most methods, but this only occurs near the end of the training when gradients are small. In other words, moderately large learning rates are helpful to the convergence of the model in the later stage.
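The two baseline schedules can be written down directly from their formulas. In the sketch below, k = 0.9 follows the text, while the initial rate `alpha0` and the unit of the step counter `t` are assumptions made for illustration.

```python
import numpy as np

def inverse_time_decay(alpha0, k, t):
    """1/t decay: alpha_t = alpha0 / (1 + k * t)."""
    return alpha0 / (1.0 + k * t)

def exponential_decay(alpha0, k, t):
    """Exponential decay: alpha_t = alpha0 * exp(-k * t)."""
    return alpha0 * np.exp(-k * t)

t = np.arange(0, 5)
alpha0, k = 0.1, 0.9                       # k as in the text; alpha0 is an assumed value
inv_rates = inverse_time_decay(alpha0, k, t)
exp_rates = exponential_decay(alpha0, k, t)
```

For every t > 0 the exponential schedule lies below the 1/t schedule, since exp(kt) > 1 + kt, so both baselines shrink the rate quickly compared with a schedule that keeps moderately large rates late in training.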

Fig. 6. Adjustment of learning rate during optimization process of CIFAR-10 and ImageNet.
Then we examine the robustness of multiple deep CNNs trained by LLb-SGD to the initial learning rate, in terms of the number of training iterations and the training and test loss, as shown in Fig. 7. The initial learning rate is set from 0.001 to 0.1. Training is stopped when the decrease is no more than 0.1 after 1 000 iterations. A larger initial learning rate usually leads to fewer training iterations, that is, the network converges faster. Moreover, the initial learning rate has little impact on the final training loss and test loss, which demonstrates the invariance of the scheme to the initial learning rate.
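The stopping rule above can be sketched as a check on the recorded loss history. The sketch assumes that the monitored quantity is the training loss and that the decrease is measured between the current value and the value 1 000 iterations earlier; the helper name `should_stop` is hypothetical.

```python
def should_stop(loss_history, window=1000, min_decrease=0.1):
    """Return True once the loss dropped by at most min_decrease over the last `window` iterations."""
    if len(loss_history) <= window:
        return False                                   # not enough history yet
    return loss_history[-window - 1] - loss_history[-1] <= min_decrease

flat = [1.0] * 1001                                    # no progress over 1 000 iterations
falling = [2.0 - 0.002 * i for i in range(1001)]       # loss still dropping by 2.0
stop_flat = should_stop(flat)
stop_falling = should_stop(falling)
```

With these toy histories the flat sequence triggers the stop and the falling one does not.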

Fig. 7. The performance of deep CNNs trained by LLb-SGD under various initial learning rates.
We consider the performance of the optimization algorithms in two ill-conditioned cases (both including saddle points) by observing the descent trajectory of the loss function of a small network, as shown in Fig. 8. Two directions along the principal components of the weight matrix are used to observe the optimization trajectory in two-dimensional space. The parameters near the optimal value are calculated as

Fig. 8. Descent trajectory of various optimization methods in two ill-conditioning cases.
It can be clearly seen that naive SGD, NAG, and Adagrad are easily trapped in regions with a saddle point and cannot escape from them. Adam and LLb-SGD can bypass the saddle point and move toward a local minimum. It is worth noting that LLb-SGD converges faster than Adam, which means that it is not necessary to set an individual learning rate for each parameter and that the layer-wise learning method can be more efficient.
In this paper, we propose a simple and computationally efficient method for gradient-based optimization of objective functions in deep learning, which is able to accelerate the training process and improve the practical performance of learned models. To the best of our knowledge, this is the first attempt to introduce an adaptive layer-wise learning schedule with a certain degree of convergence guarantee. Due to its generality and robustness, the method is insensitive to hyper-parameters and can therefore be applied to various network architectures and datasets. Extensive experiments on two image classification benchmarks (CIFAR-10 and ImageNet) and six standard CNNs (AlexNet, VGG-16, GoogLeNet, and ResNet-34/50/152) demonstrate its effectiveness empirically.
In the future, we plan to apply the proposed method to more deep learning based applications, such as object detection and instance segmentation.
Acknowledgment
This work was supported by the National Key R&D Program of China (Grant No. 2018YFC0831503), the National Natural Science Foundation of China (Grant No. 61571275), and the Fundamental Research Funds of Shandong University (Grant No. 2018JC040).
