Layer-wise learning based stochastic gradient descent method for the optimization of deep convolutional neural network

Abstract

Despite the popularity of deep convolutional neural networks (CNNs), the efficient training of network models remains challenging due to several problems. In this paper, we present a layer-wise learning based stochastic gradient descent method (LLb-SGD) for gradient-based optimization of objective functions in deep learning, which is simple and computationally efficient. By simulating the cross-media propagation mechanism of light in the natural environment, we set an adaptive learning rate for each layer of a neural network. In order to find a proper local optimum quickly, the dynamic learning sequence spanning different layers adaptively adjusts the descending speed of the objective function in a multi-scale and multi-dimensional environment. To the best of our knowledge, this is the first attempt to introduce an adaptive layer-wise learning schedule with a certain degree of convergence guarantee. Due to its generality and robustness, the method is insensitive to hyper-parameters and can therefore be applied to various network architectures and datasets. Finally, we show promising results compared to other optimization methods on two image classification benchmarks using six standard networks.

Keywords

Deep learning, deep CNNs, non-convex optimization, SGD, layer-wise learning

1 Introduction

Artificial neural networks [1, 62] have been developed and successfully applied in various fields, such as financial analysis [3], intelligent traffic navigation [4], computer-aided diagnosis [5, 6], and production automation [7]. Deep CNNs [8, 9] are a variant that introduces convolution operations and have achieved remarkable success in computer vision applications such as object classification [10] and instance segmentation [11], even surpassing human performance [12, 70]. At present, a very deep CNN can be made up of more than 1,000 layers through special designs [13].

Despite their popularity, the efficient training of deep CNNs remains challenging due to several problems. These include overfitting [14, 64], vanishing and exploding gradients [15], saddle points [16], and slow convergence [17, 66]. A large number of approaches to these problems have been proposed. Different activation functions such as Leaky-ReLU [18] and Swish [19] are designed to help the propagation of gradient flow; some complex layers [20–22] are added to enhance the network structure; and better initialization positions [23, 24] in the parameter space have also been explored. However, there are only a few studies on the stochastic learning of over-parameterized and highly non-convex CNNs.

So far, stochastic gradient descent (SGD) based optimization is still the most commonly used technique for training deep CNNs. Many problems in deep learning can be viewed as the maximization or minimization of some scalar parameterized objective function (e.g., a loss function) with respect to the network parameters. For a convex objective function, SGD can ensure that the network converges to the global minimum, whereas for a non-convex one it usually converges to a local minimum.

Specifically, let $\{L_{1}, L_{2}, \ldots, L_{N}\}$ be a sequence of functions from $ℝ^{d}$ to $ℝ$. The goal of network $M : f (\cdot)$ is to find an approximate solution of the following objective $\min Γ (θ), \quad Γ (θ) = \frac{1}{N} \sum_{n = 1}^{N} L_{n} (θ)$ (1) where θ represents the network parameters. $L$ is usually a loss function used to evaluate the performance of the network, such as a log-likelihood function. Consider a training set $\{(x_{i}, y_{i})\}_{i = 1}^{N}$, in which $x_{i} \in ℝ^{d}$ is a training sample and $y_{i} \in ℝ$ is the corresponding ground-truth label. The loss of a sample $x_{i}$ can be defined as $L_{i} (θ) = \frac{1}{C} \sum_{j = 1}^{C} y_{i}^{j} f (x_{i}; θ)^{j}$ (2) where C represents the number of categories. In the back propagation phase of SGD, the network parameters at the t-th iteration can be updated by $θ_{t} = θ_{t - 1} - α_{t - 1} \cdot \frac{\partial L (x_{i}, y_{i})}{\partial θ_{t - 1}}$ (3) where $α_{t}$ is the learning rate that determines the step size of each iteration, and i is randomly drawn from 1, 2, ..., N. The latter term in Equation (3) is an estimate of the overall gradient of the training set. In practice, a more stable version can be given by $θ_{t} = θ_{t - 1} - \frac{α_{t - 1}}{M} \cdot \sum_{i = 1}^{M} \frac{\partial L (x_{i}, y_{i})}{\partial θ_{t - 1}}$ (4) where M is the batch size. The two versions of SGD are equivalent when M = 1. In mini-batch SGD, the samples in a mini-batch set are used to approximate the overall gradient, and the expectation $E (θ_{t} | θ_{t - 1})$ is identical to that of Equation (3). However, this version requires the evaluation of M derivatives at each iteration and is therefore more expensive. In addition, it has been proved that when the learning rate is slowly reduced, SGD has the same convergence behavior as mini-batch SGD [35].
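The difference between the single-sample update of Equation (3) and the mini-batch update of Equation (4) can be illustrated with a minimal sketch. The one-parameter model, the squared loss, the synthetic data and the step size below are all assumptions chosen for illustration; they are not the loss of Equation (2) or the settings used in this paper.

```python
import random

def grad_sample(theta, x, y):
    """Gradient of the per-sample squared loss 0.5 * (theta * x - y)**2 w.r.t. theta."""
    return (theta * x - y) * x

def sgd_step(theta, sample, lr):
    """Single-sample update in the form of Equation (3)."""
    x, y = sample
    return theta - lr * grad_sample(theta, x, y)

def minibatch_step(theta, batch, lr):
    """Mini-batch update in the form of Equation (4): average the M per-sample gradients."""
    m = len(batch)
    avg_grad = sum(grad_sample(theta, x, y) for x, y in batch) / m
    return theta - lr * avg_grad

rng = random.Random(0)
# Hypothetical toy data generated from y = 3x.
data = [(x, 3.0 * x) for x in (rng.uniform(-1.0, 1.0) for _ in range(200))]

theta = 0.0
for t in range(500):
    batch = rng.sample(data, 8)          # draw M = 8 samples without replacement
    theta = minibatch_step(theta, batch, lr=0.5)

# With M = 1 the two update rules produce the same result.
one = data[0]
same = sgd_step(1.0, one, 0.1) == minibatch_step(1.0, [one], 0.1)
```

On this noise-free toy problem the estimate `theta` approaches the generating slope, and `same` confirms that the two rules coincide for a batch of one sample.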

In particular, if the batch size M is large, then the gradients in a mini-batch set have a relatively large variance, which slows down the network convergence. Consider the case in which each loss $L$ is continuously differentiable and its gradient $\nabla L$ is Lipschitz continuous with a constant S > 0, i.e., $\| \nabla L (θ) - \nabla L (θ^{'}) \|_{2} ⩽ S \| θ - θ^{'} \|_{2}$ (5) for all $\{θ, θ^{'}\} \subset ℝ^{d}$, and Γ (θ) is strongly convex, i.e., $Γ (θ) - Γ (θ^{'}) - 0.5 η \| θ - θ^{'} \|_{2}^{2} ⩾ \nabla Γ (θ^{'})^{T} (θ - θ^{'})$ (6) where S ≥ η ≥ 0. In this case, a neural network can be optimized by SGD with a linear convergence rate of $O ((1 - η / S)^{t})$, as long as the learning rate $α_{t}$ is set to a constant α satisfying α < 1 / S [25]. Finally, the convergent neural network $M_{T}$ can be obtained by continuous iteration.
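As a rough numerical illustration of the linear rate, the sketch below runs plain (full-gradient) descent, not SGD, on a two-dimensional quadratic whose curvatures play the roles of η and S. The objective, the constants and the starting point are assumptions made for this example only, and the check illustrates the geometric envelope rather than reproducing the proof in [25].

```python
eta, S = 1.0, 10.0          # strong-convexity and Lipschitz constants of the toy objective
alpha = 0.09                # constant step size, chosen so that alpha < 1 / S

def objective(theta):
    """Toy quadratic with minimum value 0 at the origin."""
    return 0.5 * (eta * theta[0] ** 2 + S * theta[1] ** 2)

def gradient(theta):
    return (eta * theta[0], S * theta[1])

theta = (1.0, 1.0)
gamma0 = objective(theta)
within_bound = True
for t in range(1, 51):
    g = gradient(theta)
    theta = (theta[0] - alpha * g[0], theta[1] - alpha * g[1])
    # Compare the optimality gap with the geometric envelope (1 - eta / S)**t.
    within_bound = within_bound and objective(theta) <= (1 - eta / S) ** t * gamma0
```

Here `within_bound` stays true for all 50 iterations, i.e. the gap to the minimum shrinks at least as fast as $(1 - η / S)^{t}$ on this example.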

In fact, SGD has been empirically shown to be an efficient optimization method that plays a central role in many successful practical applications [33, 69].

1.1 Challenges

SGD based optimization methods [26, 27], while increasingly popular, are usually used as black-box optimizers, as practical explanations of their working process and properties are hard to come by. On the other hand, the naive SGD cannot guarantee a good convergence position, and it poses a few challenges that need to be solved:

Choosing a suitable learning rate can be difficult. A small learning rate leads to slow convergence, while a learning rate that is too large hinders model convergence, causing the loss to fluctuate around the minimum or even to diverge from it.

It may be harmful to adopt a fixed learning rate for updating all the parameters. Training data in a high-dimensional space are usually sparse, and thus features occur with various frequencies. A single fixed rate cannot adapt to the representation of features at different scales or dimensions.

Some optimization strategies containing dynamic learning rates have been applied to the whole training process, e.g., annealing [28]. However, these schedules or thresholds have to be configured in advance and therefore cannot adapt to the different networks or datasets.

1.2 Contributions

In this paper, we propose the layer-wise learning based stochastic gradient descent (LLb-SGD) method for accelerating and improving the optimization of deep CNNs. By simulating the cross-media propagation mechanism of light in the natural environment, i.e., light has different propagation speeds in different media, we set an adaptive learning rate for each layer of a neural network. To find a proper local optimum quickly, the dynamic learning sequence spanning the various layers adaptively adjusts the descending speed of the objective function in the multi-scale and multi-dimensional environment. It is calculated on a per-layer basis using first-order information and requires only a trivial amount of extra computation per iteration over gradient descent. Intuitively, LLb-SGD can escape from saddle points [31] and sharp local minima [32], which do not generalize well to unseen data. The descent method based on the layer-wise learning rate sequence results in a large variance, which enhances the impact of batch noise in these cases and helps the model escape from sharp local minima and saddle points. Moreover, our approach can dynamically incorporate distribution information observed in earlier training iterations to perform more informative gradient-based learning. To the best of our knowledge, this is the first work to introduce an adaptive layer-wise learning schedule with a convergence guarantee and to give experimental verification on two image classification benchmarks and six standard CNN architectures (ResNet-34/50 [13] for CIFAR-10 [29]; AlexNet [20], VGG-16 [8], GoogLeNet [9] and ResNet-152 for ImageNet [30]).

In summary, the contributions and benefits of this approach are as follows:

LLb-SGD introduces the adaptive and dynamic learning rates for each layer in deep CNNs. It is suitable for sparse gradients and naturally performs step annealing.

The sequence of learning rates is insensitive to hyper-parameters (e.g., parameter initialization, batch size and kernel size) during training iterations.

Due to its generality, our method can be applied to various network architectures and datasets.

We analyze the theoretical convergence properties of LLb-SGD and provide a regret bound on the convergence rate.

We empirically show that it is not necessary to set a separate learning rate for each parameter in order to ensure network convergence.

Experimental results demonstrate that LLb-SGD accelerates and improves network convergence and outperforms state-of-the-art methods.

1.3 Organization

The remainder of the paper is organized as follows. Section 2 gives a brief review of the related work on optimizing deep neural networks. In Section 3, we introduce the proposed LLb-SGD method and the update rule of the network parameters. Section 4 provides the theoretical analysis of the convergence conditions of LLb-SGD in non-convex optimization. Experimental results and comparisons are presented in Section 5. Finally, we conclude our work and discuss future directions in Section 6.

2 Related work

In view of the theoretical incompleteness and existing problems of SGD, many analytical modifications and improvements [55–60] have been proposed.

To address the difficulty SGD has in escaping from steep local minima, an additional momentum term [36] was added to help the network cross bad positions. The Nesterov-accelerated gradient (NAG) descent method [37] further introduced an approximate gradient of the next position to predict the future direction, rather than blindly descending along the slope. This predictive update prevents the objective from falling too quickly and enhances the responsiveness of the algorithm. Adagrad [38] proposed adapting the learning rates to the parameters, i.e., adopting a larger learning rate for features with lower frequency and a smaller learning rate for more frequent features. It is therefore suitable for sparse data and greatly improves the robustness of SGD, and it has been successfully applied to the training of GoogLeNet by Dean et al. [39]. However, the squared term of the accumulated gradient in Adagrad leads to an infinitely small learning rate, so that ultimately no additional information can be extracted. Adadelta [40] alleviated the problem of the monotonically decreasing learning rate in Adagrad by limiting the window used to accumulate historical gradients to a fixed size. The learning rate no longer needs to be set manually during the iteration, since it has been removed from the parameter update rule. In practical applications, the objective function tends to diverge easily during training iterations. RMSProp [41] divided the learning rate for a weight by a running average of recent gradient magnitudes for that weight.
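The contrast between the accumulated squared gradient of Adagrad and the running average of RMSProp can be seen in a small sketch of the effective per-weight step size. The constant gradient stream and the hyper-parameter values (`lr`, `rho`, `eps`) are illustrative assumptions, and the functions follow the commonly stated forms of the two rules rather than any specific implementation cited here.

```python
import math

def adagrad_rates(grads, lr=0.1, eps=1e-8):
    """Effective step size lr / (sqrt(sum of squared gradients) + eps) after each gradient."""
    acc, rates = 0.0, []
    for g in grads:
        acc += g * g
        rates.append(lr / (math.sqrt(acc) + eps))
    return rates

def rmsprop_rates(grads, lr=0.1, rho=0.9, eps=1e-8):
    """Effective step size lr / (sqrt(running average of squared gradients) + eps)."""
    avg, rates = 0.0, []
    for g in grads:
        avg = rho * avg + (1.0 - rho) * g * g
        rates.append(lr / (math.sqrt(avg) + eps))
    return rates

grads = [1.0] * 1000          # hypothetical stream of identical gradients
ada = adagrad_rates(grads)
rms = rmsprop_rates(grads)
```

With this input the Adagrad step size keeps shrinking as the accumulator grows, whereas the RMSProp step size settles near `lr` because the running average stops growing.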

Similarly, Adam [42] adaptively calculated the learning rate of each parameter according to the first and second moments of the respective gradients, which is appropriate for non-stationary targets and for problems with noisy or sparse gradients. Kingma et al. [21] pointed out that, as gradients become increasingly sparse in the later stage of the iterations, bias correction helps Adam to slightly outperform RMSProp. The authors also discussed Adamax, a variant of Adam that extends the L2-norm in the parameter update rule to the L∞-norm, which provides a simpler bound on the magnitude of the network parameter updates. Nesterov-accelerated Adam (Nadam) [43] proposed modifying the momentum term in Adam by taking advantage of the insights from NAG. Dauphin et al. put forward another adaptive learning method, ESGD [44], based on the equilibration preconditioner, and explored how negative eigenvalues of the Hessian help to design more appropriate learning rate schemes.
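The role of the moment estimates and of the bias correction can be made concrete with a minimal sketch of one Adam update for a single scalar parameter. The function name `adam_step` and the hyper-parameter values are assumptions for illustration; the update follows the commonly stated form of the rule.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based iteration count."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)                # bias-corrected second moment
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = adam_step(theta=1.0, grad=0.02, m=0.0, v=0.0, t=1)
first_step = 1.0 - theta
```

Because both moments are corrected for their zero initialization, the very first step has a magnitude close to `lr` even though the raw gradient is small.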

There are also some computationally complex high-order algorithms, such as L-BFGS [45] and the Quasi-Newton method [46], which are not suitable for high-dimensional data in practice and are therefore not discussed here. On the other hand, when a large amount of data and cheap computing power are available, distributed computation is a feasible way to accelerate network training. To this end, many asynchronous or parallel SGD variants have been developed, such as Downpour SGD [47] and Elastic Averaging SGD [48]. Although asynchronous running is fast, non-ideal communication between clients can lead to poor convergence.

Many other strategies have also been developed to help network training, including curriculum learning (i.e., adjusting the input order of the training samples) [49], model structure improvement (e.g., skip connections [13] and batch normalization [20]), noisy learning (random labeling [51] and corrupted data [52]), etc. These methods do not conflict with our optimization method and can achieve better results when combined with it.

In fact, the efficient training of neural networks is highly dependent on the network architecture, optimizer selection, initialization positions and a variety of other considerations. Unfortunately, the impact of these choices on the loss landscape of the underlying objective function is unclear. Researchers hope to analyze and understand the convergence process of a network by observing the descent trajectory of the objective function, so many methods for visualizing the iterative process have been studied [32, 54]. Because the high-dimensional parameter space of an over-parameterized deep learning model is too complex, researchers generally proceed along a random direction from the convergent position to depict the loss landscape. Therefore, how to find a meaningful direction and accurately describe the gradient descent process is a hot research topic, which makes it possible to build the implicit relationship between the non-convex structure of the objective function and the trainability of deep neural networks.

3 LLb-SGD method

3.1 Motivation and inspiration

A fixed learning rate is not conducive to the training of the network in later iterations, and it is expensive to prepare a separate learning rate for each parameter. In nature, the propagation speed of light differs in media of different densities due to frequency variations. Similarly, each layer in a neural network can be viewed as a medium of a different scale that transfers gradients in the back propagation phase, as shown in Fig. 1. According to many visualization results [43, 54], the features extracted by the convolutional kernels in the same layer are at the same semantic level. The earlier layers extract low-level features such as colors and edges, while the features of later layers become more advanced and abstract. Therefore, inspired by the propagation mechanism of light in various media, we consider setting a separate learning rate for each layer of the neural network instead of each parameter, where the learning rate for each layer is determined by all the parameters of that layer in a deep CNN.

Fig.1

Each layer in the neural network can be viewed as a medium of different scales to transfer gradients in the back propagation phase.

Consider a deep CNN model $M : f (\cdot)$ consisting of L trainable layers, where the parameters in the i-th layer are denoted as θⁱ. During training iterations, each error back propagation requires L learning rates α₁, α₂, ..., α_L for the L layers. All the learning rates are first initialized to a constant α₀. They are then updated in the next forward propagation and used for the subsequent iteration. These two steps are carried out alternately until the overall training loss no longer decreases and the network converges at a global or local minimum. The entire running flow is shown in Fig. 2.

Fig.2

The entire training and testing diagram of deep CNN.

3.2 Learning rate for each layer

The specific learning rate update rule is described in this part. Because of the difference of connection modes of the convolutional (conv) layer and the fully connected (fc) layer, their respective learning rate can be calculated separately in a clean and elegant pattern. During the whole network training process, only the first-order information, i.e. the L1-norm of parameter variation, needs to be additionally calculated.

3.2.1 Convolutional layer

In the convolutional layers of a CNN, the weights of each convolution kernel are shared by all the neurons connected to the inputs, which means that each kernel focuses on only one type of feature. During the back propagation in the t-th training iteration, the average Frobenius norm of the variation of all convolution kernels in the k-th layer can be calculated by $τ_{t}^{k} = \frac{1}{F_{k} S_{k}} \sum_{j = 1}^{F_{k}} \sqrt{\sum_{i = 1}^{S_{k}} (θ_{t}^{k, j, i} - θ_{t - 1}^{k, j, i})^{2}}$ (7) where $F_{k}$ and $S_{k}$ represent the number of kernels and the number of weights in one kernel at the k-th layer, respectively, and $θ_{t}^{k, j, i}$ denotes the i-th weight of the j-th convolution kernel in the k-th layer. $τ_{t}^{k}$ characterizes the variation of the weights at this iteration, which contains the step information and the gradient information of the last iteration. Then the decay term of the learning rate can be defined as $ξ_{t}^{k} = 1 - \frac{1}{e^{τ_{t}^{k}}} \in (0, 1)$ (8)
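Equations (7) and (8) can be traced on a small numerical example. In the sketch below each kernel is stored as a flat list of its $S_{k}$ weights; this layout, the function names and the sample values are assumptions made for illustration.

```python
import math

def conv_variation(prev_kernels, curr_kernels):
    """Equation (7): average Frobenius norm of the kernel updates in one layer.

    Each argument is a list of F_k kernels, each a flat list of S_k weights.
    """
    f_k = len(curr_kernels)
    s_k = len(curr_kernels[0])
    total = 0.0
    for prev, curr in zip(prev_kernels, curr_kernels):
        total += math.sqrt(sum((c - p) ** 2 for p, c in zip(prev, curr)))
    return total / (f_k * s_k)

def decay_term(tau):
    """Equation (8): xi = 1 - exp(-tau)."""
    return 1.0 - math.exp(-tau)

# Two hypothetical kernels with four weights each; only the first one changes.
prev = [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
curr = [[3.0, 4.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
tau = conv_variation(prev, curr)   # (5 + 0) / (2 * 4) = 0.625
xi = decay_term(tau)
```

The first kernel moves by a Frobenius norm of 5 and the second does not move, so τ is 5 / 8, and the resulting decay term ξ lies strictly between 0 and 1.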

On the other hand, we decide the descent direction by considering the current value of the training loss and its direction of change, as given by $δ_{t} = \begin{cases} 1 & \text{if } L^{t} < U \text{ or } L^{t} - L^{t - 1} < 0 \\ - 1 & \text{if } L^{t} > U \text{ and } L^{t} - L^{t - 1} > 0 \end{cases}$ (9) where U is a threshold used to determine the convergence state of the deep CNN. A loss that is both too large and rising means that the model is trapped in a bad local extremum and needs to search for a more appropriate solution along the opposite direction of gradient descent. Therefore, $δ_{t}$ is set to -1 to change the step direction.

Finally, the learning rate for the k-th layer at the t-th training iteration is reduced in a negative exponential form, which can be formulated by $α_{k}^{t} = δ_{t} α_{k}^{t - 1} \cdot ξ_{t}^{k}$ (10)
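A minimal sketch of how Equations (8)-(10) combine into one learning rate update is given below. Equation (9) does not specify the boundary cases (a loss exactly equal to U, or an unchanged loss); assigning them to +1 is an assumption of this sketch, as are the function names and the numerical values.

```python
import math

def direction(loss, prev_loss, threshold):
    """Equation (9): return -1 only when the loss is above U and still rising."""
    if loss > threshold and loss - prev_loss > 0:
        return -1
    return 1   # boundary cases not covered by Equation (9) default to +1 (assumption)

def update_rate(prev_rate, tau, delta):
    """Equation (10): alpha_k^t = delta_t * alpha_k^{t-1} * xi_t^k."""
    xi = 1.0 - math.exp(-tau)      # decay term of Equation (8)
    return delta * prev_rate * xi

# Hypothetical losses with threshold U = 0.2 and layer variation tau = 0.5.
rate_falling = update_rate(0.1, tau=0.5, delta=direction(0.8, 0.9, threshold=0.2))
rate_rising = update_rate(0.1, tau=0.5, delta=direction(0.9, 0.8, threshold=0.2))
```

When the loss is falling the rate keeps its sign and is scaled by ξ, and when the loss is above the threshold and rising the same magnitude is applied with the opposite sign.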

So far, the specific decay strategy of learning rate sequence for convolutional layers in deep CNN has been fully introduced.

3.2.2 Fully connected layer

For the fully connected layers in a deep CNN, each neuron is connected by a single weight, and therefore the calculation of the average Frobenius norm in the iterative process can be simplified to $τ_{t}^{k} = \frac{1}{P_{k}} \sum_{i = 1}^{P_{k}} \sqrt{(θ_{t}^{k, i} - θ_{t - 1}^{k, i})^{2}} = \frac{1}{P_{k}} \sum_{i = 1}^{P_{k}} | θ_{t}^{k, i} - θ_{t - 1}^{k, i} |$ (11) where $P_{k}$ denotes the number of weights in the k-th fully connected layer. A single weight can be regarded as a non-shared convolution kernel of size 1×1, whose Frobenius norm of variation is equivalent to the L2-norm, i.e., the absolute value of the variation. It is worth noting that the variation information has already been preserved in the last iteration, as given by $θ_{t}^{k} - θ_{t - 1}^{k} = - α_{t - 1}^{k} \cdot \frac{\partial L (x_{i}, y_{i})}{\partial θ_{t - 1}^{k}}$ (12)
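The following sketch checks numerically that the variation of Equation (11) can be obtained either from the stored weights or, through Equation (12), from the previous learning rate and gradient. The weight and gradient values are hypothetical and chosen to be exactly representable in floating point.

```python
def fc_variation(prev_weights, curr_weights):
    """Equation (11): mean absolute change of the P_k weights of one layer."""
    p_k = len(curr_weights)
    return sum(abs(c - p) for p, c in zip(prev_weights, curr_weights)) / p_k

lr = 0.5
prev = [1.0, -2.0, 0.5, 4.0]
grads = [0.5, -1.0, 0.0, 2.0]
curr = [w - lr * g for w, g in zip(prev, grads)]   # weight change of Equation (12)

tau_from_weights = fc_variation(prev, curr)
# Same quantity from the step size and gradient magnitudes of the last iteration.
tau_from_grads = abs(lr) * sum(abs(g) for g in grads) / len(grads)
```

Both routes give the same value here, which is why no information beyond the last step and its gradient is needed to evaluate τ.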

It can be clearly seen that the overall gradient of the last iteration is also considered in this time, which helps to optimize with only first-order information. Therefore, only a small amount of extra computation is introduced.

The decay term ξ of the learning rate and the descent direction δ of the fully connected layers are consistent with those of the convolutional layers. The learning rate for the fully connected layers can then be updated according to Equation (10). Finally, we obtain the learning rate sequence $\{α_{1}^{t}, α_{2}^{t}, \ldots, α_{L}^{t}\}$ of all network layers for the t-th training iteration.

Many advanced networks have replaced the fully connected layer with a global average pooling operation [9, 13], because the excessive parameters brought about by the fully connected layer affect the network generalization on unseen data. However, Zhang et al. [50] found that the fully connected layer can act as a “firewall” in the transfer of network representation capacity by providing a large capacity of network architecture, especially if the source domain and the target domain differ greatly. In fact, it is still meaningful to discuss the influence of the learning rate decay strategy on the optimization of fully connected layers, because the sufficient optimization of an over-parameterized non-convex model is the basis of good generalization.

3.3 Parameter update rule

Each update of the model parameters in the training process requires M forward propagation steps and one back propagation step.

In the forward propagation process, the sample images x₁, x₂, ..., x_M in one mini-batch set are fed into the first layer of the deep CNN, and the corresponding output h₁ can be calculated by $h_{1} = σ (x; θ) = σ (W_{0}^{1} x + b_{0}^{1})$ (13) where W and b represent the network weights and biases, respectively, and σ(·) denotes the element-wise non-linear activation function, such as ReLU [8] or Swish [19]. Since the output of one layer is the input of the next, the output of the k-th layer for k = 2, ..., L - 1 can be calculated by $h_{k} = σ (W_{0}^{k} h_{k - 1} + b_{0}^{k})$ (14)

Table 1

Algorithm 1: Layer-wise learning based SGD.
Initialization: deep CNN $M_{0}$, learning rate sequence $\{α_{1}^{0}, α_{2}^{0}, \ldots, α_{L}^{0}\}$, batch size M, threshold U, network parameter θ₀.
Input: training set $\{(x_{i}, y_{i})\}_{i = 1}^{N}$.
for t = 1 : T do
    for m = 1 : M do
        forward propagation:
            compute the loss of x_m in batch D_t as Equation (2);
            compute the gradient as Equations (18) and (19);
    end for
    back propagation:
        compute the parameter θ_t as Equation (4);
        update learning rates $\{α_{1}^{t}, α_{2}^{t}, \ldots, α_{L}^{t}\}$ as Equation (10);
end for
Return: convergent network parameter θ_T.

Then the final output (prediction result) of deep CNN is $f (x) = W_{0}^{L} h_{L - 1} + b_{0}^{L}$ (15)

In the end, the training loss of a sample image x can be calculated according to Equation (2).
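The forward pass of Equations (13)-(15) can be sketched for a tiny fully connected network as follows. The use of ReLU as σ, the list-of-rows weight layout, and the layer sizes and values are assumptions for illustration; convolutional layers are not modelled here.

```python
def relu(v):
    """Element-wise ReLU activation."""
    return [max(0.0, a) for a in v]

def affine(weights, bias, v):
    """Compute W v + b for a weight matrix stored as a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) + b for row, b in zip(weights, bias)]

def forward(layers, x):
    """Equations (13)-(15): activated hidden layers followed by a linear output layer."""
    h = x
    for weights, bias in layers[:-1]:
        h = relu(affine(weights, bias, h))
    weights, bias = layers[-1]
    return affine(weights, bias, h)

layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),   # hidden layer: 2 inputs -> 2 units
    ([[1.0, 1.0]], [0.1]),                     # output layer: 2 units -> 1 output
]
output = forward(layers, [2.0, 1.0])
```

For the input (2, 1) the hidden layer produces (1.0, 1.5), and the output layer adds the two activations and the bias 0.1.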

In the back propagation process, parameters (i.e., weights and biases) in k-th layer at the t-th training iteration can be updated by $W_{t}^{k} = W_{t - 1}^{k} - \frac{α_{t}^{k}}{M} \sum_{i = 1}^{M} \frac{\partial L (x_{i}, y_{i})}{\partial W_{t - 1}^{k}}$ (16) $b_{t}^{k} = b_{t - 1}^{k} - \frac{α_{t}^{k}}{M} \sum_{i = 1}^{M} \frac{\partial L (x_{i}, y_{i})}{\partial b_{t - 1}^{k}}$ (17) where $\frac{\partial L (x_{i}, y_{i})}{\partial W_{t - 1}^{k}} = \frac{1}{M} \sum_{j = 1}^{C} \frac{x_{i}^{j} (h_{k}^{j} - y_{i}^{j})}{h_{k}^{j} (1 - h_{k}^{j})}$ (18) $\frac{\partial L (x_{i}, y_{i})}{\partial b_{t - 1}^{k}} = \frac{1}{M} \sum_{j = 1}^{C} (h_{k}^{j} - y_{i}^{j})$ (19)

Through continuous alternate iterations of the above two steps, the convergent network $M_{T} : f (x; θ_{T})$ can be obtained by $θ_{T} = \arg \min Γ (θ)$ (20)

Till now, the update rule of parameters in the training process of deep CNN model has been fully described.

The pseudo-code of LLb-SGD, shown in Algorithm 1, is quite straightforward.
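To make the control flow of Algorithm 1 concrete, the sketch below runs it on a toy problem. Everything specific is an assumption made for illustration: the model is a scalar two-layer linear map f(x) = w2 · w1 · x, the loss is a mean squared error rather than Equation (2), each layer holds a single weight so that Equation (11) is used for τ, and the first iteration (which has no previous loss) uses δ = 1.

```python
import math
import random

def llb_sgd(data, theta, rates, threshold, batch_size, steps, seed=0):
    """Toy LLb-SGD loop for the scalar two-layer model f(x) = w2 * w1 * x."""
    rng = random.Random(seed)
    rates = list(rates)            # work on a copy of the initial per-layer rates
    prev_loss = None
    history = []
    for _ in range(steps):
        batch = rng.sample(data, batch_size)
        w1, w2 = theta[0][0], theta[1][0]
        # Forward pass: batch loss and per-layer gradients (squared-error stand-in).
        residuals = [w2 * w1 * x - y for x, y in batch]
        loss = sum(0.5 * r * r for r in residuals) / batch_size
        g1 = sum(r * w2 * x for r, (x, _) in zip(residuals, batch)) / batch_size
        g2 = sum(r * w1 * x for r, (x, _) in zip(residuals, batch)) / batch_size
        # Backward pass: every layer steps with its own learning rate.
        new_theta = [[w1 - rates[0] * g1], [w2 - rates[1] * g2]]
        # Direction term of Equation (9); no previous loss on the first pass (assumption: +1).
        rising = prev_loss is not None and loss - prev_loss > 0
        delta = -1 if (loss > threshold and rising) else 1
        # Per-layer variation, decay term and rate update: Equations (11), (8) and (10).
        for k in range(len(rates)):
            tau = sum(abs(c - p) for p, c in zip(theta[k], new_theta[k])) / len(theta[k])
            rates[k] = delta * rates[k] * (1.0 - math.exp(-tau))
        theta, prev_loss = new_theta, loss
        history.append((loss, tuple(rates)))
    return theta, history

# Hypothetical data from y = 2x and equal initial rates for both layers.
data = [(i / 10.0, 2.0 * i / 10.0) for i in range(-10, 11)]
theta, history = llb_sgd(data, [[1.0], [1.0]], [0.05, 0.05],
                         threshold=0.1, batch_size=4, steps=20)
```

Each entry of `history` records the batch loss and the per-layer rates after that iteration. Since the decay term of Equation (8) never exceeds 1, the magnitude of each layer's rate cannot grow from one iteration to the next in this sketch.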

In the evaluation phase, the final training loss is used to observe the effectiveness of LLb-SGD, and the convergent CNN model is tested on the test set to evaluate its actual classification performance.

4 Convergence analysis

We discuss the convergence of deep CNNs under LLb-SGD using the online learning framework [27].

Given a sequence of convex objective functions $\{Γ_{1} (θ_{1}), \ldots, Γ_{T} (θ_{T})\}$ in the training process of a deep CNN, our goal is to update the network parameters to minimize the objective function $Γ_{t} (θ_{t})$ as much as possible at the t-th iteration. The regret $R_{T}$ in the network optimization is defined as the sum of the differences between all the previous objectives $Γ_{t} (θ_{t})$ and the convergent one $Γ_{T} (θ_{T})$, and is calculated by $R_{T} = \sum_{t = 1}^{T} [Γ_{t} (θ_{t}) - Γ_{T} (θ_{T})]$ (21)
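As a small worked example of Equation (21), the sketch below computes the regret of a recorded sequence of objective values. The sequence itself is hypothetical.

```python
def regret(objectives):
    """Equation (21): sum of the gaps between each objective value and the final one."""
    final = objectives[-1]
    return sum(value - final for value in objectives)

losses = [1.0, 0.5, 0.25, 0.25]        # hypothetical values of the objective at t = 1..4
r_total = regret(losses)               # 0.75 + 0.25 + 0 + 0 = 1.0
average_regret = r_total / len(losses)
```

The average regret per iteration is the quantity that a sublinear regret bound drives towards zero as T grows.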

Then we can obtain the regret bound of LLb-SGD, as given by $R_{T} ⩽ \frac{M T L^{2}}{2 α_{0} (1 - α_{0})} + \frac{\sqrt{N} \log T}{2}$ (22)

Proof. First, by using Lipschitz assumption in [31], we can easily get (y – x)²≥(y – f (x))². Also, $\frac{\partial L (x_{i}, y_{i})}{\partial θ_{t - 1}^{k}} ⩽ L^{2} / - \sum_{l = 1}^{L} α_{l}$ (23)

So, $\begin{matrix} | | θ_{t} - θ_{T} | |_{F}^{2} ⩽ | | θ_{t - 1} - θ_{T} | |_{F}^{2} + L^{2} / - \sum_{l = 1}^{L} α_{l} \\ - \sum_{l = 1}^{L} α_{l} | | θ_{t - 1}^{l} - θ_{T}^{l} | |_{F} \end{matrix}$ (24)

Then we attempt to bound the regret without playing action θ_T at t-th iteration. By induction and recursion, $\begin{matrix} R_{T} = \sum_{t = 1}^{T} \sum_{i = 1}^{M} \frac{α_{t} \cdot | | θ_{t} - θ_{T} | |_{F}^{2}}{M} \cdot \frac{\partial L (x_{i}, y_{i})}{\partial θ_{t - 1}^{k}} \\ ⩽ \frac{\sqrt{N} log T}{2} + \sum_{t = 1}^{T} \sum_{l = 1}^{L} α_{t}^{l} \end{matrix}$ (25)

From the learning rate update rule given in Equation (10) and the classic Robbins-Monro condition [67, 68] during the training process, namely $\sum_{t = 1}^{T} \sum_{l = 1}^{L} α_{t}^{l} = \infty \text{ and } \sum_{t = 1}^{T} \sum_{l = 1}^{L} (α_{t}^{l})^{2} < \infty$ (26)

we can further obtain the following inequality: $\sum_{t = 1}^{T} \sum_{l = 1}^{L} α_{t}^{l} = α_{0} \cdot \sum_{t = 1}^{T} (δ_{t} ξ_{t})^{t - 1} ⩽ \frac{M T L^{2}}{2 α_{0} (1 - α_{0})}$ (27)

By plugging this into the Equation (25), we can obtain the regret bound shown in Equation (22) with a complexity of O (log N / T).
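The two requirements of the Robbins-Monro condition in Equation (26) can be illustrated numerically with the textbook schedule α_t = 1/t. This schedule is an assumption used only to show what the condition asks for; it is not the LLb-SGD learning rate sequence.

```python
import math

def partial_sums(rate, horizon):
    """Return the partial sums of rate(t) and rate(t)**2 for t = 1..horizon."""
    s1 = s2 = 0.0
    for t in range(1, horizon + 1):
        a = rate(t)
        s1 += a
        s2 += a * a
    return s1, s2

def harmonic(t):
    return 1.0 / t

sum_small, sq_small = partial_sums(harmonic, 1_000)
sum_large, sq_large = partial_sums(harmonic, 100_000)
limit_of_squares = math.pi ** 2 / 6    # known limit of the sum of 1 / t**2
```

Extending the horizon keeps increasing the sum of the rates, while the sum of the squared rates barely moves and stays below its finite limit.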

Here we only discuss Euclidean geometry without considering the gradient descent on a non-Euclidean geometry. Moreover, the loss function is usually in a non-Markovian state in practice, depending not only on the nearest vector, but also on the previous vectors.

Fig.3

Details of network architecture for image classification.

Fig.4

Some examples of images in CIFAR-10 (first row) and ImageNet (second row).

5 Experimental results and comparison

5.1 Experimental setup

5.1.1 Baseline CNNs and benchmark datasets

We apply the LLb-SGD method to six classical CNN structures, including ResNet-34/50 for CIFAR-10, and AlexNet, VGG-16, GoogLeNet and ResNet-152 for ImageNet. AlexNet and VGG-16 are typical deep CNN models, which are composed of convolutional layers, down-sampling layers and fully connected layers. Residual blocks and Inception modules are added to ResNet and GoogLeNet, respectively. Furthermore, their fully connected layers have been replaced by global average pooling layers. Details of the above deep CNN architectures are shown in Fig. 3.

The two benchmark datasets, CIFAR-10 and ImageNet, represent coarse-grained and fine-grained classification tasks respectively, and are used to evaluate the effectiveness of the optimization methods under different difficulties. ImageNet is a large-scale dataset, which contains 1,000 classes with 1.2 million natural images for training and 50,000 images for testing. CIFAR-10 is a popular benchmark for small-scale classification, which contains 10 classes with 50,000 images for training and 10,000 images for testing. Some examples of images in the two datasets are shown in Fig. 4.

Table 2
Hyper-parameters setting in the training of deep CNN on two datasets

Datasets	Learning rate (for all layers)	Batch size	Dropout rate	Weight decay	Threshold
CIFAR-10	0.05	128	0.5	0.0001	0.1
ImageNet	0.1	256	0.5	0.0005	0.2

Table 3

Final training and test loss of six deep CNNs on CIFAR-10 and ImageNet

Methods	CIFAR-10				ImageNet
ResNet-34		ResNet-50		AlexNet		VGG-16		GoogLeNet		ResNet-152
Train loss	Test loss	Train loss	Test loss	Train loss	Test loss	Train loss	Test loss	Train loss	Test loss	Train loss	Test loss
naive SGD	0.2241	0.6724	0.2010	0.6447	0.6435	0.9923	0.5732	0.9220	0.5076	0.8671	0.3772	0.7938
path-SGD	0.2193	0.6700	0.1970	0.6440	0.6414	0.9877	0.5692	0.9172	0.5043	0.8669	0.3730	0.7891
NAG	0.2172	0.6711	0.1978	0.6432	0.6340	0.9939	0.5729	0.9221	0.5032	0.8674	0.3734	0.7931
Adagrad	0.2216	0.6719	0.2034	0.6483	0.6434	0.9958	0.5712	0.9244	0.5037	0.8637	0.3726	0.7963
Adam	0.1903	0.6540	0.1750	0.6255	0.6066	0.9753	0.5421	0.9179	0.4770	0.8406	0.3382	0.7670
Adadelta	0.2027	0.6609	0.1884	0.6455	0.6221	0.9771	0.5614	0.9130	0.4889	0.8552	0.3585	0.7982
RMSProp	0.2298	0.6583	0.1969	0.6284	0.6134	0.9904	0.5617	0.9091	0.4712	0.8489	0.3507	0.7765
ESGD	0.2080	0.6640	0.1994	0.6353	0.6291	1.0004	0.5520	0.8977	0.4836	0.8414	0.3674	0.7640
LLb-SGD	0.1984	0.6614	0.1841	0.6260	0.6033	0.9750	0.5442	0.9024	0.4788	0.8481	0.3113	0.7534

Fig.5

Training curves of various deep CNNs on CIFAR-10 and ImageNet.

5.1.2 Training details

Before training starts, the network hyper-parameters, including the initial learning rate, batch size, dropout rate, weight decay, and threshold, are set separately for the two datasets, as shown in Table 2. All the training and testing processes of the various deep CNNs are carried out under the TensorFlow deep learning framework [53], on a workstation consisting of an Intel Core i9-7980XE CPU, two NVIDIA GeForce GTX Titan XP GPUs, 2×16 gigabytes of memory, and 2 terabytes of storage. To make an impartial comparison with state-of-the-art methods, no data augmentation techniques are used during the training process.

5.2 Performance comparison

To confirm the theoretical results and insights, we experimented with the LLb-SGD method in comparison with naive SGD, path-SGD, NAG, Adagrad, Adam, Adadelta, RMSProp, and ESGD. Five-fold cross-validation results are used as the criteria for evaluating the various methods. The final training and test losses are summarized in Table 3. Training curves of the six deep CNNs using the above optimization methods are shown in Fig. 5.

Table 4
Training and test loss of six deep CNNs on CIFAR-10 and ImageNet under different batch sizes

Batch size	CIFAR-10				ImageNet
ResNet-34		ResNet-50		AlexNet		VGG-16		GoogLeNet		ResNet-152
Train loss	Test loss	Train loss	Test loss	Train loss	Test loss	Train loss	Test loss	Train loss	Test loss	Train loss	Test loss
8	0.2674	0.6643	0.2324	0.6434	0.6980	0.9650	0.6053	0.9012	0.5255	0.8679	0.3668	0.7702
16	0.2517	0.6552	0.2247	0.6413	0.6699	0.9587	0.5988	0.9153	0.5180	0.8655	0.3612	0.7597
32	0.2356	0.6646	0.2153	0.6425	0.6496	0.9618	0.5855	0.9175	0.5092	0.8582	0.3377	0.7547
64	0.2259	0.6639	0.2142	0.6401	0.6380	0.9619	0.5758	0.9176	0.5031	0.8622	0.3322	0.7584
128	0.2125	0.6680	0.2020	0.6349	0.6272	0.9650	0.5683	0.9123	0.4940	0.8600	0.3273	0.7605
256	0.2037	0.6663	0.1919	0.6376	0.6126	0.9784	0.5455	0.8995	0.4845	0.8670	0.3160	0.7531

Table 5

Training and test loss of six deep CNNs on CIFAR-10 and ImageNet under different initializations

Initialization	CIFAR-10				ImageNet
	ResNet-34		ResNet-50		AlexNet		VGG-16		GoogLeNet		ResNet-152
	Train loss	Test loss	Train loss	Test loss	Train loss	Test loss	Train loss	Test loss	Train loss	Test loss	Train loss	Test loss
Uniform	0.2076	0.6707	0.1791	0.6300	0.6057	0.9727	0.5498	0.8966	0.4780	0.8466	0.3146	0.7571
Normal	0.1945	0.6613	0.1834	0.6278	0.6059	0.9782	0.5414	0.9046	0.4807	0.8439	0.3065	0.7534
MSRA	0.2041	0.6594	0.1852	0.6225	0.6064	0.9719	0.5443	0.9049	0.4837	0.8538	0.3119	0.7489
Xavier	0.1940	0.6584	0.1884	0.6229	0.6072	0.9718	0.5496	0.9005	0.4750	0.8450	0.3128	0.7531

Experimental results for the two CNNs on CIFAR-10 show that the proposed LLb-SGD method converges approximately four times faster than naive SGD. Furthermore, it significantly outperforms NAG, path-SGD, and ESGD. The difference in the descent of the training loss becomes particularly obvious after 20 epochs. It is worth noting that Adam and Adadelta eventually converge considerably faster and to better solutions than Adagrad, which shows that annealing the learning rate contributes to the convergence of the network in the later iterations.

LLb-SGD also performs best on the ImageNet dataset, although the difference in the decline rate of the training loss relative to other methods is not as significant as on CIFAR-10. It achieves better optimization results and improves the quality of the learned networks. Compared with the state-of-the-art training and test loss, our method achieves a reduction of 0.027 and 0.011, respectively. A possible explanation is that optimization by LLb-SGD introduces some implicit regularization: gradually annealing the learning rate helps deep CNNs find simpler solutions in parameter space. In addition, the advantages of layer-wise learning are more clearly reflected in deep CNNs with more layers, as shown empirically by the larger improvements obtained for ResNet than for AlexNet and VGG-16.

5.3 Robustness to hyper-parameters

5.3.1 Impact of batch size

We also empirically evaluate the impact of batch size on the optimization of deep CNNs. The performance of the six deep CNNs under different batch size settings is shown in Table 4. We find that a reasonable choice lies between 16 and 256 (both inclusive). The test loss does not change significantly with the batch size, which means that network optimization through LLb-SGD is robust to this choice. On the other hand, the larger the batch size, the lower the training loss. We believe that the gradient noise introduced by smaller batches plays an implicit regularization role, which helps the network generalize well to the test data despite the higher training loss.
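To make the batch-size comparison concrete, the sketch below computes the gap between test and training loss for ResNet-34 on CIFAR-10, using the values reported in the batch-size table. The names `resnet34_losses` and `loss_gap` are illustrative and are not part of the original experiments.

```python
# (train loss, test loss) for ResNet-34 on CIFAR-10, keyed by batch size,
# copied from the batch-size table.
resnet34_losses = {
    8: (0.2674, 0.6643),
    16: (0.2517, 0.6552),
    32: (0.2356, 0.6646),
    64: (0.2259, 0.6639),
    128: (0.2125, 0.6680),
    256: (0.2037, 0.6663),
}


def loss_gap(losses):
    """Return {batch_size: test_loss - train_loss}, rounded to 4 decimals."""
    return {b: round(test - train, 4) for b, (train, test) in losses.items()}


gaps = loss_gap(resnet34_losses)
for batch_size, gap in gaps.items():
    print(f"batch size {batch_size:>3}: gap = {gap:.4f}")
```

For this network the test loss varies by less than 0.013 across batch sizes, while the gap grows from about 0.40 to about 0.46 as the batch size increases, because the training loss keeps falling.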

5.3.2 Impact of network initialization

Many studies [23, 24] have shown that network initialization has an important impact on optimization. A good initial position helps avoid poor local minima and saddle points, which is why many computer vision tasks start from a network pre-trained on ImageNet. Here, we illustrate the ability of LLb-SGD to alleviate ill-conditioning problems by comparing the performance of different initialization methods: uniform distribution, normal distribution, Xavier [23], and MSRA [24]. Experimental results are shown in Table 5.

The training and test losses of Xavier and MSRA are not always better than those of the uniform and normal distributions, which indicates that LLb-SGD based optimization does not depend significantly on the initial position, and that the network tends to converge to a desirable local minimum. The separate learning rate at each layer ensures that the bottom layers of the network can be fully optimized even if the gradient flow almost vanishes (especially in AlexNet, VGG-16, and GoogLeNet). Moreover, layer-wise learning reduces the dependence between parameters, which promotes sparsity and makes deep CNNs easier to optimize.
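The idea of keeping a separate learning rate for each layer can be sketched in a few lines. The example below is a generic layer-wise SGD step, not the LLb-SGD update rule itself; the function name `layerwise_sgd_step`, the toy weights, and the chosen rates are assumptions made purely for illustration.

```python
def layerwise_sgd_step(weights, grads, layer_lrs):
    """Apply one SGD step in which each layer uses its own learning rate.

    weights, grads: lists with one entry per layer, each a flat list of floats.
    layer_lrs: list of per-layer learning rates (hypothetical values).
    """
    if not (len(weights) == len(grads) == len(layer_lrs)):
        raise ValueError("weights, grads and layer_lrs need one entry per layer")
    return [
        [w - lr * g for w, g in zip(layer_w, layer_g)]
        for layer_w, layer_g, lr in zip(weights, grads, layer_lrs)
    ]


# Toy two-layer example: the bottom layer receives a much smaller gradient,
# so it is given a larger learning rate to keep its update from vanishing.
weights = [[1.0, -1.0], [0.5, 0.25]]
grads = [[0.001, -0.002], [0.1, 0.2]]
layer_lrs = [1.0, 0.1]

new_weights = layerwise_sgd_step(weights, grads, layer_lrs)
print(new_weights)
```

With a single global rate of 0.1, the bottom layer in this toy case would move ten times less than it does here, which is the situation a per-layer rate is meant to avoid.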

5.4 Visualization of training iterations

5.4.1 Adjustment of learning rate

We compare the variation of the learning rate in LLb-SGD with two baseline schedules during training: 1/t decay ($\alpha_t = \alpha_0 / (1 + kt)$) and exponential decay ($\alpha_t = \alpha_0 e^{-kt}$), where $k$ is set to 0.9 in both cases. The mean of the learning rates over all layers of each deep CNN trained with LLb-SGD is used for comparison, and the results are shown in Fig. 6. The learning rates of LLb-SGD initially drop rapidly and then decline more slowly. They are larger than those of the two baseline schedules once the number of iterations exceeds $10^4$ on CIFAR-10 and $10^5$ on ImageNet. Such learning rates are typically high enough to cause divergence in most methods, but here they occur only near the end of training, when gradients are small. In other words, moderately large learning rates help the model converge in the later stage.
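The two baseline schedules are easy to reproduce. The following is a minimal sketch, assuming that t counts training steps (the text does not state the unit) and using the CIFAR-10 initial learning rate of 0.05 from the hyper-parameter table; the function names are illustrative.

```python
import math


def inverse_time_decay(alpha0, k, t):
    """1/t decay: alpha_t = alpha0 / (1 + k * t)."""
    return alpha0 / (1.0 + k * t)


def exponential_decay(alpha0, k, t):
    """Exponential decay: alpha_t = alpha0 * exp(-k * t)."""
    return alpha0 * math.exp(-k * t)


alpha0 = 0.05  # initial learning rate for CIFAR-10 (hyper-parameter table)
k = 0.9        # decay constant used for both baselines in the text

for t in (0, 1, 10, 100):
    print(t, inverse_time_decay(alpha0, k, t), exponential_decay(alpha0, k, t))
```

For any t > 0 the exponential schedule lies below the 1/t schedule, since exp(kt) > 1 + kt, so both baselines shrink the step size far more aggressively at late iterations than a schedule that levels off.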

Fig. 6

Adjustment of the learning rate during the optimization process on CIFAR-10 and ImageNet.

We then examine the robustness of multiple deep CNNs trained by LLb-SGD to the initial learning rate, in terms of the number of training iterations and the training and test loss, as shown in Fig. 7. The initial learning rate is varied from 0.001 to 0.1. Training is stopped when the decrease is no more than 0.1 after 1 000 iterations. A larger initial learning rate usually leads to fewer training iterations, that is, the network converges faster. Moreover, the initial learning rate has little impact on the final training and test loss, which demonstrates the insensitivity of the scheme to the initial learning rate.
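One possible reading of this stopping rule is sketched below. It assumes that the "decrease" refers to the training loss and that it is measured between the current iteration and the one 1 000 iterations earlier; both points are assumptions, and `should_stop` is a hypothetical helper rather than code from the experiments.

```python
def should_stop(loss_history, threshold=0.1, window=1000):
    """Return True when the loss fell by no more than `threshold`
    over the last `window` iterations (assumed reading of the rule)."""
    if len(loss_history) <= window:
        return False  # not enough history to compare yet
    decrease = loss_history[-window - 1] - loss_history[-1]
    return decrease <= threshold


# Small illustration with a window of 2 iterations instead of 1 000.
history = [1.0, 0.6, 0.4, 0.35, 0.33]
print(should_stop(history[:3], threshold=0.1, window=2))  # loss still falling quickly
print(should_stop(history, threshold=0.1, window=2))      # loss has flattened out
```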

Fig. 7

The performance of deep CNNs trained by LLb-SGD under various initial learning rates.

5.4.2 Descent trajectory of model loss

We consider the performance of the optimization algorithms in two ill-conditioned cases (both containing saddle points) by observing the descent trajectory of the loss function of a small network, as shown in Fig. 8. Two directions along the principal components of the weight matrix are used to observe the optimization trajectory in two-dimensional space. The parameters near the optimal value are calculated as

$$\hat{\theta}_{(d_1, d_2)} = \theta_T + (d_1 \mu + d_2 \upsilon) E \tag{28}$$

where $d_1 \in (-1, 1)$ and $d_2 \in (-1, 1)$ denote the step lengths along the first two directions $\mu$ and $\upsilon$ of the principal component analysis, and $E$ is an all-one matrix with the same dimensions as the network weights.
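A small sketch may help show how Eq. (28) is used to build a two-dimensional loss surface. It assumes that the directions are given as arrays with the same shape as the weights, so that the product with the all-one matrix E is element-wise; the toy `saddle_loss`, the helper names, and the grid size are hypothetical and stand in for the small network used in the experiments.

```python
def perturb(theta_T, mu, upsilon, d1, d2):
    """Evaluate Eq. (28) for flat parameter lists.

    Assumption: mu and upsilon have the same length as theta_T and the
    product with the all-one matrix E is taken element-wise.
    """
    E = [1.0] * len(theta_T)
    return [t + (d1 * m + d2 * u) * e
            for t, m, u, e in zip(theta_T, mu, upsilon, E)]


def loss_surface(loss_fn, theta_T, mu, upsilon, steps=5):
    """Evaluate loss_fn on a steps x steps grid with d1, d2 strictly inside (-1, 1)."""
    ds = [-1.0 + 2.0 * (i + 1) / (steps + 1) for i in range(steps)]
    return [[loss_fn(perturb(theta_T, mu, upsilon, d1, d2)) for d2 in ds]
            for d1 in ds]


def saddle_loss(theta):
    """Toy loss with a saddle point at the origin (illustration only)."""
    return theta[0] ** 2 - theta[1] ** 2


theta_T = [0.0, 0.0]                    # stand-in for the trained parameters
mu, upsilon = [1.0, 0.0], [0.0, 1.0]    # stand-ins for the two principal directions

surface = loss_surface(saddle_loss, theta_T, mu, upsilon, steps=5)
for row in surface:
    print(["%+.3f" % v for v in row])
```

Each row of `surface` fixes d1 and sweeps d2; plotting the optimizer's iterates on top of such a grid gives a trajectory picture of the kind described for Fig. 8.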

Fig. 8

Descent trajectory of various optimization methods in two ill-conditioned cases.

It can be clearly seen that naive SGD, NAG, and Adagrad are easily trapped in regions containing a saddle point and cannot escape from them. Adam and LLb-SGD can bypass the saddle point and move toward a local minimum. It is worth noting that LLb-SGD converges faster than Adam, which suggests that it is not necessary to set an individual learning rate for each parameter and that the layer-wise learning method can be more efficient.

6 Conclusion

In this paper, we propose a simple and computationally efficient method for gradient-based optimization of objective functions in deep learning, which accelerates the training process and improves the practical performance of the learned models. To the best of our knowledge, this is the first attempt to introduce an adaptive layer-wise learning schedule with a certain degree of convergence guarantee. Due to its generality and robustness, the method is insensitive to hyper-parameters and can therefore be applied to various network architectures and datasets. Extensive experiments on two image classification benchmarks (CIFAR-10 and ImageNet) and six standard CNNs (AlexNet, VGG-16, GoogLeNet, and ResNet-34/50/152) empirically demonstrate its effectiveness.

In the future, we plan to apply the proposed method to more deep learning based applications, such as object detection and instance segmentation.

Acknowledgment

This work was supported by the National Key R&D Program of China (Grant No. 2018YFC0831503), the National Natural Science Foundation of China (Grant No. 61571275), and the Fundamental Research Funds of Shandong University (Grant No. 2018JC040).

References

1. LeCun, Bengio and Hinton, Deep learning, Nature 521 (2015), 436–444.

2. Segler, Preuss and Waller M.P., Planning chemical syntheses with deep neural networks and symbolic AI, Nature 555 (2018), 604–610.

3. Deng, et al., Deep direct reinforcement learning for financial signal representation and trading, IEEE Trans Neural Netw Learn Syst 28(3) (2017), 653–664.

4. Lee H.S. and Kim, Simultaneous traffic sign detection and boundary estimation using convolutional neural network, IEEE Transactions on Intelligent Transportation Systems 19(5) (2018), 1652–1663.

5. Fauw J.D., et al., Clinically applicable deep learning for diagnosis and referral in retinal disease, Nature Medicine 24 (2018), 1342–1350.

6. Titano, et al., Automated deep-neural-network surveillance of cranial images for acute neurologic events, Nature Medicine 24 (2018), 1337–1341.

7. Park, Kwon and Kim, Neural network-based output feedback control for reference tracking of underactuated surface vessels, Automatica 77 (2017), 353–359.

8. Simonyan and Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint, 2014, https://arxiv.org/abs/1409.1556

9. Szegedy, et al., Going deeper with convolutions, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, Boston, USA, pp. 1–9.

10. Zheng, et al., A bilinear multi-scale convolutional neural network for fine-grained object classification, IAENG International Journal of Computer Science 45(2) (2018), 340–352.

11. Jiang and Chi, A CNN model for semantic person part segmentation with capacity optimization, IEEE Transactions on Image Processing 28(5) (2019), 2465–2478.

12. Zheng and Yang, A video stabilization method based on inter-frame image matching score, Global Journal of Computer Science and Technology 17(1) (2017), 35–40.

13. He, Zhang, Ren and Sun, Deep residual learning for image recognition, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), USA, 2016, pp. 770–778.

14. Chiang, Chen and Hsieh, An agreement under early stopping and fault diagnosis protocol in a cloud computing environment, IEEE Access 6 (2018), 44868–44875.

15. Huang, Liu, Maaten and Weinberger K.Q., Densely connected convolutional networks, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), HI, USA, 2017, pp. 4700–4708.

16. Bedi A.S., Koppel and Rajawat, Asynchronous saddle point algorithm for stochastic optimization in heterogeneous networks, IEEE Transactions on Signal Processing 67(7) (2019), 1742–1757.

17. Yao, Global convergence of CNNs with neutral type delays and D operator, Neural Computing and Applications 29(1) (2016), 105–109.

18. Zhang, Zou and Shi, Dilated convolution neural network with LeakyReLU for environmental sound classification, International Conference on Digital Signal Processing (DSP), London, UK, 2017, pp. 1–5.

19. Ramachandran, Zoph and Le Q.V., Searching for activation functions, arXiv preprint, 2017.

20. Ioffe and Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, International Conference on Machine Learning (ICML), Lille, France, 2015, pp. 448–456.

21. Kingma, et al., Improved variational inference with inverse autoregressive flow, Advances in Neural Information Processing Systems (NIPS), Spain, 2016, pp. 4743–4751.

22. Goodfellow, et al., Maxout networks, International Conference on Machine Learning (ICML), Atlanta, USA, 2013, pp. 1319–1327.

23. Glorot and Bengio, Understanding the difficulty of training deep feedforward neural networks, International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.

24. He, Zhang, Ren and Sun, Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification, IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 2015, pp. 1026–1034.

25. Nesterov, Introductory lectures on convex optimization: A basic course, Springer Science and Business Media 87, 2013.

26. Neyshabur, Salakhutdinov and Srebro, Path-SGD: Path-normalized optimization in deep neural networks, Advances in Neural Information Processing Systems (NIPS), Montreal, Canada, 2015, pp. 2422–2430.

27. Zinkevich, Online convex programming and generalized infinitesimal gradient ascent, International Conference on Machine Learning (ICML), USA, 2003, pp. 928–936.

28. Chatterjee and Chakrabartty, Decentralized global optimization based on a growth transform dynamical system model, IEEE Transactions on Neural Networks and Learning Systems 29(12) (2018), 6052–6061.

29. Torralba, Fergus and Freeman, 80 million tiny images: A large data set for nonparametric object and scene recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 30(11) (2008), 1958–1970.

30. Russakovsky, et al., ImageNet large scale visual recognition challenge, International Journal of Computer Vision 115(3) (2015), 211–252.

31. Lobato, et al., A general framework for constrained Bayesian optimization using information-based search, J Mach Learn Res 17(1) (2015), 5549–5601.

32. Kawaguchi, Deep learning without poor local minima, Advances in Neural Information Processing Systems (NIPS), 2016, Barcelona, Spain, pp. 586–594.

33. Rizvi and Lin, Output feedback Q-learning for discrete-time linear zero-sum games with application to the H-infinity control, Automatica 95 (2018), 213–221.

34. Oyama, et al., Predicting statistics of asynchronous SGD parameters for a large-scale distributed deep learning system on GPU supercomputers, IEEE International Conference on Big Data, 2016, Washington, USA, pp. 66–75.

35. Ma, Bassily and Belkin, The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning, International Conference on Machine Learning (ICML), Stockholm, Sweden, 2018, pp. 3331–3340.

36. Qian, On the momentum term in gradient descent learning algorithms, Neural Netw 12(1) (1999), 145–151.

37. Nesterov, A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2), Doklady ANSSSR 269 (1983), 543–547.

38. Duchi, Hazan and Singer, Adaptive subgradient methods for online learning and stochastic optimization, Journal of Machine Learning Research 12 (2011), 2121–2159.

39. Dean, Patterson and Young, A new golden age in computer architecture: Empowering the machine-learning revolution, IEEE Micro 38(2) (2018), 21–29.

40. Zeiler M.D., Adadelta: An adaptive learning rate method, arXiv preprint, 2012, https://arxiv.org/abs/1212.5701

41. Tieleman and Hinton, Lecture 6.5-RMSProp, Coursera: Neural networks for machine learning, Technical report, 2012.

42. Kingma D.P. and Ba J.L., Adam: A method for stochastic optimization, International Conference on Learning Representations (ICLR), San Diego, USA, 2015, pp. 1–13.

43. Dozat, Incorporating Nesterov momentum into Adam, International Conference on Learning Representations Workshop (ICLRW), Puerto Rico, 2016, pp. 1–6.

44. Dauphin Y.N., Vries and Bengio, Equilibrated adaptive learning rates for non-convex optimization, Advances in Neural Information Processing Systems (NIPS), Montreal, Canada, 2015, pp. 1504–1512.

45. Zhu, et al., Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization, ACM Trans Math Softw 23(4) (1997), 550–560.

46. Loke M.H. and Barker R.D., Rapid least-squares inversion of apparent resistivity pseudosections by a quasi-Newton method, Geophysical Prospecting 44(1) (1996), 131–152.

47. Dean, et al., Large scale distributed deep networks, Conference and Workshop on Neural Information Processing Systems (NIPS), USA, 2012, pp. 1–11.

48. Zhang, Choromanska and LeCun, Deep learning with Elastic Averaging SGD, Advances in Neural Information Processing Systems Conference (NIPS), 2015, Canada, pp. 1–24.

49. Khan, Mutlu and Zhu, How do humans teach: On curriculum learning and teaching dimension, Annual Conference on Neural Information Processing Systems (NIPS), Spain, 2011, pp. 1449–1457.

50. Zhang, Luo, Wei and Wu, In defense of fully connected layers in visual representation transfer, Pacific Rim Conference on Multimedia (PCM), Cham, 2017, pp. 807–817.

51. Xie, et al., Disturblabel: Regularizing CNN on the loss layer, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, USA, 2016, pp. 4753–4762.

52. Menon, et al., Learning from corrupted binary labels via class-probability estimation, International Conference on Machine Learning (ICML), Lille, France, 2015, pp. 125–134.

53. Abadi, et al., TensorFlow: A system for large-scale machine learning, USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.

54. Zheng, et al., Improvement of generalization ability of deep CNN via implicit regularization in two-stage training process, IEEE Access 6 (2018), 15844–15869.

55. Duan, Wei and Huang, Finite-time synchronization of delayed fuzzy cellular neural networks with discontinuous activations, Fuzzy Sets and Systems 361 (2019), 56–70.

56. Yang, Huang and Li, Exponential synchronization control of discontinuous nonautonomous networks and autonomous coupled networks, Complexity, 2018, pp. 1–10.

57. Duan, Huang, Guo and Fang, Periodic attractor for reaction–diffusion high-order Hopfield neural networks with time-varying delays, Computers & Mathematics with Applications 73(2) (2017), 233–245.

58. Huang, et al., Global convergence on asymptotically almost periodic SICNNs with nonlinear decay functions, Neural Processing Letters 49(2) (2019), 625–641.

59. Huang and Liu, New studies on dynamic analysis of inertial neural networks involving non-reduced order method, Neurocomputing 325 (2019), 283–287.

60. Huang and Zhang, Periodicity of non-autonomous inertial neural networks involving proportional delays and non-reduced order method, International Journal of Biomathematics 12(2) (2019), 1950016, 1–13.

61. Rubio J.J., USNFIS: Uniform stable neuro fuzzy inference system, Neurocomputing 262 (2017), 57–66.

62. Giap, Son and Chiclana, Dynamic structural neural network, J Intell Fuzzy Syst 34(4) (2018), 2479–2490.

63. Rubio J.J., SOFMLS: Online self-organizing fuzzy modified least-squares network, IEEE Transactions on Fuzzy Systems 17(6) (2009), 1296–1309.

64. …, Li, Sun and Wang, Assessing information security risk for an evolving smart city based on fuzzy and grey FMEA, J Intell Fuzzy Syst 34(4) (2018), 2491–2501.

65. Rubio, et al., Neural network updating via argument Kalman filter for modeling of Takagi-Sugeno fuzzy models, J Intell Fuzzy Syst 35(2) (2018), 2585–2596.

66. Soares, et al., Pyramidal neural networks with evolved variable receptive fields, Neural Computing and Applications 29(12) (2018), 1443–1453.

67. Mokhtari, Koppel and Ribeiro, Doubly random parallel stochastic methods for large scale learning, American Control Conference (ACC), USA, 2016, pp. 4847–4852.

68. Koppel, Mokhtari and Ribeiro, Parallel stochastic successive convex approximation method for large-scale dictionary learning, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Alberta, Canada, 2018, pp. 2771–2775.

69. Zheng, Yang, Zhang and Yang, Understanding and boosting of deep convolutional neural network based on sample distribution, IEEE Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), Chengdu, China, 2017, pp. 823–827.

70. Zheng, Tian, Yang and Wang, Differential learning: A powerful tool for interactive content-based image retrieval, Engineering Letters 27(1) (2019), 202–215.

Table 2

Hyper-parameters setting in the training of deep CNN on two datasets

Datasets	Learning rate (for all layers)	Batch size	Dropout rate	Weight decay	Threshold
CIFAR-10	0.05	128	0.5	0.0001	0.1
ImageNet	0.1	256	0.5	0.0005	0.2