Learning rate burst for superior SGDM and AdamW integration

Abstract

Current mainstream deep learning optimization algorithms fall into two categories: non-adaptive algorithms, such as Stochastic Gradient Descent with Momentum (SGDM), and adaptive algorithms, such as Adaptive Moment Estimation with Weight Decay (AdamW). For many deep neural network models, adaptive algorithms typically enable faster initial training, whereas non-adaptive algorithms often yield better final convergence. Our proposed Adaptive Learning Rate Burst (Adaburst) algorithm seeks to combine the strengths of both categories. Its update rule sums the AdamW and SGDM updates at every step, so no mid-training switch between the two is needed. When training encounters an update bottleneck, Adaburst restarts a cosine schedule on the learning rate of the SGDM component, an operation we call a learning rate burst. This helps the model escape the current local optimum more effectively.

Experiments on image classification and generation tasks show that Adaburst converges faster and reaches higher accuracy than the alternatives tested. On the MNIST, CIFAR-10, and CIFAR-100 datasets, Adaburst attained accuracies that matched or exceeded those achieved by SGDM. In training diffusion models on the DeepFashion dataset, Adaburst converged in fewer epochs than a carefully tuned AdamW optimizer while avoiding abrupt blurring or other training instabilities. Relative to the better of the two baselines, Adaburst improved the final test accuracy on MNIST, CIFAR-10, and CIFAR-100 by 0.02, 0.41, and 4.18 percentage points, respectively. In addition, the generative model trained on the DeepFashion dataset showed a 4.62-point improvement in the Fréchet Inception Distance (FID), a metric for assessing generative model quality. These results suggest that combining simultaneous AdamW and SGDM updates with a learning rate burst mechanism improves both the training speed and the convergence accuracy of deep neural networks.

Keywords

Convolutional neural networks (CNNs), MNIST, CIFAR, deep learning, optimization algorithms, person image generation, diffusion models

1 Introduction

Advancements in hardware technology and deep neural networks have led to the creation of increasingly complex and large-scale models. These models demand more training steps and substantial computational resources. A key element in this process is the optimization algorithm, which plays a crucial role in determining the convergence speed and final accuracy of the models. This is in addition to the quality of the dataset and the design of the model itself.

Current optimization algorithms [1–6] for deep neural networks [7–16], such as non-adaptive algorithms like Stochastic Gradient Descent (SGD) [17, 18] and adaptive algorithms like Adam [19], are fundamental to the training process. However, they often face a trade-off: adaptive algorithms typically enable faster convergence, whereas non-adaptive algorithms tend to ensure greater convergence precision. This trade-off poses a dilemma between training efficiency and effectiveness, especially in mainstream deep neural networks and generative models.

In practical training scenarios, especially for generative models, where training ends after a predefined number of steps rather than at a final convergence point, both the quick convergence of adaptive optimization algorithms and the high convergence precision of non-adaptive optimization algorithms are vital. Consequently, a faster rate of convergence is imperative for any optimizer used under a fixed step budget.

The Adaburst algorithm, introduced in this paper, effectively addresses these challenges. It uniquely combines the updates of adaptive and non-adaptive algorithms, ensuring continuity in the update process. This integration achieves rapid convergence with precision. While newer versions of adaptive optimization algorithms like AdaDB [20] are available, mainstream deep neural networks—including discriminative models like ResNet [21] and Vision Transformer (ViT) [22], as well as generative models like VQGAN [23] and Stable Diffusion [24]—still predominantly use AdamW as their preferred training optimization algorithm. Hence, Adaburst integrates AdamW [25] and SGDM as its foundational algorithms. Additionally, drawing inspiration from the simulated annealing technique, Adaburst innovatively increases the learning rate during updates, aiding in overcoming local optima, a common issue in existing methods.

Existing research has explored automatic switching and combining the benefits of both types of algorithms through optimization techniques. Notable examples include SWATS [26] and ADABOUND [27]. SWATS switches to SGDM upon reaching a specific threshold, whereas ADABOUND progressively restricts the maximum updates. In comparison, the Adaburst algorithm features a continuous update process, circumventing the non-convergence issues associated with the SWATS transition. This instability during switching, particularly in large-scale neural networks such as diffusion models [28], can result in suboptimal outcomes. The learning rate burst mechanism in Adaburst specifically addresses the tendency of algorithms like ADABOUND to become trapped in local optima. The paper’s contributions are outlined as follows:

1. By overlaying the updates of AdamW and SGDM, the algorithm benefits from the speed of adaptive optimization and the generalization ability and convergence precision of non-adaptive optimization.
2. Because the updates of SGDM and AdamW are stacked throughout the update process, the non-convergence issue caused by switching algorithms midway is avoided.
3. Inspired by the simulated annealing algorithm, a method that increases the learning rate during updates is designed, enabling the model to achieve superior generalization capabilities and final accuracy.

This paper is organized as follows: The Related Work section introduces the two foundational algorithms of Adaburst, AdamW and SGDM, detailing their specific algorithmic processes. The Model Architecture section provides details of the learning rate burst method and specific update steps of the Adaburst algorithm. The Experiments section presents the specific hyperparameter settings and tests Adaburst on the MNIST, CIFAR-10, CIFAR-100, and DeepFashion datasets. The Limitations section discusses the remaining flaws of the algorithm.
2 Related work

This section delves into adaptive and non-adaptive optimization algorithms, particularly highlighting AdamW and SGDM, which form the foundation of Adaburst proposed in this paper through their integration.

SGDM (Stochastic Gradient Descent with Momentum) is a widely used algorithm in deep neural network training. Initially, gradient descent (GD) was the basic non-adaptive, gradient-based optimization method. However, GD had significant drawbacks, notably its inefficiency. In GD, all parameters are adjusted in each iteration based on the gradient, making the training process highly susceptible to local optima. Furthermore, its back-propagation process during training can be markedly slow. To address these issues, two advanced methods were developed: Batch Gradient Descent (BGD) and Stochastic Gradient Descent (SGD). BGD updates parameters after processing the entire training set once. However, this method has become impractical with the increasing size of training datasets. In contrast, SGD divides the training set into smaller batches and updates parameters after each batch, improving training efficiency. Subsequent enhancements, including the addition of momentum and weight decay, have evolved SGD into its current form, SGDM. This evolution reflects a continual effort to overcome the limitations of traditional optimization methods and enhance model training effectiveness.

Algorithm 1 SGDM

Require: lr: learning rate of the optimizer

Require: M ∈ [0, 1]: degree of momentum

Require: D: degree of weight decay

Require: f (θ): model with parameters θ

Require: θ₀: initial parameters vector

1: m₀ ← 0 ⊳ initialize moment vector

2: t ← 0 ⊳ initialize time step

3: while θ_t not converged do

4: t ← t + 1

5: $θ_{t} \leftarrow θ_{t-1} - θ_{t-1} \cdot D$ ⊳ weight decay

6: $g_{t} \leftarrow \nabla_{θ} f_{t}(θ_{t-1})$ ⊳ get gradients

7: $m_{t} \leftarrow M \cdot m_{t-1} + g_{t}$

8: $θ_{t} \leftarrow θ_{t} - lr \cdot m_{t}$ ⊳ update parameters

9: end while

10: return θ_t
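The loop body of Algorithm 1 can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in the experiments: lists of floats stand in for parameter tensors, and the name `sgdm_step` and its default values are assumptions introduced here.

```python
def sgdm_step(theta, grad, m, lr=0.02, momentum=0.9, decay=0.0):
    """One iteration of Algorithm 1 on lists of floats.

    `grad` must be evaluated at the parameters *before* weight decay,
    as in step 6 of the algorithm.
    """
    # Step 5: weight decay shrinks every parameter toward zero.
    theta = [p - p * decay for p in theta]
    # Step 7: momentum accumulates the raw gradient (no (1 - M) factor).
    m = [momentum * mi + g for mi, g in zip(m, grad)]
    # Step 8: move against the accumulated momentum.
    theta = [p - lr * mi for p, mi in zip(theta, m)]
    return theta, m


# Usage: two steps on f(theta) = theta^2, whose gradient is 2 * theta.
theta, m = [1.0], [0.0]
for _ in range(2):
    grad = [2.0 * p for p in theta]
    theta, m = sgdm_step(theta, grad, m, lr=0.1)
```

Passing the gradient in as an argument keeps the sketch independent of any particular model or autograd library.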

Most adaptive optimization algorithms from the early stage of development are based on smoothing the gradient and the squared gradient. Adagrad, which can be considered the starting point of this line, accumulates the squared gradient. RMSprop changed this accumulation into an exponential moving average, so that the update step no longer decreases continuously as training proceeds. Adam combined SGDM with RMSprop, adding the accumulation of gradients and smoothing both the gradients and the squared gradients.
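The difference between the Adagrad and RMSprop accumulators can be made concrete with a scalar sketch. The function names, the decay factor `alpha = 0.99`, and the constant gradient are illustrative assumptions, not values from this paper.

```python
import math


def adagrad_state(state, grad):
    # Adagrad: running *sum* of squared gradients, so it never shrinks.
    return state + grad * grad


def rmsprop_state(state, grad, alpha=0.99):
    # RMSprop: exponential moving average of squared gradients.
    return alpha * state + (1.0 - alpha) * grad * grad


def effective_step(lr, state, eps=1e-8):
    # Both methods scale the learning rate by 1 / sqrt(accumulator).
    return lr / (math.sqrt(state) + eps)


ada, rms = 0.0, 0.0
for _ in range(100):          # feed a constant gradient of 1.0
    ada = adagrad_state(ada, 1.0)
    rms = rmsprop_state(rms, 1.0)

# Adagrad's step keeps shrinking with the step count; RMSprop's levels off.
ada_step = effective_step(0.01, ada)
rms_step = effective_step(0.01, rms)
```

With a constant gradient, the Adagrad accumulator grows linearly with the number of steps, while the RMSprop accumulator stays bounded by the squared gradient, which is why its step size does not decay toward zero.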

AdamW has a significant advantage over SGDM in training speed and differs little from SGDM in final accuracy. Its authors found that the L2 regularization [29] commonly used to prevent model divergence does not play the same role in Adam as it does in SGDM, and that replacing it with a functionally similar, decoupled weight decay [30] significantly improves the algorithm. AMSGrad is another variant, designed to correct the possible increase of the effective learning rate in Adam, but its benefit has rarely been demonstrated in practical applications. Thus, AdamW remains the mainstream choice for researchers.

Algorithm 2 AdamW

Require: lr: learning rate of the optimizer

Require: (β₁, β₂) ∈ (0, 1]: degree of memory for past updates

Require: D: degree of weight decay

Require: f (θ): model with parameters θ

Require: θ₀: initial parameters vector

1: m₀ ← 0 ⊳ initialize 1st moment vector

2: v₀ ← 0 ⊳ initialize 2nd moment vector

3: t ← 0 ⊳ initialize time step

4: ϵ ← 0.0001 ⊳ prevent a zero denominator

5: while θ_t not converged do

6: t ← t + 1

7: $θ_{t} \leftarrow θ_{t-1} - θ_{t-1} \cdot D$ ⊳ weight decay

8: $g_{t} \leftarrow \nabla_{θ} f_{t}(θ_{t-1})$ ⊳ get gradients

9: $m_{t} \leftarrow β_{1} \cdot m_{t-1} + g_{t}$ ⊳ 1st moment

10: $v_{t} \leftarrow β_{2} \cdot v_{t-1} + g_{t}^{2}$ ⊳ 2nd moment

11: $\hat{m}_{t} \leftarrow \frac{m_{t}}{1 - β_{1}^{t}}$

12: $\hat{v}_{t} \leftarrow \frac{v_{t}}{1 - β_{2}^{t}}$ ⊳ adapted lr

13: $θ_{t} \leftarrow θ_{t} - lr \cdot \frac{\hat{m}_{t}}{\sqrt{\hat{v}_{t} + ϵ}}$ ⊳ update parameters

14: end while

15: return θ_t
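As a hedged illustration of Algorithm 2, the sketch below performs one iteration on lists of floats. It assumes that the bias-corrected estimates feed the parameter update and that the second-moment correction divides by 1 - β₂^t (matching Algorithm 4). The name `adamw_step` and the defaults are introduced here for illustration.

```python
import math


def adamw_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999,
               decay=0.0, eps=1e-4):
    """One iteration of Algorithm 2; returns the new (theta, m, v, t)."""
    t += 1
    theta = [p - p * decay for p in theta]                    # weight decay
    m = [beta1 * mi + g for mi, g in zip(m, grad)]            # 1st moment
    v = [beta2 * vi + g * g for vi, g in zip(v, grad)]        # 2nd moment
    m_hat = [mi / (1.0 - beta1 ** t) for mi in m]             # bias correction
    v_hat = [vi / (1.0 - beta2 ** t) for vi in v]
    # Assumption: the update uses the corrected moments, with eps inside
    # the square root as written in the algorithm.
    theta = [p - lr * mh / math.sqrt(vh + eps)
             for p, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v, t


# Usage: one step on f(theta) = theta^2 starting from theta = 1.
theta, m, v, t = adamw_step([1.0], [2.0], [0.0], [0.0], 0)
```

Note that, as written in the algorithm, the moments accumulate the raw gradient without a (1 - β) factor, so this sketch is not numerically identical to library implementations of AdamW.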

The approach of integrating adaptive and non-adaptive optimization algorithms is exemplified by SWATS [26]. The fundamental concept of SWATS is to start training with the adaptive learning rate of AdamW and then transition to Stochastic Gradient Descent with Momentum (SGDM) once a predefined criterion is met.

The criterion for transitioning in SWATS is specified as follows:

$\left| \frac{λ_{k}}{1 - β_{2}^{k}} - γ_{k} \right| < ϵ$ (1)

where:

– $λ_{k}$ is the dynamically estimated learning rate for SGD, updated as $λ_{k} = β_{2} λ_{k-1} + (1 - β_{2}) γ_{k}$.

– $γ_{k}$ measures the difference between the update direction $p_{k}$ and the current gradient $g_{k}$, computed as $γ_{k} = \frac{p_{k}^{T} p_{k}}{p_{k}^{T} g_{k}}$.

– β₂ is the decay factor for the second moment estimation in AdamW and k is the iteration number.

– ϵ is a small positive threshold.

Once the switching condition is satisfied, SWATS transitions to SGDM with the learning rate Λ set as:

$Λ = \frac{λ_{k}}{1 - β_{2}^{k}}$ (2)

It is worth noting that when SWATS switches, the first and second moments accumulated during the previous AdamW iterations are completely discarded, which might lead to non-convergence issues.
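The switching test of Eq. (1) and the learning rate of Eq. (2) can be sketched as follows. The function takes the per-step estimate γ_k as an input rather than computing it, so no sign convention for p_k is assumed. The name `swats_check`, the default threshold, and the guard `k > 1` are assumptions of this sketch: at k = 1 the bias-corrected average equals γ₁ by construction, so the test would otherwise pass trivially.

```python
def swats_check(lam_prev, gamma, k, beta2=0.999, eps=1e-9):
    """Update the running average of gamma and test the criterion of Eq. (1).

    Returns (lam, corrected, switch), where `corrected` is the
    bias-corrected average that Eq. (2) uses as the SGDM learning rate.
    """
    lam = beta2 * lam_prev + (1.0 - beta2) * gamma    # running average
    corrected = lam / (1.0 - beta2 ** k)              # bias correction
    switch = k > 1 and abs(corrected - gamma) < eps   # Eq. (1) with guard
    return lam, corrected, switch


# Usage: a gamma sequence that settles at 0.05 triggers the switch.
lam = 0.0
for k, gamma in enumerate([0.05, 0.05, 0.05], start=1):
    lam, sgd_lr, switch = swats_check(lam, gamma, k)
    if switch:
        break
```

When `switch` becomes true, `sgd_lr` is the value Λ of Eq. (2); the AdamW moment buffers are not carried over, which is the discontinuity discussed above.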

3 Model architecture

Previous experiments have shown that using SWATS [26] to switch between AdamW and SGDM when training large-scale models, such as diffusion models, often fails to converge. Additionally, when training discriminative models, executing the switch too late in training fails to yield significant improvements. These observations lead to two hypotheses: the divergence in update paths between SGDM and AdamW renders mid-training switching impractical, and AdamW’s limited ability to escape local optima hampers further optimization in the later stages of training.

The analysis underscores the necessity for an algorithm that not only concurrently updates both SGDM and AdamW but also possesses the capability to escape local optima towards the end of the training process. The proposed Adaburst algorithm addresses this requirement. Adaburst calculates and merges updates from SGDM and AdamW through summation during its update cycle.

Algorithm 3 Lrburst

Require: η_max: maximum learning rate

Require: η_min: minimum learning rate

Require: T_max: maximum of burst steps

Require: E: current epoch

Require: ς: flag of burst

1: T_cur ⊳ current step

2: φ ⊳ final result

3: if ς == True then

4: T_cur ← 1

5: end if

6: $φ \leftarrow η_{\min} + \frac{1}{2} (η_{\max} - η_{\min}) \left(1 + \cos\left(\frac{T_{cur}}{T_{\max}} π\right)\right)$

7: T_cur ← T_cur + 1

8: return φ

Algorithm 4 Adaburst

Require: lr: learning rate of the AdamW

Require: (β₁, β₂) ∈ (0, 1]: degree of memory for past updates

Require: D: degree of weight decay

Require: f (θ): module with parameters θ

Require: θ₀: initial parameters vector

Require: Lrburst (): learning rate burst algorithm

1: m₀ ← 0 ⊳ initialize 1st moment vector

2: v₀ ← 0 ⊳ initialize 2nd moment vector

3: t ← 0 ⊳ initialize time step

4: A ← 0 ⊳ initialize AdamW’s update

5: S ← 0 ⊳ initialize SGDM’s update

6: ϵ ← 0.0001 ⊳ prevent a zero denominator

7: while θ_t not converged do

8: t ← t + 1

9: $θ_{t} \leftarrow θ_{t-1} - θ_{t-1} \cdot D$ ⊳ weight decay

10: $g_{t} \leftarrow \nabla_{θ} f_{t}(θ_{t-1})$ ⊳ get gradients

11: $m_{t} \leftarrow β_{1} \cdot m_{t-1} + g_{t}$ ⊳ 1st moment

12: $v_{t} \leftarrow β_{2} \cdot v_{t-1} + g_{t}^{2}$ ⊳ 2nd moment

13: $\hat{m}_{t} \leftarrow \frac{m_{t}}{1 - β_{1}^{t}}$

14: $\hat{v}_{t} \leftarrow \frac{v_{t}}{1 - β_{2}^{t}}$

15: $A \leftarrow \frac{\hat{m}_{t}}{\sqrt{\hat{v}_{t} + ϵ}}$ ⊳ AdamW’s update

16: lr₂ ← lr · Lrburst ()

17: S ← m_t ⊳ SGDM’s update

18: θ_t ← θ_t - lr · A - lr₂ · S ⊳ update parameters

19: end while

20: return θ_t

In our method, a burst mechanism raises the learning rate of the SGDM component when a chosen model metric reaches a plateau. The approach is inspired by simulated annealing and is built on a cosine annealing schedule. For the classification tasks, the metric is the accuracy on the training set. If the accuracy has not improved during recent epochs, the burst flag ς is set and the cosine schedule, which otherwise decays the learning rate from η_max toward η_min over T_max steps, is restarted from its first step. The choice of metric is adaptable and may differ according to the requirements of the task. The details of this learning rate scheduling method are outlined in Algorithm 3.
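The schedule of Algorithm 3 can be sketched as a small stateful object. Two details are assumptions of this sketch rather than statements of the algorithm: the step counter starts at 1, and it is clamped at T_max so that the cosine does not rise again on its own once the decay has finished. The class name `LrBurst` is introduced here for illustration.

```python
import math


class LrBurst:
    """Cosine decay from eta_max toward eta_min, restartable by a burst flag."""

    def __init__(self, eta_max, eta_min, t_max):
        self.eta_max = eta_max
        self.eta_min = eta_min
        self.t_max = t_max
        self.t_cur = 1          # assumption: the schedule starts at step 1

    def __call__(self, burst=False):
        if burst:
            self.t_cur = 1      # burst: restart the cosine schedule
        # Assumption: hold the final value once t_max has been reached.
        step = min(self.t_cur, self.t_max)
        phi = self.eta_min + 0.5 * (self.eta_max - self.eta_min) * (
            1.0 + math.cos(math.pi * step / self.t_max))
        self.t_cur += 1
        return phi


# Usage: decay for four steps, then trigger a burst.
schedule = LrBurst(eta_max=1.0, eta_min=0.0, t_max=4)
values = [schedule() for _ in range(4)]
after_burst = schedule(burst=True)
```

In this sketch the value returned after a burst equals the value of the very first call, which is the jump in learning rate that the text describes.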

Integrating the aforementioned Lrburst algorithm to merge AdamW and SGDM, we obtain the proposed Adaburst algorithm, as shown in Algorithm 4. Adaburst reuses the first-order moment accumulation of AdamW, avoiding resource wastage due to redundant calculations.
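One iteration of the combined update in Algorithm 4 might look as follows. This is a sketch under stated assumptions: the AdamW part uses the bias-corrected moments, the learning rate applied to the AdamW part is `lr`, and the value returned by Lrburst is passed in as `burst_factor`. The function name `adaburst_step` is hypothetical.

```python
import math


def adaburst_step(theta, grad, m, v, t, burst_factor, lr=0.001,
                  beta1=0.9, beta2=0.999, decay=0.0, eps=1e-4):
    """One iteration of Algorithm 4 on lists of floats."""
    t += 1
    theta = [p - p * decay for p in theta]                 # weight decay
    m = [beta1 * mi + g for mi, g in zip(m, grad)]         # shared 1st moment
    v = [beta2 * vi + g * g for vi, g in zip(v, grad)]     # 2nd moment
    m_hat = [mi / (1.0 - beta1 ** t) for mi in m]
    v_hat = [vi / (1.0 - beta2 ** t) for vi in v]
    # A: AdamW-style update (assumes the corrected moments are used).
    a = [mh / math.sqrt(vh + eps) for mh, vh in zip(m_hat, v_hat)]
    # S: SGDM update, reusing the first moment instead of a second buffer.
    s = m
    lr2 = lr * burst_factor                                # burst-scaled rate
    theta = [p - lr * ai - lr2 * si for p, ai, si in zip(theta, a, s)]
    return theta, m, v, t


# Usage: a single step with the burst schedule currently at 0.5.
theta, m, v, t = adaburst_step([1.0], [2.0], [0.0], [0.0], 0, burst_factor=0.5)
```

Because `s` is the same list as `m`, the SGDM component adds no optimizer state beyond what AdamW already stores.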

4 Experiments

In this section, the effectiveness of the proposed Adaburst optimization algorithm is evaluated across a range of datasets. First, a dimensionality reduction analysis examines the update quantities during training. The algorithm’s performance is then assessed by training Convolutional Neural Networks (CNNs) on the MNIST and CIFAR-10 datasets and comparing the results against the AdamW and SGDM algorithms. Analogous comparative experiments are then conducted on the CIFAR-100 dataset with the same algorithms; for these, the more advanced Vision Transformer (ViT) [22] model is used to gauge the algorithm’s efficacy in a different context. In addition to tests on discriminative models, Adaburst is further evaluated using complex generative diffusion models on the DeepFashion dataset.

The hyperparameters for the Adaburst algorithm, which hold practical implications similar to those for AdamW and SGDM, are deliberately aligned with the latter to underscore Adaburst’s unique attributes. The hyperparameters employed in these experiments are detailed in Table 1. These hyperparameters are chosen following the conventional settings of the PyTorch deep learning framework, as cited in [31]. Furthermore, Table 2 delineates the specific conditions under which learning rate burst is triggered for various tasks.

Table 1
Hyperparameters

Algorithm	SGDM	AdamW	Adaburst
β₁ (m)	0.9	0.9	0.9
β₂	-	0.999	0.999
lr	0.02	0.001	0.001
D	0.5	0.5	0.5

Table 2

Timing to trigger lr burst

Dataset	Flag	Threshold
MNIST	Train accuracy	0.2%
CIFAR-10	Train accuracy	1%
CIFAR-100	Train accuracy	1%
DeepFashion	Train loss	1
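How a threshold from Table 2 turns into a burst flag is not spelled out in detail, so the following is only one plausible reading: the flag is raised when the monitored metric improved by less than the threshold between the last two epochs. The function name `burst_flag` and the one-epoch comparison window are assumptions of this sketch.

```python
def burst_flag(history, threshold, higher_is_better=True):
    """Return True when the latest epoch improved the metric by < threshold.

    `history` holds one metric value per epoch (accuracy in percent for the
    classification tasks, loss for DeepFashion).
    """
    if len(history) < 2:
        return False                      # nothing to compare against yet
    gain = history[-1] - history[-2]
    if not higher_is_better:
        gain = -gain                      # for a loss, a decrease is a gain
    return gain < threshold


# Usage with the CIFAR-10 row of Table 2 (train accuracy, threshold 1%).
flag = burst_flag([92.0, 92.5], threshold=1.0)
```

For the DeepFashion row, the same function would be called with `higher_is_better=False` and the training loss as the metric.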

Fig. 1

Dimensionality reduction analysis.

4.1 Dimensionality reduction analysis

Taking the experiment on the CIFAR-10 dataset as an example, a dimensionality reduction is conducted. The analysis of the training process on the CIFAR-10 dataset involves collecting the parameter update quantities for all convolutional layers. The quantities of different layers in the same epoch are then averaged, with the results presented in Fig. 1. The horizontal axis in this figure denotes the number of epochs, while the vertical axis represents the updated amount. It is revealed that during the initial phase, the Adaburst algorithm exhibits a wider search range. Furthermore, it effectively mitigates oscillations post-convergence in the later stages of training.
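The reduction described above can be sketched as follows. The exact definition of the update quantity is not given, so this sketch assumes the mean absolute parameter change per layer, averaged over layers with equal weight. The function name and the layer names are hypothetical.

```python
def mean_update_amount(before, after):
    """Average, over layers, of the mean absolute parameter change.

    `before` and `after` map a layer name to a flat list of its parameters,
    captured at the start and end of one epoch.
    """
    per_layer = []
    for name in before:
        diffs = [abs(a - b) for a, b in zip(after[name], before[name])]
        per_layer.append(sum(diffs) / len(diffs))     # one scalar per layer
    return sum(per_layer) / len(per_layer)            # one scalar per epoch


# Usage: two toy "convolutional layers" with two and one parameters.
before = {"conv1": [0.0, 0.0], "conv2": [1.0]}
after = {"conv1": [0.1, -0.3], "conv2": [1.2]}
amount = mean_update_amount(before, after)
```

Plotting one such scalar per epoch yields a curve of the kind shown in Fig. 1.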

4.2 MNIST

This section presents a comparative analysis of the Adaburst algorithm against the well-established SGDM and AdamW. The experiments are conducted on ResNet [21], a widely used CNN model for image processing tasks, using the MNIST dataset [32]. The dataset consists of 70,000 handwritten digit images, divided into 60,000 training images and 10,000 testing images. As an image classification dataset, MNIST comprises only grayscale images, which reduces the input dimensionality. The experiments span 200 epochs.

The experimental procedure for each epoch is structured as follows. First, the dataset is segmented into mini-batches of 128 samples each. Second, after the training for each epoch is completed, the resulting CNN model is evaluated on the testing set. It is important to note that no training method correlating precision increase with update necessity was utilized.

In this experimental study, the performance of the proposed Adaburst algorithm is juxtaposed with that of SGDM and AdamW. Additionally, RMSProp and Adam were initially included in the experiments. However, their performance markedly lagged behind the other two algorithms, leading to their exclusion to maintain clarity in the visual representation of results in figures and tables.

The comparative results are depicted in Fig. 2. For a more direct assessment of training speed and final convergence accuracy, refer to Table 3. The analysis reveals that on simpler datasets like MNIST, Adaburst is more stable than SGDM and AdamW in the final stages of convergence. Notably, surpassing the training set accuracy of SGDM suggests that the model retains its generalization capabilities and avoids entrapment in local optima.

Table 3
Experimental data on MNIST

Algorithm	Final test acc	First epoch at which train acc reaches 99%
AdamW	99.06%	19
SGDM	99.11%	11
Adaburst	99.13%	15

Fig. 2

Comparison experiments on MNIST.

4.3 CIFAR-10

This segment highlights the performance of the proposed Adaburst algorithm compared to other optimization algorithms, utilizing the ResNet model. The CIFAR-10 dataset, employed for this evaluation, consists of 60,000 images (32 × 32 RGB), divided into 50,000 for training and 10,000 for testing, across 10 classes with 6,000 images each [33].

The ResNet model underwent training over 200 epochs, with its performance assessed against the testing set after each epoch. The differences between the three optimization algorithms, including Adaburst, are illustrated in Fig. 3, which depicts results from the training process. Notably, Adaburst exhibited an enhanced rate of accuracy improvement for the training set at the onset of training, as well as a higher final accuracy at the end of the training period, as shown in Table 4. The results indicate that Adaburst’s more aggressive initial update strategy led to quicker initial training speeds. This approach also facilitated reaching a more effective extremum, thereby improving both training and testing accuracy.

Table 4
Experimental data on CIFAR-10

Algorithm	Final test acc	First epoch at which train acc reaches 90%
AdamW	93.55%	12
SGDM	92.33%	13
Adaburst	93.96%	11

Fig. 3

Comparison experiments on CIFAR-10.

4.4 CIFAR-100

This segment outlines a comparative analysis of our Adaburst algorithm against SGDM and AdamW, using the challenging CIFAR-100 dataset. CIFAR-100 comprises 50,000 training and 10,000 testing images, distributed across 100 classes. Each class contains 600 RGB images of size 32 × 32. Compared to the previously discussed datasets, CIFAR-100 poses a greater challenge because it has more classes without a corresponding increase in sample size, resulting in fewer samples per class [34].

To address this challenge, the ViT-Tiny model was utilized. This model is a scaled-down version of the standard Vision Transformer (ViT) with a parameter count comparable to ResNet-34. ViT-Tiny, built entirely on the Transformer architecture without convolutional operations, relies solely on attention mechanisms. This design offers improved stability and proves particularly effective for classification tasks like CIFAR-100, where each class has a limited number of samples.

The performance disparities among the optimization algorithms are more pronounced in the CIFAR-100 dataset, as shown in Fig. 4. These findings suggest that Adaburst surpasses SGDM and AdamW, especially as dataset complexity increases. Table 5 presents specific evaluation metrics for this experiment. The advantages of Adaburst were particularly notable in this context. With the CIFAR-100 dataset’s increased complexity and the model’s sophisticated structure, the optimization process encounters more challenges, such as a greater number of local optima due to more categories with fewer images per category. While AdamW led in training set accuracy over SGDM, it demonstrated poorer performance on the test set, indicating susceptibility to local optima and overfitting. In contrast, Adaburst effectively circumvented these issues, achieving superior final convergence accuracy.

Table 5
Experimental data on CIFAR-100

Algorithm	Final test acc	First epoch at which train acc reaches 80%
AdamW	67.50%	22
SGDM	71.21%	25
Adaburst	75.39%	17

Fig. 4

Comparison experiments on CIFAR-100.

Fig. 5

Experiments on DeepFashion.

Fig. 6

Changes in FID During the Training Process with Different Optimization Algorithms.

4.5 DeepFashion

The DeepFashion dataset [35] is a comprehensive collection of images featuring individuals in various clothing styles, making it a valuable resource for research in human image generation. It includes a total of 289,222 images, categorized into 50 clothing categories and 1,000 clothing attributes. The unique aspect of this dataset is the presence of images of the same individual in different styles, which is particularly useful for generation tasks.

Diffusion models [28], which represent the latest advancements in image generation, transform the conventional one-step generation process into a multi-step denoising procedure. This approach significantly enhances the quality of generated images compared with GANs [36] and VAEs [37]. Utilizing these diffusion model principles, human image generation is conducted on the DeepFashion dataset. This application serves to demonstrate the stability of our algorithm when applied to larger models [38].

The experimental results are depicted in Fig. 5. Given the distinctive characteristics of generative models, selected epochs must be evaluated to assess the effectiveness of the training accurately. The Fréchet Inception Distance (FID) [39], a standard image quality assessment metric, is used for this purpose, where a lower FID value indicates higher image quality. The outcomes are shown in Fig. 6.
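For reference, the standard definition of the FID between real and generated images is given below. The symbols are notation introduced here and do not appear elsewhere in this paper: μ_r and Σ_r denote the mean and covariance of Inception features of real images, and μ_g and Σ_g those of generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \mathrm{Tr}\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

Both terms are zero when the two feature distributions have identical means and covariances, which is why a lower value indicates generated images that are statistically closer to the real ones.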

In comparison with the other two algorithms, Adaburst shows a significant improvement in training speed. By the 125th epoch, the model achieves high-quality generative effects. This acceleration means that in experiments conducted on four RTX 4090 graphics cards, the training duration can be reduced by approximately 30 hours from a total of 120 hours.

4.6 Ablation study

Experiments using the SWATS [26] training methodology were first performed on diffusion models to assess the feasibility of direct switching. The learning rate and other hyperparameters were taken from values commonly used in SGDM-related studies. The timing of the switch followed the schedule specified in the original paper, using the variant that switches between AdamW and SGDM. The outcomes, depicted in Fig. 7, reveal pronounced blurring, which supports the impracticality of direct switching between these optimization algorithms when training diffusion models.

Fig. 7

Experiments of SWATS Algorithm on DeepFashion.

Table 6

Ablation study of Adaburst

	MNIST	CIFAR-10	CIFAR-100
B1	99.03%	92.98%	72.35%
B2	99.12%	93.43%	73.95%
B3	99.07%	93.72%	68.75%
Ours	99.13%	93.96%	75.39%

Additionally, to further validate the effectiveness of the algorithm’s components, ablation studies were conducted. These studies focused on two key elements: the learning rate burst and the integration of SGDM with AdamW. For comparison, three baseline groups were established (B1 to B3 in Table 6): (1) integration of SGDM and AdamW only; (2) combination of SGDM with learning rate burst; (3) combination of AdamW with learning rate burst. All experiments used the hyperparameters detailed in previous sections and ran for 200 epochs. The evaluation criterion was the final test set accuracy at convergence. The results, presented in Table 6, show the individual contribution of each component to the algorithm’s performance. The combination of AdamW and the learning rate burst mechanism does not produce a significant effect, confirming the rationale of using this mechanism exclusively with SGDM.

5 Limitations

Adaburst has achieved better results than SGDM and AdamW across multiple datasets, but it still has some shortcomings. One notable drawback is the complexity of the hyperparameter setting. Adaburst necessitates not only the configuration of standard hyperparameters but also the determination of appropriate timing for the learning rate burst trigger. This additional requirement complicates the process of hyperparameter tuning, presenting a challenge in its practical application.

Furthermore, the learning rate burst mechanism in Adaburst encounters inefficiencies in specific contexts. This inefficiency arises primarily because the initial phases of elevated learning rates can lead to suboptimal training outcomes. Terminating training during these phases could potentially diminish the model’s final accuracy.

6 Conclusion

The manuscript concentrates on the evaluation of optimization algorithms, specifically AdamW and SGDM, aiming to improve both convergence speed and final accuracy. It presents computational comparisons of these algorithms, demonstrating accelerated training speeds in the initial phases. A new technique, termed ‘learning rate burst,’ draws inspiration from simulated annealing. This approach is designed to overcome the challenge of stagnation in updates following a switch between algorithms. The paper introduces ‘Adaburst,’ an algorithm developed by amalgamating these strategies.

Moreover, by reutilizing AdamW’s parameters for a combined update with SGDM, Adaburst avoids the need for additional storage space. This integration entails no extra computations beyond those in AdamW, except for the incorporation of weight planning for both algorithms. Consequently, there is no escalation in space complexity, and the rise in time complexity remains negligible.

Finally, Adaburst is compared with SGDM and AdamW across MNIST, CIFAR-10, and CIFAR-100 datasets using ResNet and ViT architectures. Results suggest Adaburst’s superiority over SGDM and AdamW in terms of both convergence speed and final accuracy. Further experiments with the DeepFashion dataset using diffusion models validate Adaburst’s efficacy in more complex scenarios.

The Adaburst algorithm, which incorporates a fundamental learning rate burst mechanism, effectively combines elements of two distinct algorithms. This feature unveils the potential for performance enhancement via algorithmic substitutions, underscoring the algorithm’s versatility. Moreover, the simplicity of its learning rate burst mechanism, along with its easily definable activation conditions, offers promising avenues for future research. These include optimizing the timing, form and recovery process of the burst mechanism.

Acknowledgement

This research was funded by the Key Research and Development Project of China State Railway Group Company Ltd., under Grant Number [N2021G004].

Table 1
Hyperparameters

| Hyperparameter | AdamW | SGDM | Adaburst |
| --- | --- | --- | --- |
| β₁ (m) | 0.9 | 0.9 | 0.9 |
| β₂ | - | 0.999 | 0.999 |
| lr | 0.02 | 0.001 | 0.001 |
| D | 0.5 | 0.5 | 0.5 |

Table 3
Experimental data on MNIST

| Algorithm | Final test acc | First epoch reaching 99% train acc |
| --- | --- | --- |
| AdamW | 99.06% | 19 |
| SGDM | 99.11% | 11 |
| Adaburst | 99.13% | 15 |

Table 4
Experimental data on CIFAR-10

| Algorithm | Final test acc | First epoch reaching 90% train acc |
| --- | --- | --- |
| AdamW | 93.55% | 12 |
| SGDM | 92.33% | 13 |
| Adaburst | 93.96% | 11 |

Table 5
Experimental data on CIFAR-100

| Algorithm | Final test acc | First epoch reaching 80% train acc |
| --- | --- | --- |
| AdamW | 67.50% | 22 |
| SGDM | 71.21% | 25 |
| Adaburst | 75.39% | 17 |
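
As an arithmetic check on Tables 3 to 5, the following Python sketch computes Adaburst's margin in final test accuracy over the stronger of the two baselines on each dataset. The dictionary layout and the names `final_test_acc` and `adaburst_margin` are illustrative and not taken from the paper; only the accuracy values come from the tables.

```python
# Final test accuracy (%) as listed in Tables 3, 4 and 5.
final_test_acc = {
    "MNIST": {"AdamW": 99.06, "SGDM": 99.11, "Adaburst": 99.13},
    "CIFAR-10": {"AdamW": 93.55, "SGDM": 92.33, "Adaburst": 93.96},
    "CIFAR-100": {"AdamW": 67.50, "SGDM": 71.21, "Adaburst": 75.39},
}


def adaburst_margin(row):
    """Adaburst accuracy minus the better baseline, in percentage points."""
    best_baseline = max(row["AdamW"], row["SGDM"])
    return round(row["Adaburst"] - best_baseline, 2)


margins = {name: adaburst_margin(row) for name, row in final_test_acc.items()}

for name, margin in margins.items():
    print(f"{name}: +{margin:.2f} percentage points")
```

The resulting margins are 0.02, 0.41, and 4.18 percentage points for MNIST, CIFAR-10, and CIFAR-100, respectively. The strongest baseline is SGDM on MNIST and CIFAR-100 and AdamW on CIFAR-10.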