Stochastic gradient descent with variance reduction technique

Abstract

Gradient descent is widely used for large-scale optimization problems in machine learning; in deep learning in particular, it plays a major role in computing and correcting the connection strengths of neural networks. However, choosing a proper learning rate for stochastic gradient descent (SGD) can be difficult. A rate that is too small may lead to painfully slow convergence, while one that is too large can hinder convergence. In this paper, we present a novel variance reduction technique, termed SMVRG, which applies the moving average of the gradient. By reducing the variance, SMVRG can take a large learning rate, and it only needs to store the current gradient and the previous average gradient. We apply our method to Long Short-Term Memory (LSTM) networks. Experiments on two data sets, IMDB (movie reviews) and SemEval-2016 (sentiment analysis in Twitter), show that our method improves the results significantly.

Keywords

Learning rate, neural network, moving average, LSTM

1. Introduction

In machine learning and statistics, we often need to optimize an objective function of the form Eq. (1) with respect to a set of parameters. In deep learning, for a given network architecture, one usually starts with an objective function of the weights, which are the connection strengths between units. Most learning algorithms are therefore based on iterative methods, which take small steps in some direction until a desired solution is reached, as in Eq. (2):

$$\min_{w} L(w) = \frac{1}{n} \sum_{i = 1}^{n} φ_{i}(w), \tag{1}$$

$$w_{t + 1} = w_{t} - Δw, \tag{2}$$

where $w_{t}$ denotes the weight at the current iteration t, $w_{t + 1}$ is the weight at the next iteration $t + 1$, and $Δw$ is the correction applied to the weights at iteration t. In Eq. (1), $φ_{1}(w), φ_{2}(w), \dots, φ_{n}(w)$ is a sequence of functions of the weight vector w corresponding to the n training examples $(x_{1}, y_{1}), (x_{2}, y_{2}), \dots, (x_{n}, y_{n})$. If we use $φ_{i}(w) = \ln(1 + \exp(-w^{T} x_{i} y_{i}))$ with $y_{i} \in \{-1, +1\}$, we obtain logistic regression; if we take the squared loss $φ_{i}(w) = {(w^{T} x_{i} - y_{i})}^{2}$, the optimization problem becomes least squares regression [1].

Gradient descent (GD), described by the update rule Eq. (3) for $t = 1, 2, \dots$, is the most basic iterative method:

$$w_{t + 1} = w_{t} - η_{t} ∇L(w_{t}) = w_{t} - \frac{η_{t}}{n} \sum_{i = 1}^{n} ∇φ_{i}(w_{t}). \tag{3}$$

However, each step of gradient descent requires the evaluation of n derivatives, which is expensive. A popular modification of GD is therefore stochastic gradient descent (SGD) [2–6], which updates the parameters with a small gradient step computed from a randomly drawn example (or mini-batch) at each iteration. At each iteration $t = 1, 2, \dots$, we choose $i_{t}$ randomly from $\{1, 2, \dots, n\}$ and apply the update rule Eq. (4):

$$w_{t + 1} = w_{t} - η_{t} ∇φ_{i_{t}}(w_{t}). \tag{4}$$

To solve regularized risk minimization problems, stochastic proximal gradient descent (SPGD) has been proposed for

$$\min_{w} L(w) = \frac{1}{n} \sum_{i = 1}^{n} φ_{i}(w) + θ(w),$$

where we obtain lasso by setting $θ(w) = λ ‖w‖_{1}$ and ridge regression by setting $θ(w) = λ ‖w‖_{2}^{2}$. SPGD is similar to SGD: at each iteration $t = 1, 2, \dots$, we choose $i_{t}$ randomly from $\{1, 2, \dots, n\}$ and apply the update rule Eq. (5),

$$w_{t + 1} = {prox}_{η_{t} θ}\left(w_{t} - η_{t} ∇φ_{i_{t}}(w_{t})\right), \tag{5}$$

where prox is the proximal operator.
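
The update rules in Eqs. (3)–(5) can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: it uses the squared loss from Eq. (1), takes the lasso case $θ(w) = λ ‖w‖_{1}$ (whose proximal operator is soft-thresholding), and all function names and the toy data are hypothetical rather than taken from our implementation.

```python
import numpy as np

def grad_phi(w, x_i, y_i):
    """Gradient of the squared loss phi_i(w) = (w^T x_i - y_i)^2."""
    return 2.0 * (w @ x_i - y_i) * x_i

def gd_step(w, X, y, eta):
    """Eq. (3): average the gradients of all n examples."""
    full_grad = 2.0 * X.T @ (X @ w - y) / len(y)
    return w - eta * full_grad

def sgd_step(w, X, y, eta, rng):
    """Eq. (4): use the gradient of one randomly drawn example i_t."""
    i_t = rng.integers(len(y))
    return w - eta * grad_phi(w, X[i_t], y[i_t])

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (assumed lasso regularizer)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def spgd_step(w, X, y, eta, lam, rng):
    """Eq. (5) with theta(w) = lam * ||w||_1."""
    i_t = rng.integers(len(y))
    return soft_threshold(w - eta * grad_phi(w, X[i_t], y[i_t]), eta * lam)

# Toy usage on synthetic data (hypothetical values).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(20, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5])
w_gd = gd_step(np.zeros(3), X_toy, y_toy, eta=0.05)
w_sgd = sgd_step(np.zeros(3), X_toy, y_toy, eta=0.05, rng=rng)
w_spgd = spgd_step(np.zeros(3), X_toy, y_toy, eta=0.05, lam=0.1, rng=rng)
```

One GD step touches all n examples, whereas one SGD or SPGD step touches a single example, which is the cost difference discussed above.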

The general idea of algorithms such as SGD and SPGD is that the gradient computed on a subset of the data approximates the true gradient on the whole data set. Because these algorithms use mini-batches instead of all the data, they introduce gradient noise and variance. Using all the data to compute the gradient, however, is compute-intensive and requires storing many gradients. Due to the variance introduced by random sampling, SGD converges more slowly than GD. We therefore face a trade-off between fast computation per iteration with slow convergence for SGD and slow computation per iteration with fast convergence for GD. To improve SGD, one needs a variance reduction technique, which allows a large learning rate $η_{t}$.

Several successful modifications of gradient descent have been developed over the past few years. Schmidt et al. [7] proposed the stochastic average gradient (SAG) method, Shalev-Shwartz and Zhang [8] proposed the stochastic dual coordinate ascent (SDCA) method, and the stochastic variance reduced gradient (SVRG) method was proposed in [1]. However, SAG and SDCA require storing all gradients or dual variables, and SVRG periodically computes the average gradient over the whole data set, which is compute-intensive. In this paper, we propose a variance reduction technique that uses the moving average of the gradient in place of the average gradient and keeps a version of the weights that is close to the optimal weights. Thus we only need to store the current gradient and the previous average gradient. The variance reduction technique offers a trade-off between computation per iteration and convergence, and permits a relatively large learning rate. We then apply our method to Long Short-Term Memory (LSTM) and show that it improves the results significantly on two data sets, IMDB (movie reviews) and SemEval-2016 (sentiment analysis in Twitter). In summary, the contributions of this work are as follows:

By replacing the average gradient with a moving average of the gradient, our variance reduction technique allows a large learning rate and reduces the variance and noise of SGD.

Our method does not require storing the full set of gradients and is not compute-intensive; we only need to store the current gradient and the previous average gradient.

The rest of the paper is organized as follows. Section 2 presents our method, stochastic moving-average variance reduction gradient (SMVRG). Section 3 describes the experimental design and discusses the results. The final section concludes the work.

2. Stochastic moving-average variance reduction gradient (SMVRG)

Choosing a proper learning rate can be difficult. A learning rate that is too small leads to painfully slow convergence, while a learning rate that is too large can hinder convergence and cause the loss function to fluctuate around the minimum or even to diverge [9]. We therefore usually need a small learning rate because of the variance of SGD. The general idea of SGD is that the gradient computed on a subset approximates the true gradient on the whole data set; using mini-batches instead of all the data causes gradient noise and variance, and this noise stems from the variance introduced by random sampling.

In this paper, we propose a variance reduction technique that uses the moving average of the gradient as the average gradient, so we only need to store the current gradient and the previous average gradient. At each stage, we keep a version $\tilde{w}$ of the weights that is close to the optimal w. For example, we can set $\tilde{w}$ to the most recent iterate w after every m SGD iterations. The moving average gradient is given by Eq. (6), and the SMVRG algorithm is given in Table 1.

$${\tilde{μ}}_{t} = β {\tilde{μ}}_{t - 1} + (1 - β) ∇φ_{i_{t}}(\tilde{w}). \tag{6}$$

Let $g_{1}, g_{2}, \dots, g_{N}$ be the sequence of gradients at successive time steps, where each gradient is drawn from an underlying gradient distribution $g_{t} \sim p(g_{t})$. ${\tilde{μ}}_{t}$ estimates the moving average of the gradient (also called the first moment estimate), and we have

$${\tilde{μ}}_{t} \approx E{[∇_{w}]}_{t}, \tag{7}$$

where E denotes expectation and $∇_{w}$ is the gradient of the objective function with respect to the parameter w. According to Eq. (7), the moving average ${\tilde{μ}}_{t}$ approximates the average of the gradients, so the expectation of $∇φ_{i}(\tilde{w}) - {\tilde{μ}}_{t}$ over i is approximately zero. Drawing $i_{t}$ randomly from $\{1, 2, \dots, n\}$, the update rule of our method is

$$w_{t + 1} = w_{t} - η_{t} \left(∇φ_{i_{t}}(w_{t}) - ∇φ_{i_{t}}(\tilde{w}) + {\tilde{μ}}_{t}\right). \tag{8}$$

Thus, in expectation, the weights are close to those obtained with the gradient on the whole data set:

$$E[w_{t} \mid w_{t - 1}] \approx w_{t - 1} - η_{t - 1} ∇L(w_{t - 1}). \tag{9}$$

More generally, if we introduce a random variable $ξ_{t - 1}$ that depends on the weight $w_{t - 1}$ at iteration $t - 1$, we obtain another version of SGD,

$$w_{t} = w_{t - 1} - η_{t - 1} g_{t}(w_{t - 1}, ξ_{t - 1}), \tag{10}$$

where the expectation with respect to $ξ_{t - 1}$ is $E[g_{t}(w_{t - 1}, ξ_{t - 1})] = ∇L(w_{t - 1})$. If we let the random variable $ξ_{t - 1} = i_{t}$, we can take $g_{t}(w_{t - 1}, ξ_{t - 1}) = ∇φ_{i_{t}}(w_{t - 1}) - ∇φ_{i_{t}}(\tilde{w}) + {\tilde{μ}}_{t - 1}$, so the update rule Eq. (8), which is the core of our method, has the form of Eq. (10). If both $\tilde{w}$ and $w_{t}$ converge to the optimal value $w_{*}$, namely $∇φ_{i}(\tilde{w}) \to ∇φ_{i}(w_{*})$ and $∇φ_{i}(w_{t}) \to ∇φ_{i}(w_{*})$, then we have

$$∇φ_{i_{t}}(w_{t - 1}) - ∇φ_{i_{t}}(\tilde{w}) \to ∇φ_{i_{t}}(w_{t - 1}) - ∇φ_{i_{t}}(w_{*}) \to 0. \tag{11}$$

Note that each stage requires $2m + 1$ gradient computations. Therefore, it is natural to choose $2m + 1$ to be of the same order as n (for some convex problems, one may save the intermediate gradient $∇φ_{i}(\tilde{w})$, and thus only $m + 1$ gradient computations are needed). For a fair comparison, we therefore compare SGD with our method based on the number of gradient computations.
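
The effect described around Eq. (11) can be illustrated numerically. The sketch below is a toy illustration, not part of our experiments: it assumes a synthetic least-squares problem, a snapshot `w_tilde` close to the current iterate `w`, and a moving average `mu` built as in Eq. (6). It compares the spread over i of the plain SGD direction with that of the direction used in Eq. (8), and also reports how far `mu` is from the exact average gradient, since Eq. (7) is only an approximation. All names and constants are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=n)

def per_example_grads(w):
    """Row i is the gradient of phi_i(w) = (w^T x_i - y_i)^2."""
    return 2.0 * (X @ w - y)[:, None] * X

w_tilde = rng.normal(size=d)               # snapshot weights
w = w_tilde + 0.01 * rng.normal(size=d)    # current iterate, close to the snapshot

# Eq. (6): moving average of stochastic gradients evaluated at the snapshot.
beta, mu = 0.95, np.zeros(d)
grads_tilde = per_example_grads(w_tilde)
for _ in range(2000):
    mu = beta * mu + (1.0 - beta) * grads_tilde[rng.integers(n)]

plain = per_example_grads(w)               # SGD directions, one row per i
reduced = plain - grads_tilde + mu         # Eq. (8) directions, one row per i

def total_variance(directions):
    """Sum over coordinates of the variance across examples i."""
    return directions.var(axis=0).sum()

var_plain = total_variance(plain)
var_reduced = total_variance(reduced)
# Gap between the mean reduced direction and the full gradient at w;
# it equals the error of mu as an estimate of the average gradient at w_tilde.
bias = np.linalg.norm(reduced.mean(axis=0) - plain.mean(axis=0))
print(var_plain, var_reduced, bias)
```

Because `mu` does not depend on the index drawn in the current step, the spread of the reduced direction comes only from the difference of the two gradients, which shrinks as `w` and `w_tilde` approach each other.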

Table 1
The algorithm of SMVRG

Algorithm: SMVRG, our proposed algorithm for stochastic optimization. $g_{t}$ is the gradient at the training step t.
η: step size, the value is 0.001.
β: exponential decay rate for moment estimates, the value is 0.95.
$w_{0}$: initial parameter vector
$β_{0} \leftarrow 0$: initial exponential decay rate for moment estimates
${\tilde{μ}}_{0} \leftarrow 0$: initial first moment estimate
$\tilde{w} = w_{0}$
For $s \in [1, 2, \dots]$ do
  Randomly pick $i_{t} \in \{1, 2, \dots, n\}$
  ${\tilde{μ}}_{t} = β {\tilde{μ}}_{t - 1} + (1 - β) ∇φ_{i_{t}}(\tilde{w})$
  For $t \in [1, 2, \dots, m]$ do
    Randomly pick $i_{t} \in \{1, 2, \dots, n\}$ and update weight
    $w_{t + 1} = w_{t} - η_{t} (∇φ_{i_{t}}(w_{t}) - ∇φ_{i_{t}}(\tilde{w}) + {\tilde{μ}}_{t})$
  End
  Set $\tilde{w} = w_{t + 1}$
End
Until the stopping criterion is met.
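
The steps of Table 1 can be sketched in NumPy as follows. This is a minimal sketch, not our actual implementation. Three choices are assumptions: the inner loop starts from the snapshot $\tilde{w}$ (Table 1 does not state the starting point), the stopping criterion is a fixed number of stages, and the step size is constant. The function name `smvrg` and the callback `grad_phi(w, i)`, which returns the gradient of $φ_{i}$ at w, are hypothetical.

```python
import numpy as np

def smvrg(grad_phi, w0, n, m, num_stages, eta=0.001, beta=0.95, seed=0):
    """Sketch of Table 1; grad_phi(w, i) returns the gradient of phi_i at w."""
    rng = np.random.default_rng(seed)
    w_tilde = np.array(w0, dtype=float)      # snapshot weights, copied from w0
    mu = np.zeros_like(w_tilde)              # initial first moment estimate
    for _ in range(num_stages):              # outer loop over stages s
        i = rng.integers(n)
        # Eq. (6): moving average of gradients at the snapshot.
        mu = beta * mu + (1.0 - beta) * grad_phi(w_tilde, i)
        w = w_tilde.copy()                   # assumption: start from the snapshot
        for _ in range(m):                   # inner loop over t
            i = rng.integers(n)
            # Eq. (8): variance-reduced direction.
            direction = grad_phi(w, i) - grad_phi(w_tilde, i) + mu
            w = w - eta * direction
        w_tilde = w                          # set the snapshot to the last iterate
    return w_tilde

# Toy usage: least squares on synthetic data (hypothetical values).
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5])
toy_grad = lambda w, i: 2.0 * (w @ X_toy[i] - y_toy[i]) * X_toy[i]
w_hat = smvrg(toy_grad, np.zeros(3), n=100, m=20, num_stages=50, eta=0.01)
```

Each stage in this sketch evaluates one gradient for the moving average and two per inner step, which matches the $2m + 1$ count used for the comparison with SGD.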

3. Experiments

We test the proposed algorithm on well-known benchmark problems for sentiment classification of movie reviews and tweets. Using large models and data sets, we demonstrate that the proposed algorithm can efficiently solve practical deep learning problems. SGD is one of the most common algorithms for training neural networks, and SPGD is a closely related variant, so the experiments in this section compare the proposed algorithm with SGD and SPGD. In this paper, we set $m = 5(n - 1) + 4$ [1], and we measure the progress of our method as the number of gradient computations divided by n. We use the same parameter initialization for all optimization algorithms, and the results are reported using the same learning rate ($η = 0.001$ in SGD and SPGD). In this section, we first introduce the data sets and network architectures. Then we analyze the experiments and the performance of our method.
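
As a rough check of this accounting, assuming that n is the number of training examples and that the count of $2m + 1$ gradient computations per stage from Section 2 applies, the chosen m gives:

$$\begin{aligned} m &= 5(n - 1) + 4 = 5n - 1, \\ 2m + 1 &= 2(5n - 1) + 1 = 10n - 1, \\ \frac{2m + 1}{n} &= 10 - \frac{1}{n} \approx 10. \end{aligned}$$

Under these assumptions, one stage of our method corresponds to roughly 10 units of gradient computations divided by n.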

3.1. Datasets

The experiments are conducted on two data sets; their label distributions are given in Table 2.

IMDB. The IMDB movie review data set (IMDB for short; see http://www.csie.ntu.edu.tw/~cjlin/libsvm/) contains 50000 reviews, with no more than 30 reviews per movie, and each review comes with a binary sentiment polarity label. The overall distribution of labels is balanced (25000 positive and 25000 negative). All reviews are grouped into four subsets based on the length of the text, including (1) short texts with a length of at most 140; (2) medium-length texts with a length between 141 and 200; and (3) ordinary-length texts with a length between 201 and 400. We divided the data set evenly into 25000 reviews for training and 25000 for testing [10].

SemEval-2016. The SemEval-2016 data set, namely Sentiment Analysis in Twitter (see http://alt.qcri.org/semeval2016/task4/), is a 3-way sentiment polarity classification data set. It is thus a “single-label multi-class” classification (SLMCC) task. All tweets are short, with a length of less than 140 words. We obtained the whole set of training tweets, which contains 6000 tweets; we use 80% of them as the training set and the remaining 20% as the test set [11].

Table 2

Label distributions of two datasets

Datasets	All	Positive	Negative	Neutral
IMDB	50000	25000	25000	–
SemEval-2016	6000	3094	863	2043

3.2. Network architectures

In this work, we conducted the experiments on LSTM (Long Short-Term Memory), which is a kind of recurrent neural network. LSTM enforces constant error flow through ‘constant error carrousels’ within special units, and the multiplicative gate units of LSTM learn to open and close access to the constant error flow [12,13]. We trained a network with 128 hidden units and a final softmax output layer on top. Our method was trained on mini-batches of 32 movie reviews or tweets for 50 epochs through the training set. We use the standard deterministic cross-entropy objective function with respect to the parameters as the cost function to evaluate the fitness of the trained model:

$$C = - \frac{1}{n} \sum_{x} [y \ln a + (1 - y) \ln(1 - a)]. \tag{12}$$

Equation (12) is the cross-entropy objective function, where n is the total number of training samples, the sum runs over the training inputs x, y is the desired output, $a = σ(z)$ is the actual output, and $σ(\cdot)$ is the activation function; in this paper, we use the sigmoid function. Here $z = \sum_{j} w_{j} x_{j} + b$, where w, x and b are the weights, inputs and biases of the neural network [14].
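
Equation (12) can be written directly in NumPy. The sketch below is illustrative only: the weights, inputs, bias and labels are hypothetical toy values, and the clipping constant `eps` is an added assumption to avoid taking the logarithm of zero.

```python
import numpy as np

def sigmoid(z):
    """Activation function sigma(z)."""
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y, a, eps=1e-12):
    """Eq. (12): mean cross-entropy between desired outputs y and actual outputs a."""
    a = np.clip(a, eps, 1.0 - eps)   # assumption: clip to keep the logarithms finite
    return -np.mean(y * np.log(a) + (1.0 - y) * np.log(1.0 - a))

# Toy usage with hypothetical weights, bias, inputs and labels.
w = np.array([0.5, -1.0])
b = 0.1
X = np.array([[1.0, 2.0], [3.0, 0.5], [-1.0, 1.0]])
y = np.array([0.0, 1.0, 0.0])
z = X @ w + b                        # z = sum_j w_j x_j + b for each sample
a = sigmoid(z)
C = cross_entropy(y, a)
```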

3.3. Result

3.3.1. Dropout

The key idea of dropout is to randomly drop units, along with their connections, from the neural network during training, which prevents units from co-adapting too much and thus reduces over-fitting. According to [15], during training, dropout samples from an exponential number of different “thinned” networks; at test time, the effect of averaging the predictions of all these thinned networks is easily approximated by simply using a single unthinned network that has smaller weights.
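
The two phases of dropout described above can be sketched as follows. This is a simplified illustration applied to a feature vector rather than to a full network; the function names and the toy input are hypothetical, and scaling the activations by the keep probability at test time is used here as a stand-in for the "smaller weights" of the unthinned network.

```python
import numpy as np

def dropout_train(x, keep_prob, rng):
    """Training: keep each unit independently with probability keep_prob."""
    mask = rng.random(x.shape) < keep_prob
    return x * mask

def dropout_test(x, keep_prob):
    """Test: use every unit, scaled down by keep_prob."""
    return x * keep_prob

# Toy usage: 50% dropout on a hypothetical bag-of-words count vector.
rng = np.random.default_rng(0)
features = np.array([2.0, 0.0, 1.0, 0.0, 3.0, 1.0])
thinned = dropout_train(features, keep_prob=0.5, rng=rng)
averaged = dropout_test(features, keep_prob=0.5)
```

Averaged over many random masks, the training-time output has the same expectation as the scaled test-time output, which is why a single scaled network can stand in for the ensemble of thinned networks.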

We pre-process the IMDB movie review data set and the SemEval-2016 (Sentiment Analysis in Twitter) data set into a bag-of-words (BoW) feature vector for each review and tweet; these vectors are highly sparse. As suggested in [16], we apply stochastic regularization (in this paper, 50% dropout noise) to the BoW features during training; such methods are an effective way to prevent over-fitting and are often used in practice due to their simplicity. Figure 1 shows the accuracies and errors of our method over 50 epochs with dropout on both data sets. Figure 2 compares the performance (train accuracies and train errors) of SGD and our method with and without dropout. Comparing the curves without dropout and with dropout, the train accuracies are higher and the test errors are lower with dropout. Thus, dropout is a useful technique for improving the performance of neural networks. Dropout [16,17] is applied during training to prevent over-fitting, and we compare the effectiveness of our method with other methods on LSTM trained with dropout noise. The x-axis of the figures for our method is the number of gradient computations divided by n.

Fig. 1.

Left: the accuracies and errors of our method on IMDB for 50 epochs; Right: the accuracies and errors of our method on SemEval-2016 for 50 epochs. Top: our method on train set; Bottom: our method on test set.

Fig. 2.

Top: the train accuracies and train errors of SGD, SMVRG on IMDB for 50 epochs; Bottom: the train accuracies and train errors of SGD, SMVRG on SemEval-2016 for 50 epochs.

Fig. 3.

Top: the train accuracies and train errors of SGD, SPGD and SMVRG on IMDB for 50 epochs; Bottom: the train accuracies and train errors of SGD, SPGD and SMVRG on SemEval-2016 for 50 epochs.

3.3.2. Classification

In Fig. 3, we compare SGD, SPGD and our method on the training set in terms of train accuracy and train loss. The commonly used SGD method performs worse in this case, while SMVRG performs well over the 50 epochs of training. The empirical performance of our algorithm with the variance reduction technique is consistent with our theory, and our algorithm obtains the best performance. Figure 4 indicates the strength of SMVRG and the weakness of SGD: the training loss of SGD with a relatively large learning rate drops fast at first, but then oscillates above the minimum. With a small learning rate, the minimum may eventually be approached, but convergence is slow. In contrast, with a relatively large learning rate, SMVRG decreases smoothly, goes down faster than SGD and converges fast. Figure 5 shows the parameter updates $Δ w_{t}$ for 2 randomly selected weights of the network. The parameter updates tend towards zero near the end of training, and this occurs smoothly for each of the weight matrices, effectively operating as if an annealing schedule were present. It can be shown that asymptotically the variance of our method goes to zero, and thus faster convergence can be achieved.

Fig. 4.

Top: the train accuracies and train errors of SGD with different learning rates and SMVRG on IMDB for 50 epochs; Bottom: the train accuracies and train errors of SGD with different learning rates and SMVRG on SemEval-2016 for 50 epochs.

Fig. 5.

Top: Parameter updates $Δ w_{t}$ for 2 randomly selected weights of the network of SMVRG on IMDB for 50 epochs; Bottom: Parameter updates $Δ w_{t}$ for 2 randomly selected weights of the network of SMVRG on SemEval-2016 for 50 epochs.

4. Conclusion

In this paper, we have introduced a variance reduction technique for SGD that uses the moving average of the gradient as the average gradient (SMVRG). SMVRG is a simple and computationally efficient way to address the trade-off between fast computation per iteration with slow convergence for SGD and slow computation per iteration with fast convergence for GD, and it can take a large learning rate because the variance reduction technique replaces the average gradient with the moving average of the gradient. Using LSTM (Long Short-Term Memory), which is a kind of recurrent neural network, we show promising results compared to other methods on two real data sets, the movie review data set (IMDB) and SemEval-2016 (Sentiment Analysis in Twitter). Our method improves the training accuracies and training errors to some extent and converges fast.

Acknowledgements

The work was supported by the Fundamental Research Funds For the Central Universities (No. XDJK2017D059), Scientific and Technological Research Program of Chongqing University of Education (No. KY2016TZ02 and No. 2017XJPT07), Key Research Program of Chongqing Education Science 13th Five-Year Plan 2017 (No. 2017-GX-139). Li Li is the corresponding author for the paper.

References

1. Johnson and Zhang, Accelerating stochastic gradient descent using predictive variance reduction, in: Advances in Neural Information Processing Systems, 2013, pp. 315–323.

2. Dauphin, Pascanu, Gulcehre, Cho, Ganguli and Bengio, Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, in: Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS’14), Vol. 2, MIT Press, Cambridge, MA, 2014, pp. 2933–2941.

3. Robbins and Monro, A stochastic approximation method, Annals of Mathematical Statistics 22 (1951), 400–407. doi:10.1214/aoms/1177729586.

4. Darken, Chang and Moody, Learning rate schedules for faster stochastic gradient search, in: Neural Networks for Signal Processing II Proceedings of the 1992 IEEE Workshop, IEEE, New York, 1992, pp. 1–11. doi:10.1109/NNSP.1992.253713.

5. R.S. Sutton, Two problems with backpropagation and other steepest-descent learning procedures for networks, in: Proceedings of the Eighth Annual Conference of the Cognitive Science Society, Erlbaum, Hillsdale, NJ, 1986.

6. Bottou, Stochastic gradient learning in neural networks, in: Proceedings of Neuro-Nimes, Vol. 91, 1991, pp. 687–706.

7. Schmidt, Le Roux and Bach, Erratum to minimizing finite sums with the stochastic average gradient, Mathematical Programming 162(1–2) (2013), 113. doi:10.1007/s10107-016-1051-1.

8. Shalev-Shwartz and Zhang, Stochastic dual coordinate ascent methods for regularized loss minimization, Journal of Machine Learning Research 14(1) (2013), 567–599.

9. Schaul, Zhang and LeCun, No more pesky learning rates, in: Proceedings of the 30th International Conference on Machine Learning (ICML’13), Vol. 28, 2013, pp. 343–351.

10. A.L. Maas, R.E. Daly, P.T. Pham, Huang, A.Y. Ng and Potts, Learning word vectors for sentiment analysis, in: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (HLT’11), Vol. 1, Association for Computational Linguistics, Stroudsburg, PA, 2011, pp. 142–150.

11. Nakov, Ritter, Rosenthal, Sebastiani and Stoyanov, Evaluation measures for the SemEval-2016 Task 4 “Sentiment Analysis in Twitter”, available at http://alt.qcri.org/semeval2016/task4/.

12. Hochreiter and Schmidhuber, Long short-term memory, Neural Computation 9(8) (1997), 1735–1780. doi:10.1162/neco.1997.9.8.1735.

13. Guo, Backpropagation through time, Unpublished manuscript, Harbin Institute of Technology, 2013.

14. Steijvers and Grunwald, A recurrent network that performs a context-sensitive prediction task, in: Proceedings of the 18th Annual Conference of the Cognitive Science Society, Morgan Kaufmann, San Mateo, CA, 1996, pp. 335–339.

15. Srivastava, Hinton, Krizhevsky, Sutskever and Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, Journal of Machine Learning Research 15 (2014), 1929–1958.

16. Wang and C.D. Manning, Fast dropout training, in: Proceedings of the 30th International Conference on Machine Learning (ICML’13), Vol. 28, ACM, New York, 2013, pp. 118–126.

17. Babaeizadeh, Smaragdis and R.H. Campbell, NoiseOut: A simple way to prune neural networks, arxiv preprint, 2016, available at https://arxiv.org/pdf/1611.06211.