Paper Review: ADAM: A METHOD FOR STOCHASTIC OPTIMIZATION

The ADAM optimization algorithm is an extension of stochastic gradient descent (SGD), which, according to Wikipedia, is an iterative method for optimizing an objective function with suitable smoothness properties. SGD can be regarded as a stochastic approximation of gradient descent, since it replaces the actual gradient, computed over the full dataset, with an estimate computed from a randomly selected minibatch of the data. The ADAM optimization technique is widely used because it achieves good results quickly.
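
To make the SGD baseline concrete, here is a minimal NumPy sketch of one stochastic gradient step. `grad_fn` is a hypothetical callable (not from the paper) that returns a noisy gradient estimate computed on a minibatch.

```python
import numpy as np

def sgd_step(theta, grad_fn, minibatch, lr=0.01):
    """One plain SGD step: move parameters against a noisy gradient estimate.

    theta     : parameter vector (np.ndarray)
    grad_fn   : hypothetical callable returning an estimate of the gradient
                of the objective at theta, computed on a minibatch
    minibatch : the randomly drawn subset of training examples
    lr        : learning rate (step size)
    """
    g = grad_fn(theta, minibatch)   # estimate of the true gradient
    return theta - lr * g
```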

The research paper ADAM: A METHOD FOR STOCHASTIC OPTIMIZATION was authored by Diederik P. Kingma (University of Amsterdam and OpenAI) and Jimmy Lei Ba (University of Toronto).

The authors highlight the following advantages of the ADAM optimization method:

  • It is straightforward to implement.
  • It is computationally efficient.
  • It has low memory requirements.
  • It is invariant to diagonal rescaling of the gradients.
  • It is well suited for problems that are large in terms of data and/or parameters.
  • The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients.
  • The hyper-parameters have intuitive interpretations and typically require little tuning.

    In the paper, ADAM is presented as a method for efficient stochastic optimization that only requires first-order gradients and has low memory requirements. The method computes individual adaptive learning rates for different parameters from estimates of the first and second moments of the gradients. The name Adam is derived from adaptive moment estimation.
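
To make this concrete, here is a NumPy sketch of a single Adam step as described in the paper (Algorithm 1): exponential moving averages of the gradient and the squared gradient, bias correction, and a per-parameter step size. The surrounding training loop and gradient computation are left out.

```python
import numpy as np

def adam_update(theta, g, m, v, t, alpha=0.001,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step (Algorithm 1 in the paper), written with NumPy.

    theta : parameters,  g : gradient estimate at step t (t starts at 1)
    m, v  : running first and second moment estimates (same shape as theta)
    """
    m = beta1 * m + (1 - beta1) * g          # biased first moment estimate
    v = beta2 * v + (1 - beta2) * (g * g)    # biased second raw moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)             # bias-corrected second moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```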

    The authors describe ADAM as a method designed to combine the advantages of two recently popular methods: (i) AdaGrad and (ii) RMSProp.

      (i) AdaGrad (Adaptive Gradient Algorithm) works well with sparse gradients: it maintains a per-parameter learning rate, which improves performance on problems with sparse gradients.

      (ii) RMSProp (Root Mean Square Propagation) works well in online and non-stationary settings: it maintains per-parameter learning rates that are adapted based on a moving average of the recent magnitudes of the gradients for each weight (i.e. how quickly that weight is changing).
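
For contrast, here are textbook-style NumPy sketches of the two underlying updates; these are generic formulations for illustration, not code taken from the Adam paper.

```python
import numpy as np

def adagrad_step(theta, g, G, lr=0.01, eps=1e-8):
    """AdaGrad: accumulate *all* past squared gradients, so rarely-updated
    (sparse) parameters keep a comparatively large effective step size."""
    G = G + g * g                              # lifetime sum of squared gradients
    return theta - lr * g / (np.sqrt(G) + eps), G

def rmsprop_step(theta, g, E, lr=0.001, rho=0.9, eps=1e-8):
    """RMSProp: exponentially decaying average of *recent* squared gradients,
    which lets the step size track non-stationary objectives."""
    E = rho * E + (1 - rho) * g * g            # moving average of squared gradients
    return theta - lr * g / (np.sqrt(E) + eps), E
```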

    Adam Configuration Parameters

  • Alpha: Also referred to as the learning rate or step size; it controls how much the weights are updated at each step.
  • Beta1: The exponential decay rate for the first moment estimates.
  • Beta2: The exponential decay rate for the second moment estimates.
  • Epsilon: A very small number that prevents division by zero in the implementation.

    The Adam paper suggests that good default settings for the tested machine learning problems are alpha = 0.001, beta1 = 0.9, beta2 = 0.999, and epsilon = 10^-8.
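
As an illustration of how these names map onto a common implementation (the framework choice is ours, not the paper's), PyTorch's torch.optim.Adam exposes the same four hyper-parameters, spelled out explicitly below:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # any model; a small linear layer just for illustration

# alpha -> lr, (beta1, beta2) -> betas, epsilon -> eps
optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-3, betas=(0.9, 0.999), eps=1e-8)
```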

    Learning rate decay can also be used with Adam.

    The paper uses a decayed learning rate alpha_t = alpha / sqrt(t), updated each epoch t, for the logistic regression demonstration.
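
A minimal sketch of that schedule, with the epoch index t starting at 1:

```python
import math

alpha = 0.001  # base learning rate from the suggested defaults

def decayed_alpha(t):
    """alpha_t = alpha / sqrt(t), with t the 1-based epoch index."""
    return alpha / math.sqrt(t)

for epoch in range(1, 6):
    print(epoch, decayed_alpha(epoch))   # 0.001, 0.000707..., 0.000577..., ...
```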

    EXPERIMENT: MULTI-LAYER NEURAL NETWORKS

    Multi-layer neural networks are powerful models with non-convex objective functions, and ADAM often outperforms other methods on such non-convex problems. For this experiment, a neural network with two fully connected hidden layers of 1000 hidden units each and ReLU activations is used, with a minibatch size of 128. L2 weight decay and dropout noise are applied to the parameters to prevent overfitting.
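
A PyTorch sketch of the described set-up; the input/output sizes (784 features, 10 classes, i.e. MNIST-like), the dropout probability, and the weight-decay coefficient are illustrative assumptions, not values stated in this review.

```python
import torch
import torch.nn as nn

# Two fully connected hidden layers of 1000 ReLU units, with dropout noise.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 1000), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(1000, 1000), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(1000, 10),
)

# L2 weight decay is expressed through the optimizer's weight_decay term;
# the coefficient here is illustrative, not the paper's value.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
batch_size = 128  # minibatch size used in the experiment
```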

    ADAM is compared against AdaGrad, RMSProp, SGD with Nesterov momentum, AdaDelta, and the sum-of-functions (SFO) optimizer, a quasi-Newton method that works with minibatches. ADAM shows better convergence than the other methods: the corresponding figure in the paper shows that it makes faster progress in terms of both the number of iterations and wall-clock time. According to the authors, SFO is 5-10x slower per iteration than ADAM due to the cost of updating curvature information, and its memory requirement is linear in the number of minibatches.

    EXPERIMENT: CONVOLUTIONAL NEURAL NETWORKS

    This experiment shows the effectiveness of ADAM on deep convolutional neural networks. In practice, a smaller learning rate for the convolution layers is often used when applying SGD. The CNN architecture used here has three alternating stages of 5x5 convolution filters and 3x3 max pooling with a stride of 2, followed by a fully connected layer of 1000 rectified linear hidden units (ReLUs). The minibatch size is set to 128, as in the previous experiments. The input images are pre-processed by whitening, and dropout noise is applied to the input layer and the fully connected layer. The corresponding figure in the paper shows that ADAM and AdaGrad make rapid progress in lowering the cost during the initial stage of training (i.e. the first 3 epochs), while ADAM and SGD eventually converge considerably faster than AdaGrad for CNNs. ADAM shows only a marginal improvement over SGD with momentum, but it adapts the learning rate scale for the different layers instead of requiring it to be hand-picked manually, as in SGD.
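
A PyTorch sketch of the described architecture; the CIFAR-10-like 3x32x32 input size, the channel widths (64, 64, 128), and the dropout probabilities are illustrative assumptions not stated in this review.

```python
import torch
import torch.nn as nn

# Three stages of 5x5 convolution + 3x3 max pooling (stride 2), then a fully
# connected layer of 1000 ReLU units. Dropout noise on the input and the FC input.
model = nn.Sequential(
    nn.Dropout(p=0.2),                                     # dropout noise on the input
    nn.Conv2d(3, 64, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),                 # 32x32 -> 15x15
    nn.Conv2d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),                 # 15x15 -> 7x7
    nn.Conv2d(64, 128, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),                 # 7x7 -> 3x3
    nn.Flatten(),
    nn.Dropout(p=0.5),                                     # dropout on the FC input
    nn.Linear(128 * 3 * 3, 1000), nn.ReLU(),
    nn.Linear(1000, 10),
)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
batch_size = 128
```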

    Extensions

  • AdaMax: a variant of Adam based on the infinity norm, and a surprisingly stable algorithm. This is surprising because Adam-style variants built on an L^p norm generally become numerically unstable for large values of p; in AdaMax's case, however, letting p → ∞ yields a simple and stable algorithm (see the sketch after this list). Good default settings for the tested machine learning problems are α = 0.002, β1 = 0.9 and β2 = 0.999. (With β1^t the paper denotes β1 raised to the power t, which appears in the update's bias-correction term.)
  • Temporal averaging: averaging of the parameter iterates, e.g. Polyak-Ruppert averaging (Polyak & Juditsky, 1992; Ruppert, 1988), has previously been shown in Moulines & Bach (2011) to improve the convergence of standard stochastic gradient descent.
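
As referenced in the AdaMax bullet above, here is a NumPy sketch of the infinity-norm update from the paper's AdaMax extension; the tiny eps guard is our addition to cover an all-zero gradient at the first step (the paper notes that no epsilon correction is needed in practice).

```python
import numpy as np

def adamax_update(theta, g, m, u, t, alpha=0.002, beta1=0.9, beta2=0.999,
                  eps=1e-8):
    """One AdaMax step: the infinity-norm variant of Adam.

    m : first moment estimate,  u : exponentially weighted infinity norm
    t : 1-based step index
    """
    m = beta1 * m + (1 - beta1) * g
    u = np.maximum(beta2 * u, np.abs(g))        # replaces Adam's sqrt(v_hat)
    # eps only guards against an all-zero gradient at the very first step.
    theta = theta - (alpha / (1 - beta1 ** t)) * m / (u + eps)
    return theta, m, u
```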

    Conclusion

    The ADAM method is aimed at machine learning problems with large datasets and/or high-dimensional parameter spaces. The experiments confirm the analysis of the rate of convergence on convex problems, and ADAM was found to be robust and well-suited to a wide range of non-convex optimization problems in the field of machine learning.

    While this article tries to give a general overview of the research paper "ADAM: A METHOD FOR STOCHASTIC OPTIMIZATION", you might still want to read the paper itself if you haven't; it is available on arXiv as arXiv:1412.6980.