The ADAM optimization algorithm is an extension of stochastic gradient descent (SGD), which, according to Wikipedia, is an iterative method for optimizing an objective function with suitable smoothness properties. SGD can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient with an estimate of it computed from a subset of the data.
The ADAM optimization technique is widely used because it achieves good results quickly.
The research paper "ADAM: A METHOD FOR STOCHASTIC OPTIMIZATION" was authored by Diederik P. Kingma (University of Amsterdam, OpenAI) and Jimmy Lei Ba (University of Toronto).
The authors highlighted the advantages of the ADAM optimization method to be:
- Straightforward to implement
- Computationally efficient
- Little memory requirements
- Invariant to diagonal rescaling of the gradients
- Well suited for problems that are large in terms of data and/or parameters
The authors describe ADAM as a method designed to combine the advantages of two recently popular methods: i) AdaGrad and ii) RMSProp.
i) AdaGrad (Adaptive Gradient Algorithm) maintains a per-parameter learning rate, which improves performance on problems with sparse gradients.
ii) RMSProp (Root Mean Square Propagation) works well in online and non-stationary settings. It also maintains per-parameter learning rates, but adapts them based on the average of recent magnitudes of the gradients for each weight (i.e., how quickly it is changing).
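To make this combination concrete, here is a minimal NumPy sketch of the ADAM update rule from the paper (moving averages of the gradient and squared gradient, bias correction, then a scaled step). The function name and the toy quadratic objective are illustrative, not from the paper.

```python
import numpy as np

def adam_update(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM step: update first/second moment estimates, correct their bias, step."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                # bias-corrected second moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize f(theta) = theta^2, whose gradient is 2 * theta.
theta = np.array([5.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 1001):
    grad = 2 * theta
    theta, m, v = adam_update(theta, grad, m, v, t, alpha=0.1)
print(theta)  # converges toward 0
```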
Adam Configuration Parameters
The Adam paper suggests that good default settings for the tested machine learning problems are alpha = 0.001, beta1 = 0.9, beta2 = 0.999, and epsilon = 10^-8.
Learning rate decay can also be used with Adam.
The paper uses a decayed learning rate alpha_t = alpha / sqrt(t), updated each epoch t, for the logistic regression demonstration.
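As a sketch of how these settings might be wired up, assuming a PyTorch model, the defaults above can be combined with a 1/sqrt(t) per-epoch schedule as below; the stand-in linear model and dummy data are assumptions, not the paper's code.

```python
import math
import torch

model = torch.nn.Linear(784, 10)  # stand-in model (e.g. logistic regression on MNIST)

# Defaults suggested in the paper: alpha=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8.
optimizer = torch.optim.Adam(model.parameters(), lr=0.001,
                             betas=(0.9, 0.999), eps=1e-8)

# alpha_t = alpha / sqrt(t), applied once per epoch (t starts at 1).
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: 1.0 / math.sqrt(epoch + 1))

x = torch.randn(128, 784)            # dummy minibatch of 128 examples
y = torch.randint(0, 10, (128,))
for epoch in range(10):
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()                 # decays the learning rate after each epoch
```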
EXPERIMENT: MULTI-LAYER NEURAL NETWORKS
Multi-layer neural networks are powerful models with non-convex objective functions, and ADAM often outperforms other methods on such non-convex problems. A neural network with two fully connected hidden layers of 1000 hidden units each and ReLU activation is used for this experiment, with a minibatch size of 128. L2 weight decay and dropout noise are applied to the parameters to prevent overfitting. The methods compared against ADAM for training the multi-layer neural network are:
- AdaGrad
- RMSProp
- SGDNesterov
- AdaDelta
- SFO (sum-of-functions optimizer)
ADAM shows better convergence than the other methods; a sketch of this setup is given after this section. The figure above shows that ADAM makes faster progress in terms of both the number of iterations and wall-clock time. According to the authors, SFO is 5-10x slower per iteration than ADAM due to the cost of updating curvature information, and it has a memory requirement that is linear in the number of minibatches.
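For illustration, a minimal PyTorch sketch of this experimental setup: two fully connected hidden layers of 1000 ReLU units with dropout, minibatch size 128, and ADAM with L2 weight decay. The MNIST-sized input/output dimensions, dropout rates, and weight-decay value are assumptions.

```python
import torch
from torch import nn

# Two fully connected hidden layers of 1000 ReLU units each, with dropout noise.
mlp = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 1000), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(1000, 1000), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(1000, 10),
)

# ADAM with L2 weight decay; the weight_decay value here is an assumption.
optimizer = torch.optim.Adam(mlp.parameters(), lr=0.001, weight_decay=1e-4)
loss_fn = nn.CrossEntropyLoss()

# One illustrative update on a dummy minibatch of size 128.
x = torch.randn(128, 1, 28, 28)
y = torch.randint(0, 10, (128,))
optimizer.zero_grad()
loss = loss_fn(mlp(x), y)
loss.backward()
optimizer.step()
```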
EXPERIMENT: CONVOLUTIONAL NEURAL NETWORKS
This experiment shows the effectiveness of ADAM with deep CNNs. In practice, a smaller learning rate is often used for the convolutional layers when applying SGD. The CNN architecture used has three alternating stages of 5x5 convolution filters and 3x3 max pooling with a stride of 2, followed by a fully connected layer of 1000 rectified linear hidden units (ReLUs). The minibatch size is set to 128, as in the previous experiment. The input images are pre-processed by whitening, and dropout noise is applied to the input and the fully connected layer.
From the figure above, we can see that ADAM and AdaGrad make rapid progress in lowering the cost during the initial stage of training (i.e., the first 3 epochs). Meanwhile, ADAM and SGD eventually converge considerably faster than AdaGrad for CNNs. ADAM shows marginal improvement over SGD with momentum, but it adapts the learning rate scale for the different layers instead of requiring it to be hand-picked, as in SGD.
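A rough PyTorch sketch of the described CNN: three stages of 5x5 convolutions and 3x3 max pooling with stride 2, followed by a fully connected layer of 1000 ReLU units, trained with ADAM. The channel counts, padding, dropout rates, and CIFAR-10-sized input (3x32x32) are assumptions for illustration.

```python
import torch
from torch import nn

cnn = nn.Sequential(
    nn.Dropout(p=0.2),                               # dropout noise on the input
    nn.Conv2d(3, 64, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),           # 32x32 -> 15x15
    nn.Conv2d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),           # 15x15 -> 7x7
    nn.Conv2d(64, 128, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),           # 7x7 -> 3x3
    nn.Flatten(),
    nn.Linear(128 * 3 * 3, 1000), nn.ReLU(),
    nn.Dropout(p=0.5),                               # dropout on the fully connected layer
    nn.Linear(1000, 10),
)

optimizer = torch.optim.Adam(cnn.parameters(), lr=0.001)

# One illustrative step on a dummy minibatch of 128 (already whitened) images.
x = torch.randn(128, 3, 32, 32)
y = torch.randint(0, 10, (128,))
optimizer.zero_grad()
loss = nn.functional.cross_entropy(cnn(x), y)
loss.backward()
optimizer.step()
```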
Conclusion
The ADAM method is aimed at machine learning problems with large datasets or high-dimensional parameter spaces. The experiments confirm the analysis of the rate of convergence in convex problems. ADAM was found to be robust and well-suited to a wide range of non-convex optimization problems in the field of machine learning.
While this article tries to give a general overview of the research paper "ADAM: A METHOD FOR STOCHASTIC OPTIMIZATION", you might still need to go through the paper if you haven't: https://arxiv.org/abs/1412.6980