# optim.Adam vs optim.SGD: let's dive in

First, let's talk about what we mean by optimizing a model. Simply put, we want the model to train to the highest accuracy it can reach given resource constraints like time, computing power, and memory. Optimization has a broad scope, and you can also tweak the architecture to get a better model, but that comes with intuition developed through experience. Given a fixed architecture, PyTorch's torch.optim package implements various optimization algorithms. Here we discuss the two most widely used techniques: stochastic gradient descent (optim.SGD) and Adam's method (optim.Adam).

1. SGD:

We know that the gradient is the rate of change of the loss function w.r.t. the weights, a.k.a. the model parameters. In batch gradient descent the loss can be, for example, the mean of the squared errors accumulated over the entire training dataset, so the weights are updated only once per pass over the data. This heads smoothly toward the exact minimum but requires heavy computation time and many epochs to reach that point.

In SGD, on the other hand, the weights are updated after each individual training sample:

```python
for i in range(nb_epochs):
    np.random.shuffle(data)
    for example in data:
        params_grad = evaluate_gradient(loss_function, example, params)
        params = params - learning_rate * params_grad
```
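The loop above is pseudocode (`evaluate_gradient`, `data`, and `params` are not defined). A minimal runnable version with NumPy, fitting a one-parameter least-squares model — the dataset, learning rate, and epoch count here are illustrative choices, not from the original text:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy dataset: y = 3*x plus a little noise
X = rng.standard_normal(200)
y = 3.0 * X + 0.1 * rng.standard_normal(200)

w = 0.0             # single model parameter
learning_rate = 0.1

for epoch in range(20):
    indices = rng.permutation(len(X))            # shuffle each epoch
    for i in indices:
        # gradient of the squared error (w*x - y)**2 for ONE sample
        grad = 2 * (w * X[i] - y[i]) * X[i]
        w = w - learning_rate * grad

print(w)  # close to the true slope 3.0
```

Note that each update uses the gradient of a single sample's loss, which is why the estimate of `w` jitters around the true value instead of settling exactly on it.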

From the official PyTorch documentation, the SGD function has the following definition:

```python
torch.optim.SGD(params, lr=<required parameter>, momentum=0, dampening=0, weight_decay=0, nesterov=False)
```

Arguments:

• params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
• lr (float) – learning rate
• momentum (float, optional) – momentum factor (default: 0)
• weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
• dampening (float, optional) – dampening for momentum (default: 0)
• nesterov (bool, optional) – enables Nesterov momentum (default: False)
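A minimal sketch of how these arguments are used in one training step — the linear model, batch, and hyperparameter values are placeholder choices for illustration:

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)                              # toy model
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

x = torch.randn(32, 10)                               # dummy batch
target = torch.randn(32, 1)

optimizer.zero_grad()                                 # clear old gradients
loss = nn.functional.mse_loss(model(x), target)
loss.backward()                                       # compute gradients
optimizer.step()                                      # update the parameters
```

The same three calls — `zero_grad()`, `backward()`, `step()` — are repeated for every batch; only the optimizer's constructor changes when you switch algorithms.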

Most of the arguments stated above are, I believe, self-explanatory, except momentum and nesterov. Momentum keeps a running velocity built from past gradients, so updates pick up speed along directions where gradients consistently agree — which also helps the optimizer avoid getting stuck in a local minimum.
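Classical momentum can be seen in isolation on a toy problem; a minimal NumPy sketch minimizing f(x) = x², with illustrative values for `mu` and the learning rate:

```python
import numpy as np

# minimize f(x) = x**2, whose gradient is df/dx = 2*x
x = 5.0
v = 0.0             # velocity
mu = 0.9            # momentum factor
learning_rate = 0.05

for _ in range(200):
    dx = 2 * x
    v = mu * v - learning_rate * dx   # velocity accumulates past gradients
    x += v                            # update uses the velocity, not dx directly

print(abs(x))  # close to 0, the minimum
```

With `mu = 0` this reduces to plain SGD; larger `mu` gives the update more "inertia".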

Nesterov Momentum

This type of momentum has a slightly different methodology: the weight update depends both on the classical momentum term and on the gradient evaluated one step ahead, at the point the present momentum would carry us to. The update below shows what I'm trying to say:

```python
x_ahead = x + mu * v
# evaluate dx_ahead (the gradient at x_ahead instead of at x)
v = mu * v - learning_rate * dx_ahead
x += v
```
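That snippet is pseudocode (`dx_ahead` must be computed from a concrete function). A runnable NumPy version on the same toy quadratic as before, with illustrative constants:

```python
import numpy as np

# minimize f(x) = x**2 with Nesterov momentum
x = 5.0
v = 0.0
mu = 0.9
learning_rate = 0.05

for _ in range(200):
    x_ahead = x + mu * v           # look ahead along the current velocity
    dx_ahead = 2 * x_ahead         # gradient of x**2, taken at the look-ahead point
    v = mu * v - learning_rate * dx_ahead
    x += v

print(abs(x))  # near the minimum at 0
```

The only difference from classical momentum is where the gradient is evaluated: at `x_ahead` rather than at `x`.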

In fact, it is often said that SGD with Nesterov momentum can perform as well as Adam's technique.

You can totally skip the details, because in code you only need to pass values to the arguments. But it is good to know in depth everything we want to learn.

2. Adam:

From the official documentation, the Adam function takes the following arguments:

Arguments:

• params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
• lr (float, optional) – learning rate (default: 1e-3)
• betas (Tuple[float, float], optional) – coefficients used for computing running averages of the gradient and its square (default: (0.9, 0.999))
• eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
• weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
• amsgrad (bool, optional) – whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond (default: False)
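Switching an SGD training loop to Adam only changes the optimizer's constructor; a minimal sketch, again assuming a toy linear model and made-up data:

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)                               # toy model
optimizer = optim.Adam(model.parameters(), lr=1e-3)    # defaults: betas=(0.9, 0.999), eps=1e-8

x = torch.randn(32, 10)                                # dummy batch
target = torch.randn(32, 1)

for _ in range(5):                                     # a few optimization steps
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), target)
    loss.backward()
    optimizer.step()
```

The rest of the loop is identical to the SGD case; the adaptive behaviour lives entirely inside `optimizer.step()`.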

Adam's method, considered a method of stochastic optimization, is a technique implementing an adaptive learning rate per parameter, whereas in plain SGD the learning rate has the same kind of effect on all of the model's weights/parameters.

Let me show you the actual update equations for Adam, to give you an intuition of the adaptive learning rate per parameter:

```python
m = beta1*m + (1-beta1)*dx
mt = m / (1-beta1**t)
v = beta2*v + (1-beta2)*(dx**2)
vt = v / (1-beta2**t)
x += -learning_rate * mt / (np.sqrt(vt) + eps)
```
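Wrapping those update lines in a loop shows the behaviour on a concrete problem; a minimal NumPy sketch minimizing f(x) = sum(x²) for two parameters with different gradient magnitudes (the initial values, learning rate, and step count are illustrative):

```python
import numpy as np

x = np.array([5.0, -3.0])      # two parameters with different-sized gradients
m = np.zeros_like(x)           # first-moment estimate (momentum)
v = np.zeros_like(x)           # second-moment estimate (squared gradients)
beta1, beta2 = 0.9, 0.999
learning_rate, eps = 0.1, 1e-8

for t in range(1, 501):                        # t starts at 1 for bias correction
    dx = 2 * x                                 # gradient of sum(x**2)
    m = beta1*m + (1-beta1)*dx
    mt = m / (1-beta1**t)                      # bias-corrected first moment
    v = beta2*v + (1-beta2)*(dx**2)
    vt = v / (1-beta2**t)                      # bias-corrected second moment
    x += -learning_rate * mt / (np.sqrt(vt) + eps)

print(np.abs(x))  # both parameters end up near 0
```

Because each parameter is divided by the square root of its own `vt`, both coordinates take comparably sized steps despite having different gradient magnitudes — that is the per-parameter adaptivity.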

We see that Adam combines two tricks: one is momentum (the running average m of the gradients), and the other is the adaptive per-parameter learning rate (dividing each update by the square root of the running average v of the squared gradients, so parameters with a history of large gradients take smaller steps).