A gradient measures how much the error changes in response to a change in the weights. It is also known as the slope of a function: the higher the gradient, the steeper the slope and the faster a model can learn. If the slope is zero, however, the model stops learning. Mathematically, a gradient is the vector of partial derivatives of a function with respect to its inputs.

  • In machine learning, a gradient is the derivative of a function that has several input variables. Known in mathematics as the slope of a function, the gradient simply quantifies the change in the error with respect to a change in the weights.
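To make the definition concrete, here is a minimal sketch in Python (using NumPy) that computes the gradient of a squared-error loss with respect to two weights, both analytically and by finite differences. The inputs x, the target y, and the weights w are illustrative values, not taken from any particular dataset.

```python
import numpy as np

# Minimal sketch: the gradient of a squared-error loss with respect to the
# weights of a linear model, for a single data point. All values are toy.
x = np.array([1.0, 2.0])   # input features
y = 3.0                    # target
w = np.array([0.5, -0.5])  # current weights

def loss(w):
    # Squared error for a linear model: (w·x - y)^2
    return (w @ x - y) ** 2

# Analytic gradient: d(loss)/dw = 2 * (w·x - y) * x
grad = 2 * (w @ x - y) * x

# Numerical check via central finite differences, one weight at a time
eps = 1e-6
num_grad = np.array([
    (loss(w + eps * np.eye(len(w))[i]) - loss(w - eps * np.eye(len(w))[i])) / (2 * eps)
    for i in range(len(w))
])

print(grad)      # analytic gradient
print(num_grad)  # should closely match the analytic gradient
```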

Gradient descent and cost function

Plotting the cost function as the optimization proceeds is an excellent way to make sure gradient descent is working properly. Put the number of iterations on the x-axis and the value of the cost function on the y-axis. This lets you see the value of the cost function after each round of gradient descent, which gives you an immediate sense of how suitable your learning rate is. You can simply try different values and plot the resulting curves together.

  • The cost function should decrease with each iteration if gradient descent is running properly.
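As an illustration, the sketch below plots a hypothetical cost_history list (the cost value appended after each iteration of gradient descent) against the iteration number using matplotlib. The decaying curve here is synthetic and only stands in for whatever cost values your own training loop records.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical cost history; in practice you would append the cost value
# to this list after each gradient descent update.
cost_history = [10.0 / (1 + 0.3 * i) + np.random.rand() * 0.05 for i in range(100)]

plt.plot(range(len(cost_history)), cost_history)
plt.xlabel("Iteration")            # number of iterations on the x-axis
plt.ylabel("Cost function value")  # cost on the y-axis
plt.title("Gradient descent learning curve")
plt.show()
```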

Gradient descent has converged when it can no longer reduce the cost function and the cost stays more or less at the same level. The number of iterations required for gradient descent to converge can vary significantly: it might take ten iterations, 10,000, or even ten million, which makes the number of iterations until convergence difficult to predict ahead of time.

There are various methods that can automatically tell you whether gradient descent has converged, but they require you to choose a convergence threshold in advance, which is likewise difficult to predict. As a result, simple plots are the preferred way to check for convergence.
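A minimal sketch of such an automatic check is shown below. The helper name has_converged and the default threshold of 1e-4 are arbitrary, illustrative choices, which is exactly the difficulty mentioned above.

```python
def has_converged(cost_history, threshold=1e-4):
    # Declare convergence when the latest improvement in the cost falls
    # below a chosen threshold. Both the name and the default threshold
    # are illustrative, not prescriptive.
    if len(cost_history) < 2:
        return False
    return abs(cost_history[-2] - cost_history[-1]) < threshold

# Example: a decreasing cost that has flattened out
costs = [1.0, 0.5, 0.30, 0.29995]
print(has_converged(costs))  # True with the default threshold
```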

The advantage of using graphs to monitor gradient descent is that they make it easy to see when something is wrong, for example when the cost function is growing. A learning rate that is too high is the most common cause of a growing cost function when applying gradient descent. If the plot shows the learning curve bouncing up and down without ever settling at a lower value, reduce the learning rate. When first applying gradient descent to a problem, simply experiment with different low learning rates to see which one works best.
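The sketch below illustrates that kind of experiment on a toy quadratic objective: the same gradient descent loop is run with a few candidate learning rates and the resulting learning curves are plotted together. The objective, the candidate rates, and the function name run_gradient_descent are all illustrative.

```python
import matplotlib.pyplot as plt

def run_gradient_descent(learning_rate, n_iters=100):
    # Toy quadratic objective f(w) = w^2 with gradient 2w, for illustration.
    w, costs = 5.0, []
    for _ in range(n_iters):
        costs.append(w ** 2)
        w -= learning_rate * 2 * w
    return costs

# Compare a few low learning rates on the same plot
for lr in [0.001, 0.01, 0.1]:
    plt.plot(run_gradient_descent(lr), label=f"learning rate = {lr}")

plt.xlabel("Iteration")
plt.ylabel("Cost")
plt.legend()
plt.show()
```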

Three types of gradient descent

Gradient descent is divided into three categories, which differ mainly in the amount of data they use for each update:

Batch gradient descent (Vanilla gradient descent)

Vanilla gradient descent evaluates the error for each example in the training dataset, but the model is updated only after all of the training examples have been evaluated. One full pass of this procedure over the training set is called a training epoch.

Batch gradient descent has a number of advantages, including being computationally efficient and producing a stable error gradient and stable convergence. One downside is that the stable error gradient can sometimes converge to a solution that is not the best the model can achieve. It also requires the entire training dataset to be held in memory and available to the algorithm.
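For concreteness, here is a minimal sketch of batch gradient descent applied to a linear-regression model with a mean-squared-error cost. The synthetic data, the learning rate, and the number of epochs are toy choices for illustration; the key point is that the gradient is computed over the entire training set before each single update.

```python
import numpy as np

# Toy data: 200 examples, 3 features, generated from known weights
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=200)

w = np.zeros(3)
learning_rate = 0.1

for epoch in range(100):
    predictions = X @ w
    error = predictions - y
    # Gradient of the MSE cost, computed over the ENTIRE training set
    gradient = (2 / len(y)) * X.T @ error
    w -= learning_rate * gradient             # one update per epoch

print(w)  # should end up close to true_w
```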

Stochastic gradient descent

Stochastic gradient descent (SGD), by contrast, evaluates the error and updates the parameters for each training example in the dataset, one example at a time. Depending on the problem, this can make SGD faster than batch gradient descent. One advantage is that the frequent updates give us a detailed picture of the rate of improvement.

The frequent updates, on the other hand, are more computationally expensive overall than the single update of batch gradient descent. Furthermore, their frequency can produce noisy gradients, which may cause the error to jump around rather than decrease steadily.
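The sketch below shows the same toy linear-regression setup trained with stochastic gradient descent: the weights are now updated after every individual training example, which is what produces both the frequent progress signal and the noisier gradients described above.

```python
import numpy as np

# Same toy setup as the batch example above
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=200)

w = np.zeros(3)
learning_rate = 0.01

for epoch in range(20):
    for i in rng.permutation(len(y)):      # shuffle the examples each epoch
        error = X[i] @ w - y[i]
        gradient = 2 * error * X[i]        # gradient from a single example
        w -= learning_rate * gradient      # one update per example

print(w)  # noisier path, but should also land near true_w
```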

Mini-batch gradient descent

Because it combines the ideas of SGD and batch gradient descent, mini-batch gradient descent is the preferred technique. It simply splits the training dataset into small batches and performs an update for each of those batches. This strikes a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent.

Mini-batch sizes typically range from 50 to 256, but, as with any other machine learning approach, there is no hard and fast rule, because the best size varies by application. Mini-batch gradient descent is the most common form of gradient descent used in deep learning and the go-to technique when training a neural network.
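Finally, here is the same toy setup trained with mini-batch gradient descent: the training set is shuffled, split into small batches, and the weights are updated once per batch. The batch size of 64 and the other hyperparameters are arbitrary but typical illustrative values.

```python
import numpy as np

# Same toy setup as the previous examples
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=200)

w = np.zeros(3)
learning_rate = 0.05
batch_size = 64

for epoch in range(50):
    order = rng.permutation(len(y))                    # shuffle each epoch
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        error = X[idx] @ w - y[idx]
        gradient = (2 / len(idx)) * X[idx].T @ error   # gradient over one batch
        w -= learning_rate * gradient                  # one update per batch

print(w)  # should again be close to true_w
```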