The gradient measures how much the error changes in response to a change in each weight. A gradient is also known as the slope of a function: the higher the gradient, the steeper the slope and the faster a model can learn. If the slope is zero, however, the model stops learning. In mathematics, a gradient is the vector of a function's partial derivatives with respect to its inputs.

• In machine learning, a gradient is the derivative of a function with several input variables. Known in mathematics as the slope of a function, the gradient simply quantifies the change in all weights in relation to the change in error.
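The idea of a gradient as partial derivatives can be made concrete with a small numerical sketch. The two-variable function `f` and the step size `h` below are illustrative choices, not anything prescribed by the text:

```python
def f(w1, w2):
    # Example "error" function of two weights; its analytic gradient
    # is (2 * w1, 6 * w2)
    return w1 ** 2 + 3 * w2 ** 2

def numerical_gradient(f, w1, w2, h=1e-6):
    # Each partial derivative measures the change in f with respect to
    # one input while the other is held fixed (central differences)
    df_dw1 = (f(w1 + h, w2) - f(w1 - h, w2)) / (2 * h)
    df_dw2 = (f(w1, w2 + h) - f(w1, w2 - h)) / (2 * h)
    return df_dw1, df_dw2

g1, g2 = numerical_gradient(f, 1.0, 2.0)  # analytically (2.0, 12.0)
```

A steeper slope means a larger partial derivative, which in gradient descent translates into a larger weight update and faster learning.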

## Gradient descent and cost function

Plotting the cost function as the optimization proceeds is an excellent way to verify that gradient descent is working properly. Put the number of iterations on the x-axis and the value of the cost function on the y-axis. This lets you see the value of the cost function after each round of gradient descent and immediately gauge how suitable your learning rate is. You can simply try different values and plot them all together.

• The cost function should decrease with each iteration if gradient descent is running properly.

Gradient descent has converged when it can no longer lower the cost function and the cost remains more or less at the same level. The number of iterations required for convergence can vary significantly: it might take ten iterations, 10,000, or even ten million, which makes the number of iterations until convergence difficult to predict ahead of time.
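The ideas above can be sketched in a few lines: run gradient descent on a simple cost function, record the cost at every iteration, and stop when the cost no longer decreases. The quadratic cost, learning rate, and convergence threshold here are illustrative assumptions:

```python
def cost(w):
    return (w - 3.0) ** 2        # toy cost with its minimum at w = 3

def grad(w):
    return 2 * (w - 3.0)         # derivative of the cost

w = 0.0
learning_rate = 0.1
history = [cost(w)]              # cost value after each iteration
for i in range(10_000):
    w -= learning_rate * grad(w)
    history.append(cost(w))
    # Declare convergence once the cost stops decreasing meaningfully
    if history[-2] - history[-1] < 1e-12:
        break
```

The `history` list is exactly what you would plot: iterations on the x-axis, cost on the y-axis. On a well-behaved problem like this one the curve decreases on every iteration and then flattens out at convergence.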

There are methods that can automatically tell you whether gradient descent has converged, but they require you to first choose a convergence threshold, which is likewise hard to estimate ahead of time. For that reason, simple plots are the preferred way to check convergence.

The advantage of monitoring gradient descent with a graph is that problems are easy to spot, such as a cost function that is increasing. A learning rate that is too high is the most common cause of an increasing cost function. If the plot shows the learning curve bouncing up and down without settling at a lower value, reduce the learning rate. When first applying gradient descent to a problem, simply try several low learning rates and see which one works best.
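Experimenting with several learning rates can be sketched by running the same descent loop with each candidate rate and comparing the recorded cost curves. The quadratic cost and the particular rates below are illustrative assumptions:

```python
def run(learning_rate, steps=50):
    # Gradient descent on cost(w) = (w - 3)^2, recording the cost curve
    w = 0.0
    costs = []
    for _ in range(steps):
        costs.append((w - 3.0) ** 2)
        w -= learning_rate * 2 * (w - 3.0)
    return costs

# One cost curve per candidate rate; in practice you would plot them together
curves = {lr: run(lr) for lr in (0.001, 0.1, 1.2)}
# lr = 0.001: decreases, but very slowly
# lr = 0.1:   decreases quickly and settles near zero
# lr = 1.2:   too high -- the cost grows instead of shrinking
```

On this toy cost, any rate above 1.0 overshoots the minimum by more than it corrects, which is exactly the growing, oscillating curve described above.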

## Three types of gradient descent

Gradient descent is divided into three categories, which differ mostly in the quantity of data they use:

Batch gradient descent, often called vanilla gradient descent, evaluates the error for each example in the training dataset, but the model is updated only after all of the training examples have been evaluated. One full pass of this procedure is called a training epoch, since it resembles a cycle.

Batch gradient descent has a number of advantages, including being computationally efficient and producing a stable error gradient and stable convergence. The stable error gradient can, however, sometimes cause convergence to a state that isn't the best the model can achieve, which is one of its downsides. It also requires the entire training dataset to be in memory and accessible to the algorithm.
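The defining feature of batch gradient descent is that the gradient is accumulated over the whole training set before a single weight update is made. Below is a minimal sketch for a one-weight linear model `y = w * x`; the dataset, learning rate, and epoch count are illustrative assumptions:

```python
# Toy training set of (x, y) pairs generated by y = 2 * x, so the
# best-fitting weight is w = 2
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w = 0.0
learning_rate = 0.05
for epoch in range(200):            # one epoch = one pass over all examples
    grad_sum = 0.0
    for x, y in data:               # evaluate the error on every example first
        grad_sum += 2 * (w * x - y) * x   # gradient of squared error
    # Only now, after the full pass, apply a single update (the batch step)
    w -= learning_rate * grad_sum / len(data)
```

Because the whole dataset contributes to each update, the gradient is stable from epoch to epoch, but every example must be available in memory, matching the trade-offs described above.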