An activation function determines whether or not a neuron is activated: it applies a simple mathematical operation to decide whether the neuron's input is relevant to the network's prediction.
In our previous article, we already discussed what an activation function is and why it matters. Here the focus is on the different types of activation functions used in neural networks and deep learning: the most popular functions and their pros and cons.
There are three types of activation functions:
- Binary – a threshold value determines whether the neuron is triggered in a binary step function. The input to the activation function is compared against the threshold: if it is higher, the neuron is activated; if it is lower, the neuron is deactivated and its output is not passed on to the next hidden layer. This has two major drawbacks:
  - It cannot produce multi-valued outputs, so it cannot be used for multi-class classification problems, for example.
  - The step function's gradient is zero everywhere (and undefined at the threshold), so backpropagation cannot use it to update the weights.
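As a minimal sketch of the binary step function described above (the threshold of 0 and the function name are illustrative choices):

```python
import numpy as np

def binary_step(x, threshold=0.0):
    # Output 1 when the input reaches the threshold, 0 otherwise.
    return np.where(x >= threshold, 1, 0)

print(binary_step(np.array([-2.0, -0.5, 0.0, 1.5])))  # [0 0 1 1]
```

Note that the output is only ever 0 or 1, which is why the function cannot express multi-class outputs.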
- Linear – commonly known as the identity function, an activation that is proportional to its input. Its drawbacks:
  - Backpropagation is not feasible, since the function's derivative is a constant that has no relationship to the input x.
  - If a linear activation function is applied, all layers of the neural network collapse into one. The last layer of a neural network is always a linear function of the first layer, regardless of how many layers there are, so a linear activation function effectively reduces the network to a single layer.
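The collapse of linear layers can be verified numerically: composing two weight matrices with a linear (identity) activation between them is the same mapping as a single layer with the product matrix. A small NumPy sketch (the shapes and random seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with a linear (identity) activation between them: y = W2 @ (W1 @ x)
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((2, 4))
x = rng.standard_normal(3)

two_layer = W2 @ (W1 @ x)

# The same mapping expressed as a single layer with W = W2 @ W1
one_layer = (W2 @ W1) @ x

print(np.allclose(two_layer, one_layer))  # True
```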
- Non-linear – non-linear activation functions overcome the drawbacks of linear activation functions:
  - Backpropagation is feasible, since the derivative now depends on the input, so it is possible to go back and determine which weights in the input neurons would yield a better prediction.
  - They allow many layers of neurons to be stacked, since the output is a non-linear combination of inputs passed through several layers. This lets a neural network approximate complex, non-linear functions rather than a single linear mapping.
Let's take a closer look at three popular non-linear activation functions and their characteristics.
Rectified Linear Unit (ReLU)
ReLU is currently the most widely used activation function; nearly all deep learning models and convolutional neural networks employ it.
Both the function and its derivative are monotonic.
The good sides of ReLU
In practice, ReLU speeds up the convergence of gradient descent toward a minimum of the loss function compared to other activation functions, thanks to its linear, non-saturating form.
While other activation functions (tanh and sigmoid) require computationally expensive operations such as exponentials, ReLU can be implemented by simply thresholding a vector of values at zero.
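The thresholding described above is a one-liner in NumPy; the derivative is equally cheap (the function names here are illustrative):

```python
import numpy as np

def relu(x):
    # Thresholding at zero -- no exponentials needed.
    return np.maximum(0.0, x)

def relu_derivative(x):
    # Gradient is 1 for positive inputs and 0 otherwise (non-saturating for x > 0).
    return (x > 0).astype(float)

x = np.array([-3.0, -0.1, 0.0, 2.5])
print(relu(x))             # [0.  0.  0.  2.5]
print(relu_derivative(x))  # [0. 0. 0. 1.]
```

The zero derivative for negative inputs is exactly what causes the "dying ReLU" problem discussed next.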
The bad side of ReLU
However, ReLU has its own issues. During training, network neurons can be sensitive and may "die" if their input values fall below zero. What does this mean? It is possible (though not inevitable) that during weight updates, the weights are adjusted in such a way that the inputs to particular neurons are always negative. The outputs of these neurons are then always 0 and contribute nothing to the training process, and the gradient passing through these ReLU neurons is zero from that point on. The neurons have, as the expression goes, died.
For example, it is not uncommon to find that 20 to 50 percent of the neurons in a ReLU-activated network are no longer functional: these neurons never fire for any input in the training data set.
Sigmoid
A few years ago, sigmoid was what ReLU is today: the most common activation function you would come across. The sigmoid function maps incoming values into the range between 0 and 1.
In practice, however, the sigmoid nonlinearity has fallen out of favor and is now rarely used, due to two significant disadvantages.
When using the sigmoid, it is important to note that neuron activations saturate at either end of the range, near zero or one.
In these saturated regions, the sigmoid's derivative shrinks to an extremely small value. Because of this near-zero derivative, the gradient of the loss function becomes very small, which prevents the weights from being updated and stalls the overall learning process.
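The vanishing derivative can be seen directly from the identity sigma'(x) = sigma(x) * (1 - sigma(x)): it peaks at 0.25 at x = 0 and collapses in the saturated tails. A quick NumPy sketch (function names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x)); maximum value is 0.25 at x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (0.0, 5.0, 10.0):
    # The derivative shrinks rapidly as |x| grows into the saturated region.
    print(x, sigmoid_derivative(x))
```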
The other disadvantage of sigmoid activation is that its outputs are not zero-centered. This makes neural network training more difficult and less stable.
Tanh
The Tanh function is very similar to the sigmoid/logistic activation function and even shares the same S-shape, the difference being its output range of -1 to 1. The larger (more positive) the input, the closer the output is to 1.0; the smaller (more negative) the input, the closer the output is to -1.0.
The following are some of the benefits of utilizing this activation function:
- Because the tanh activation function's output is zero-centered, we can easily interpret output values as strongly negative, neutral, or strongly positive.
- Because its values range from -1 to 1, it is commonly employed in the hidden layers of neural networks: the mean of a hidden layer's activations is then 0 or very close to it. This helps center the data and makes learning in the following layer much simpler.
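The zero-centering property is easy to check: tanh is an odd function, so symmetric inputs produce activations whose mean is (numerically) zero. A small NumPy sketch:

```python
import numpy as np

# tanh maps inputs into (-1, 1) and is symmetric around zero.
x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
y = np.tanh(x)
print(y)
print(y.mean())  # close to zero for symmetric inputs
```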
The right activation function for you
You must match your output layer's activation function to the type of prediction problem you are solving, and specifically to the type of the predicted variable.
As a general guideline, start with the ReLU activation function and move on to other activation functions only if ReLU does not give you the best results.