The higher performance of convolutional neural networks with audio, picture, or voice, or inputs sets them apart from conventional neural networks. They are divided into three sorts of layers:

  • Convolutional layer
  • Fully-connected layer
  • Pooling layer

A convolutional network’s initial layer is the convolutional layer. While further convolutional layers or pooling layers can be added after convolutional layers, the fully-connected layer is the last layer. The CNN becomes more complicated with each layer, detecting larger areas of the picture. Earlier layers concentrate on basic elements like colors and borders. As the visual data travels through the CNN layers, it begins to distinguish bigger components or features of the item, eventually identifying the target object.

Convolutional layer

The convolutional layer is the most important component of a CNN since it is where the majority of the processing takes place. It requires input data, a filter, and a feature map, among other things. Let’s pretend the input is a color picture, which is made up of a 3D matrix of pixels. This implies the input will have three dimensions: height, width, and depth, which match the RGB color space of a picture.

A feature detector, also known as a kernel or a filter, will traverse over the image’s receptive fields, checking for the presence of the feature. Convolution is the term for this procedure.

The feature detector is a 2-D weighted array that represents a portion of the picture. The filter size, which can vary in size, is usually a 3×3 matrix, which also affects the size of the receptive field. After that, the filter is applied to a portion of the picture, and a dot product between the input pixels and the filter is calculated. After that, the dot product is loaded into an output array.

The filter then moves by a stride, and the procedure is repeated until the kernel has swept across the whole picture. A feature map, activation map, or convolved feature is the ultimate result of a sequence of dot products from the input and the filter.

Each pixel value in the input picture does not have to be connected to each output value in the feature map. It simply has to be connected to the receptive field, which is where the filter is applied. Convolutional and pooling layers are typically referred to as “partially connected” layers since the output array does not need to map directly to each input value. This property, however, can also be referred to as a local connection.

Backpropagation and gradient descent is used to change some parameters, such as weight values. However, there are three hyperparameters that determine the output volume size that must be specified before the neural network can be trained. These are some of them:

  • The depth of the output is affected by the number of filters used. Three unique filters, for example, would result in three different feature maps, resulting in a depth of three.
  • The kernel’s stride is the number of pixels it traverses across the input matrix. Although stride values of two or more are uncommon, a bigger stride results in a lesser output.
  • When the filters don’t fit the input image, zero-padding is commonly utilized. All members outside of the input matrix are set to zero, resulting in a bigger or equal-sized output. Padding comes in three varieties:
  1. No padding is also known as valid padding. If the dimensions do not align, the final convolution is discarded.
  2. Padding is the same: This padding guarantees that the output layer matches the input layer in size.
  3. Padding is complete: By padding the input’s border with zeros, this sort of padding enhances the output’s size.

A CNN performs a ReLU modification to the feature map after each convolution operation, imparting nonlinearity to the model.

Following the initial convolution layer, another convolution layer can be added. When this happens, the CNN’s structure might become hierarchical, since subsequent layers can perceive pixels within earlier layers’ receptive fields.

Finally, the convolutional layer turns the picture into numerical values, which the neural network can analyze and extract meaningful patterns from.