Deep neural networks have proven effective across a wide variety of tasks. Parallel processors such as graphics processing units (GPUs) have played a key role in making deep neural networks practical. Training and running deep neural networks involve a large volume of computation, and that computation lends itself to efficient parallel implementation. The efficiency of these implementations lets researchers explore networks of substantially greater capacity and train them on larger datasets. This has produced significant improvements in tasks such as speech recognition and image classification, to name a few. Speech recognition in particular has benefited from parallel GPU implementations of deep neural networks.

Convolutional Neural Networks (CNNs) are a popular and effective class of deep networks. They are computed using dense kernels that differ from the routines found in traditional dense linear algebra libraries.

As a result, contemporary deep learning frameworks include a set of specialized kernels for basic operations such as tensor convolutions, activation functions, and pooling. When training a CNN, these operations account for most of the computation and therefore most of the execution time. The deep learning community has produced well-optimized implementations of these kernels, but as the underlying architectures evolve, the kernels must be re-optimized, which is a considerable investment.
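To make "tensor convolution" concrete, here is a minimal, unoptimized sketch of the forward pass as a plain loop nest. It is not how cuDNN or any framework implements the operation. The function name `conv2d_forward`, the nested-list tensors, and the choice of cross-correlation without padding are all assumptions made for illustration.

```python
def conv2d_forward(x, w, stride=1):
    """Naive direct convolution (cross-correlation form, no padding).

    x: input indexed as x[n][c][h][w]   (batch, channel, row, column)
    w: filters indexed as w[k][c][r][s] (output channel, channel, row, column)
    Returns y indexed as y[n][k][p][q].
    """
    N, C, H, W = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    K, R, S = len(w), len(w[0][0]), len(w[0][0][0])
    P = (H - R) // stride + 1  # output height
    Q = (W - S) // stride + 1  # output width
    y = [[[[0.0] * Q for _ in range(P)] for _ in range(K)] for _ in range(N)]
    for n in range(N):
        for k in range(K):
            for p in range(P):
                for q in range(Q):
                    acc = 0.0
                    # Reduce over input channels and the filter window.
                    for c in range(C):
                        for r in range(R):
                            for s in range(S):
                                acc += x[n][c][p * stride + r][q * stride + s] * w[k][c][r][s]
                    y[n][k][p][q] = acc
    return y


# One 3x3 single-channel image and one 2x2 filter of ones.
image = [[[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]]]
ones = [[[[1.0, 1.0], [1.0, 1.0]]]]
result = conv2d_forward(image, ones)
```

The seven nested loops show why this operation dominates training time, and why the order in which data is loaded and reused matters so much on real hardware.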

Optimizing these kernels to achieve acceptable performance requires a thorough understanding of the underlying processor architecture, along with careful scheduling of data movement, placement of data in on-chip memory, register blocking, and other optimizations.

Goals of cuDNN

One of the main aims of cuDNN is to let the whole community of neural network frameworks benefit equally from its APIs. As a result, users of cuDNN are not required to adopt any particular software framework, or even a particular data layout.

Rather than a layer abstraction, cuDNN provides lower-level computational primitives. This eases integration with existing deep learning frameworks, each of which has its own abstractions.

Most of the API consists of functions that perform primitive operations on data stored in user-controlled buffers. Because the API is this low-level, the library integrates easily with other frameworks.

cuDNN offers forward and backward propagation variants of all its routines, in both single-precision and double-precision floating-point arithmetic. These routines include convolution, pooling, and activation functions.

The library supports variable data layouts and strides, as well as indexing of sub-sections of input images. It also comes with a set of auxiliary tensor transformation routines that make it easy to manipulate 4D tensors.
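The idea behind variable layouts and strides can be sketched with plain index arithmetic. The helpers below, `strides_for` and `offset`, are hypothetical names for illustration and are not part of the cuDNN API. They assume a densely packed 4D tensor whose dimensions are labeled N (image), C (feature map), H (height), and W (width).

```python
def strides_for(dims, order):
    """Per-dimension strides (in elements) for a densely packed tensor.

    dims:  sizes by name, e.g. {"N": 2, "C": 3, "H": 4, "W": 5}
    order: dimension names from outermost to innermost, e.g. "NCHW"
    """
    strides = {}
    step = 1
    for name in reversed(order):
        strides[name] = step
        step *= dims[name]
    return strides


def offset(idx, strides, base=0):
    """Linear offset of element `idx` in the flat buffer, starting at `base`."""
    return base + sum(idx[name] * strides[name] for name in idx)


dims = {"N": 2, "C": 3, "H": 4, "W": 5}
nchw = strides_for(dims, "NCHW")  # width varies fastest
nhwc = strides_for(dims, "NHWC")  # feature map varies fastest

element = {"N": 1, "C": 0, "H": 2, "W": 3}
print(offset(element, nchw), offset(element, nhwc))

# A sub-section of each image starting at row 1, column 1 reuses the same
# strides and only shifts the base offset into the buffer.
window_base = offset({"N": 0, "C": 0, "H": 1, "W": 1}, nchw)
```

Because a layout is fully described by a base offset and one stride per dimension, the same buffer can be read as NCHW, as NHWC, or as a cropped window without copying any data.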

Other widely used deep learning functions are also available in cuDNN.

  • Three commonly used neuron activation functions: sigmoid, rectified linear, and hyperbolic tangent.
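As a scalar illustration of these three functions and their forward and backward variants, consider the sketch below. The function names are hypothetical and do not mirror cuDNN's interface. Each backward function takes the incoming gradient `dy` and returns the gradient with respect to the input `x`.

```python
import math


def sigmoid_forward(x):
    # Branch on the sign so math.exp never receives a large positive argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sigmoid_backward(x, dy):
    y = sigmoid_forward(x)
    return dy * y * (1.0 - y)  # d/dx sigmoid(x) = y * (1 - y)


def relu_forward(x):
    return x if x > 0 else 0.0


def relu_backward(x, dy):
    return dy if x > 0 else 0.0  # gradient passes only where the input was positive


def tanh_forward(x):
    return math.tanh(x)


def tanh_backward(x, dy):
    y = math.tanh(x)
    return dy * (1.0 - y * y)  # d/dx tanh(x) = 1 - tanh(x)^2
```

A library applies the same rules elementwise across a whole tensor; the per-element arithmetic is what is shown here.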

cuDNN also provides a softmax routine that, by default, scales each element in a numerically stable way to avoid overflow in intermediate results. Softmax can be computed per image across the three dimensions of feature maps, height, and width, or per spatial location, per image, across the feature map dimension. The library also has average and maximum pooling operations, as well as a set of tensor transformation routines, such as those that add tensors, with optional broadcasting.
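The two softmax variants can be sketched for a single image stored as `x[c][h][w]`. The stabilization used here, subtracting the maximum before exponentiating, is one common technique and is an assumption, since the text does not say exactly how cuDNN scales the elements. The function names are hypothetical.

```python
import math


def stable_softmax(values):
    # Shifting by the maximum keeps every exponent <= 0, so exp cannot overflow.
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_per_image(x):
    """Normalize over all feature maps, rows, and columns of one image."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    flat = [x[c][h][w] for c in range(C) for h in range(H) for w in range(W)]
    out = stable_softmax(flat)
    return [[[out[(c * H + h) * W + w] for w in range(W)] for h in range(H)]
            for c in range(C)]


def softmax_per_location(x):
    """Normalize across feature maps, separately at each (row, column)."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    y = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for h in range(H):
        for w in range(W):
            column = stable_softmax([x[c][h][w] for c in range(C)])
            for c in range(C):
                y[c][h][w] = column[c]
    return y


# Inputs this large would overflow a naive exp(x) / sum(exp(x)).
x = [[[1000.0, 0.0]], [[1000.0, 0.0]]]  # C=2, H=1, W=2
per_image = softmax_per_image(x)
per_location = softmax_per_location(x)
```

In the per-image variant all values in the image sum to one; in the per-location variant the values across feature maps sum to one at every pixel.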

The purpose of supplying flexible, well-optimized versions of these commonly used functions is to reduce the amount of parallel code that deep learning frameworks need. With cuDNN and cuBLAS, it is possible to write programs that train standard convolutional neural networks without writing any parallel code.