Data is growing in volume, velocity, and diversity. Modern technologies such as machine learning, deep learning, data mining, and big data analytics make use of this precious resource. These approaches have become standard tools for challenging problems in video analytics, image processing, speech recognition, and natural language processing, yet memory constraints still exist.
Because many deep network models are resource-intensive, researchers grapple with the restricted memory bandwidth of the devices they use. Training is the most compute- and resource-intensive phase of deep learning.
GPUs are commonly used to train these heavy machine learning or deep learning workloads because they provide several advantages over non-specialized hardware.
They are designed to parallelize training workloads and perform concurrent computing operations, which frees up the CPU for other tasks. The major reason for using GPUs, however, is their high memory bandwidth, and even that falls short for bigger networks.
Tensor processing throughput on top-of-the-line GPUs has risen 32-fold in the last five years, while total available memory has grown only 2.5-fold. As a result, strategies that minimize the memory footprint of DNNs are essential.
To address memory bottlenecks, either the network architecture must be changed or training must be scaled out to multiple nodes. This study provides a detailed evaluation of the most recent software-based techniques for memory reduction in neural networks.
Memory footprint reduction techniques for CNN
The two basic strategies for decreasing memory footprints are efficient memory management and lowering memory demand. In this part, we’ll look at some of the most cutting-edge approaches used in CNNs and RNNs, as well as their performance gains.
Convolutional Neural Networks (CNNs)
CNNs and their variants are widely recognized as some of the most effective deep learning models. They are primarily employed in computer vision and image processing tasks, where they have produced several state-of-the-art results. The most common layer types in a CNN are convolutional (CONV) layers, subsampling or so-called pooling (POOL) layers, fully-connected (FC) layers, and activation (ACTV) layers.
- Mixed memory CNN – In this architecture, memory is virtualized while the CNN executes its calculations. Programmers generally spend most of their coding effort on optimizing memory; this approach handles that optimization for them, so developers can focus on the network architecture. Memory virtualization lies at the heart of the design. It is accomplished by moving data between host and device memory, which would seem to be a time-consuming process. The model, however, selects which operations to move so that performance is not affected. Compared with a regular CNN, this approach reduces memory use by 98 percent.
- CPU offloading – Once hidden activations are computed during the forward pass, they are sent to the CPU, freeing GPU RAM for the computations of the following layers. The activations are then returned to the GPU when the backward pass needs them. The significant challenge with this strategy is that data transfers and computations must overlap efficiently to reduce the wall-time overhead caused by the extra data transmissions.
- CNN accelerator – Memory requirements and memory bandwidth can vary by orders of magnitude between different networks, and even between different layers within one network. As a result, developing fast and efficient hardware for all CNN applications is challenging. The CNN accelerator uses both on-chip and off-chip memory bandwidth, and the on-chip memory helps reduce costly off-chip memory accesses. The main constraint addressed is determining how much on-chip memory is required to ensure that each activation or weight is accessed off-chip only once per layer.
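The CPU offloading idea from the list above can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions: plain dictionaries stand in for GPU and CPU memory, no real transfers or overlap with computation are modeled, and the function names and activation sizes are made up for illustration.

```python
# Toy simulation of activation offloading. Plain dicts stand in for GPU
# ("device") and CPU ("host") memory; no real transfers take place.

def forward_with_offload(activation_sizes, keep_on_device=1):
    """Run a simulated forward pass and offload older activations.

    activation_sizes: size of each layer's output, in arbitrary units.
    keep_on_device: number of most recent activations left on the device.
    Returns (peak_device_usage, host, device).
    """
    device, host = {}, {}
    peak = 0
    for layer, size in enumerate(activation_sizes):
        device[layer] = size  # the layer's output lands on the device
        peak = max(peak, sum(device.values()))
        # Offload the oldest activations beyond the ones we keep.
        surplus = max(0, len(device) - keep_on_device)
        for old in sorted(device)[:surplus]:
            host[old] = device.pop(old)
    return peak, host, device


def backward_with_fetch(host, device, num_layers):
    """Walk the layers in reverse, fetching offloaded activations on demand."""
    visited = []
    for layer in reversed(range(num_layers)):
        if layer not in device:
            device[layer] = host.pop(layer)  # copy back from the host
        visited.append(layer)
        del device[layer]  # freed once this layer's gradients are done
    return visited


sizes = [4, 8, 2, 6]  # made-up activation sizes
peak_offload, host, device = forward_with_offload(sizes, keep_on_device=1)
peak_baseline, _, _ = forward_with_offload(sizes, keep_on_device=len(sizes))
print(peak_offload, peak_baseline)
```

With these made-up sizes, the peak device usage drops from 20 units (everything kept on the device) to 12 units (only the newest activation kept). A real implementation would also have to hide the transfer time behind computation, which this sketch does not attempt.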
To get the most out of the hardware, big CNNs require a lot of resources, such as specialized GPUs and highly optimized implementations. Because GPU memory is a key constraint, the size of both the inputs and the model architecture is limited during CNN training. The activations of each layer are stored in GPU memory and used later to compute weight gradients. As a result, lower-layer activations sit idle in GPU memory throughout the forward and backward computations over the higher layers.
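To get a feel for how much memory stored activations can occupy, the following sketch estimates the per-layer activation footprint of a small network. The layer configuration, batch size, and the 4-byte (float32) value size are assumptions chosen for illustration, not figures from any specific model, and weights, gradients, and workspace buffers are ignored.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


def activation_footprint(batch, in_shape, layers, bytes_per_value=4):
    """Return (layer name, activation size in bytes) for a forward pass.

    in_shape: (channels, height, width) of the input.
    layers: list of (name, out_channels, kernel, stride, padding);
            out_channels=None keeps the channel count (pooling).
    """
    channels, height, width = in_shape
    sizes = []
    for name, out_channels, kernel, stride, padding in layers:
        if out_channels is not None:
            channels = out_channels
        height = conv_out(height, kernel, stride, padding)
        width = conv_out(width, kernel, stride, padding)
        sizes.append((name, batch * channels * height * width * bytes_per_value))
    return sizes


# Hypothetical toy network, chosen only for illustration.
toy_layers = [
    ("conv1", 64, 3, 1, 1),
    ("pool1", None, 2, 2, 0),
    ("conv2", 128, 3, 1, 1),
    ("pool2", None, 2, 2, 0),
]
per_layer = activation_footprint(batch=32, in_shape=(3, 224, 224), layers=toy_layers)
total_mib = sum(size for _, size in per_layer) / 2**20

for name, size in per_layer:
    print(f"{name}: {size / 2**20:.0f} MiB")
print(f"total: {total_mib:.0f} MiB")
```

Even for this four-layer toy network, the activations of a single batch of 32 images add up to 735 MiB, and the first convolution alone accounts for more than half of that. This is why the early, lower-layer activations are the natural candidates for memory-saving techniques.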