TPU

What is a Tensor Processing Unit?

A tensor processing unit (TPU) is a special-purpose machine learning accelerator. It is an application-specific integrated circuit (ASIC) developed by Google to speed up neural network workloads, originally for its TensorFlow framework. TPUs accelerate particular machine learning tasks by placing many processing elements, essentially small DSPs with local memory, on a network and letting them communicate and pass data directly between one another.

TensorFlow is an open-source machine learning framework that can be used for tasks such as image classification, object detection, language modeling, and speech recognition.
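As a quick illustration of the kind of TensorFlow code a TPU ultimately accelerates, here is a minimal Keras image classifier. The dataset and layer sizes are illustrative choices, not anything prescribed by this article.

```python
import tensorflow as tf

# Load a small, standard image-classification dataset (28x28 grayscale digits).
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train / 255.0  # scale pixel values to [0, 1]

# A deliberately tiny model; real TPU workloads are far larger.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),  # one logit per digit class
])

model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

model.fit(x_train, y_train, epochs=1, batch_size=64)
```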

TPUs have efficient model libraries, on-chip high-bandwidth memory (HBM), and scalar, vector, and matrix multiply units (MXUs) in each core. The MXUs process data at a rate of 16,384 multiply-accumulate operations per cycle. Multiplications inside the MXU use bfloat16, a truncated form of 32-bit floating point, while results are accumulated at higher precision. Each core handles user computations (XLA operations) independently. Google offers TPUs on its servers as Cloud TPUs.
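The effect of bfloat16 inputs can be sketched in ordinary TensorFlow on a CPU. This is a simplified simulation of reduced-precision multiplication, not the actual MXU datapath, and it assumes a TensorFlow build with bfloat16 kernels.

```python
import tensorflow as tf

# Small float32 matrices standing in for activations and weights.
a = tf.random.normal([8, 8], dtype=tf.float32)
b = tf.random.normal([8, 8], dtype=tf.float32)

# Multiply in bfloat16 (fewer mantissa bits, same exponent range as float32) ...
product_bf16 = tf.matmul(tf.cast(a, tf.bfloat16), tf.cast(b, tf.bfloat16))

# ... then compare against a full float32 matmul to see the precision loss.
product_f32 = tf.matmul(a, b)
error = tf.reduce_max(tf.abs(tf.cast(product_bf16, tf.float32) - product_f32))
print("max abs difference vs. float32:", error.numpy())
```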

Parts of TPU

  • Tensors are matrices or multi-dimensional arrays. They are the fundamental units for storing data in row-and-column form, such as the weights of a node in a neural network, and they support basic math operations such as addition, element-wise multiplication, and matrix multiplication (see the short example after this list).
  • Floating-point operations per second (FLOPS) are the unit used to measure how fast a computer performs arithmetic. Google TPUs use a proprietary floating-point format called the "Brain Floating Point Format," or "bfloat16" for short, and bfloat16 values are fed into the systolic arrays to speed up neural network training.
  • A systolic array is a grid of processing elements (PEs) that perform computations and pass their results directly to their neighbors. It consists of a large number of PEs arranged in an array, which provides a high degree of parallelism and makes the structure ideal for parallel processing.
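Here is a minimal sketch of the tensor operations listed above, written with TensorFlow; the shapes and values are arbitrary.

```python
import tensorflow as tf

# Two 2x2 tensors, e.g. a small weight matrix and an input batch.
w = tf.constant([[1.0, 2.0], [3.0, 4.0]])
x = tf.constant([[5.0, 6.0], [7.0, 8.0]])

print(tf.add(w, x))       # element-wise addition
print(tf.multiply(w, x))  # element-wise multiplication
print(tf.matmul(w, x))    # matrix multiplication, the operation MXUs accelerate
```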

How does it work?

The Tensor Processing Unit (TPU) is a custom ASIC designed exclusively for machine learning and tailored to TensorFlow. It can perform the huge numbers of multiplications and additions that neural networks require at high rates while consuming less power and occupying less physical space.

TPUs carry out three major tasks:

  • First, the parameters (weights) are loaded from memory into the matrix of multipliers and adders.
  • Next, the data is loaded from memory.
  • As each multiplication executes, its result is passed on to the next multiplier while a running summation (a dot product) is performed at the same time. The output is the sum of all the multiplications between the data and the parameters. A small simulation of this flow follows the list.
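To make that data flow concrete, here is a toy, weight-stationary sketch of the multiply-and-pass-along pattern in plain Python. It illustrates the idea only; it is not how a real TPU schedules its 128 × 128 array.

```python
# Toy weight-stationary systolic flow for a matrix-vector product y = W @ x.
# Each row of PEs holds one row of weights; activations stream through,
# and every PE multiplies, adds the incoming partial sum, and passes it on.

W = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]          # parameters loaded into the array (step 1)
x = [7.0, 8.0, 9.0]            # data loaded from memory (step 2)

y = []
for row in W:
    partial_sum = 0.0          # each output starts at zero
    for w, a in zip(row, x):
        partial_sum += w * a   # multiply, accumulate, pass to the next PE (step 3)
    y.append(partial_sum)

print("y =", y)                # expected: [50.0, 122.0]
```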

A typical Cloud TPU has two 128 × 128 systolic arrays on a single processor, for a total of 32,768 ALUs (arithmetic logic units) operating on 16-bit floating-point values. Thousands of multipliers and adders are wired directly to one another to form a massive physical matrix of operators, the systolic array architecture described above.
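A back-of-the-envelope check of those numbers, assuming an illustrative clock rate of 700 MHz (the actual frequency varies by TPU generation):

```python
arrays_per_chip = 2
alus_per_array = 128 * 128            # one multiply-accumulate unit per array cell
total_alus = arrays_per_chip * alus_per_array
print("total ALUs:", total_alus)      # 32,768, matching the figure above

clock_hz = 700e6                      # illustrative clock rate, an assumption
ops_per_mac = 2                       # one multiply plus one add
peak_flops = total_alus * ops_per_mac * clock_hz
print(f"rough peak throughput: {peak_flops / 1e12:.1f} TFLOPS")
```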

Because the TPU tolerates reduced computational precision, it needs fewer transistors per operation. As a result, a single chip can carry out more operations per second.

Because they were designed specifically for operations such as matrix multiplication and training acceleration, TPUs may not be well suited to other types of workloads.

Good and bad sides of TPU

TPUs have a lot of advantages in terms of enhancing computing efficiency and speed, including the following:

  • Improved performance on the linear algebra computations that dominate machine learning applications.
  • Reduced time to accuracy when training large, sophisticated neural network models: models that previously took weeks to train may now converge in hours on TPUs.
  • Workloads can scale across many machines using TPU servers; a minimal setup sketch follows this list.
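As one illustration, TensorFlow's distribution API can target a Cloud TPU so that a Keras model trains across all of its cores. The TPU name below is a placeholder, and the snippet assumes it runs in an environment (such as a Cloud TPU VM) that can reach the device.

```python
import tensorflow as tf

# "my-tpu" is a placeholder for the name or address of your Cloud TPU.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="my-tpu")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)

strategy = tf.distribute.TPUStrategy(resolver)

# Variables created inside the scope are placed and replicated on the TPU cores.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )
# model.fit(...) would then run the training steps on the TPU.
```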

It is crucial to understand that TPUs are built specifically for fast, heavy matrix multiplication. In workloads that are not dominated by matrix multiplication, Cloud TPUs are likely to be outperformed by other platforms, such as:

  • Linear algebra programs that require a lot of branching or are dominated by element-wise operations;
  • Workloads that make sparse memory accesses;
  • Workloads that require high-precision arithmetic;
  • Neural network workloads that rely on custom TensorFlow operations written in C++, especially custom operations in the body of the main training loop.

The combination of TensorFlow and TPUs has the potential to transform medicine, image processing, and machine learning itself. Machine learning becomes more competitive and more accessible to the general public when the time needed to train and run a model drops from weeks to hours.

Furthermore, offering TPUs as a cloud service lets customers begin developing their models without paying anything upfront. Researchers, professionals, small businesses, and even students can now easily start machine learning projects.