March 1st 2022 • 5 minute read

How to Speed Up Neural Network Training with Intel's Gaudi HPUs. A Tensorflow 2.0 Object Detection Example

You have heard of GPUs and TPUs. Here is what you need to know as an ML engineer to use HPUs (Habana Processing Units)!


Move over TPUs, say hello to HPUs - Habana Processing Units, from Habana Labs!

TLDR;

Habana is an Intel company that has created the Habana Gaudi platform (a hardware accelerator and its associated software stack), optimized to speed up neural network computation. Features of the Habana Gaudi accelerator (e.g., heterogeneous compute architecture, programmable tensor processing cores, software-managed HBM2 memory, ten integrated 100 Gb Ethernet RoCE ports) enable improved efficiency during model training, with the promise of `training more and spending less`. See results from their MLPerf Benchmarks.
Habana offers the SynapseAI software stack, which aims to provide a good developer experience optimized for Tensorflow and Pytorch. IMHO, the software stack is still in development. As of the time of writing (March 2022), your Tensorflow/Pytorch code will not be automatically accelerated as is. A few (sometimes non-trivial) changes are needed to run your code on a single Gaudi card (more in the text below), and even more work is needed to enable distributed training on multiple cards within the same machine and across multiple machines. However, the software stack is clearly under active development and I expect it to evolve to the point where migration is effortless.
Habana offers multiple paths to distributed training: Horovod and tf.distribute with an HPU strategy in Tensorflow, and torch.distributed in Pytorch. The tf.distribute approach in Tensorflow appears to offer a more straightforward developer experience.
Don't want to be bogged down installing devices, managing drivers, etc.? Habana devices are available as AWS EC2 DL1 instances, with docker images for each supported framework. This made it truly easy to get started in minutes!

What do GPUs, TPUs and HPUs have in common? They are all hardware accelerators: multicore processors that can be utilized (with special software) to speed up the execution of programs (in this case, the tensor computation needed for training neural networks). In many cases, accelerators can be chained together (with some small coordination overhead) so that programs are distributed across them in parallel, resulting in faster training.

Over the last few years, there has been intense research and development in this area with emerging standards and platforms which you might already be familiar with e.g., NVIDIA's GPU series and software, Google's TPUs. More recently, Intel has released the Gaudi platform architecture - and this post describes my experience trying it out.

What is the Habana Gaudi Processor?

Gaudi Architecture and Software Overview

The Habana® Gaudi® processor is designed to maximize training throughput and efficiency, while providing developers with optimized software and tools that scale to many workloads and systems. Habana Gaudi software was developed with the end-user in mind, providing versatility and ease of programming to address the unique needs of users’ proprietary models, while allowing for a simple and seamless transition of their existing models over to Gaudi. Gaudi is designed from the ground up for accelerating DL training workloads. Its heterogeneous architecture comprises a cluster of fully programmable Tensor Processing Cores (TPC) along with its associated development tools and libraries, and a configurable Matrix Math engine. The TPC core is a VLIW SIMD processor with instruction set and hardware tailored to serve training workloads efficiently. It is programmable, providing the user with maximum flexibility to innovate, coupled with many workload-oriented features, such as: GEMM operation acceleration, Tensor addressing, Latency hiding capabilities, Random number generation, and Advanced implementation of special functions.

The TPC core natively supports the following data types: FP32, BF16, INT32, INT16, INT8, UINT32, UINT16 and UINT8.

The Gaudi memory architecture includes on-die SRAM and local memories in each TPC. In addition, the chip package integrates four HBM devices, providing 32 GB of capacity and 1 TB/s bandwidth. The PCIe interface provides a host interface and supports both generation 3.0 and 4.0 modes.
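As a rough illustration of why the supported data types matter for the 32 GB of HBM, the sketch below estimates the memory taken by a model's weights alone in FP32 versus BF16. The parameter count is a hypothetical round number chosen for illustration, and the estimate ignores activations, gradients and optimizer state, so treat it as a lower bound and not as a sizing tool.

```python
# Bytes per element for the data types the TPC supports natively.
BYTES_PER_ELEMENT = {
    "FP32": 4, "BF16": 2,
    "INT32": 4, "INT16": 2, "INT8": 1,
    "UINT32": 4, "UINT16": 2, "UINT8": 1,
}

HBM_CAPACITY_BYTES = 32 * 10**9  # 32 GB of HBM per Gaudi, as stated above


def weight_bytes(num_parameters: int, dtype: str) -> int:
    """Memory needed to store the weights only (no activations or optimizer state)."""
    return num_parameters * BYTES_PER_ELEMENT[dtype]


if __name__ == "__main__":
    num_parameters = 25_000_000  # hypothetical model size, for illustration only
    for dtype in ("FP32", "BF16"):
        size = weight_bytes(num_parameters, dtype)
        share = 100 * size / HBM_CAPACITY_BYTES
        print(f"{dtype}: {size / 10**6:.0f} MB of weights ({share:.2f}% of HBM)")
```

Halving the element size halves the weight footprint, which is one reason BF16 support is attractive for training.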

Gaudi is the first DL training processor that has integrated RDMA over Converged Ethernet (RoCE v2) engines on-chip. With bi-directional throughput of up to 2 Tb/s (ten 100 Gb ports, counted in both directions), these engines play a critical role in the inter-processor communication needed during the training process. This native integration of RoCE allows customers to use the same scaling technology both inside the server and rack (scale-up) and across racks (scale-out). Gaudi processors can be connected directly to each other, or through any number of standard Ethernet switches.

The Habana Developer Experience. Is it Worth it?

As an ML engineer, you are probably interested in approaches that save you time and money while running ML experiments. At the same time, you want to spend the bulk of your time iterating on your model as opposed to debugging the tool/platform/framework.

Habana recognizes this need and has done a few things IMO that address this:

Standardized on docker containers for delivering SDK updates: this way, you avoid the usual issues with installing drivers, matching library versions, etc.
Partnered with cloud solution providers (currently AWS) to make hardware setup effortless.
Provide the SynapseAI SDK for efficient execution of neural network topologies on Gaudi hardware. The SynapseAI TensorFlow/PyTorch bridge identifies the subset of the framework’s computation graph that can be accelerated by Gaudi. For performance optimization, the compilation recipe is cached for future use. Operators that are not supported by Gaudi are executed on the CPU. Habana Gaudi also provides templates for developing custom kernels which can then be used in custom ops within your favorite framework.
Provide a growing list of reference models and training implementation code to illustrate best practices
Provide multiple methods to achieve distributed training.
- TensorFlow: Gaudi scaling with data parallelism in the TensorFlow framework is achieved using two distinct methods: Habana Horovod, and HPUStrategy integrated with the tf.distribute API.
- PyTorch: Gaudi scaling with data parallelism in the PyTorch framework is achieved using torch.distributed package using DDP - Distributed Data Parallel. DDP is a widely adopted single-program multiple-data training paradigm. With DDP, the model is replicated on every process, and every model replica will be fed with a different set of input data samples. DDP takes care of gradient communication to keep model replicas synchronized and overlaps it with the gradient computations to speed up training.
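The data-parallel idea described above can be sketched without any framework. The toy example below is not Habana, Horovod or DDP code: it uses a one-parameter model and plain Python lists, and `all_reduce_mean` is a hypothetical stand-in for the gradient communication step. It only illustrates that averaging per-replica gradients over equal-sized shards gives the same update as one device processing the whole batch.

```python
def mse_gradient(w, batch):
    """Gradient of mean squared error for the one-parameter model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def all_reduce_mean(values):
    """Stand-in for the all-reduce step: every replica ends up with the mean."""
    return sum(values) / len(values)


def single_device_step(w, batch, lr=0.01):
    """One SGD step computed on the full batch by a single device."""
    return w - lr * mse_gradient(w, batch)


def data_parallel_step(w, batch, num_replicas, lr=0.01):
    """One SGD step where each replica sees an equal shard of the batch."""
    shard_size = len(batch) // num_replicas
    shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(num_replicas)]
    local_grads = [mse_gradient(w, shard) for shard in shards]  # computed in parallel in practice
    return w - lr * all_reduce_mean(local_grads)


if __name__ == "__main__":
    data = [(float(x), 3.0 * x) for x in range(1, 9)]  # synthetic samples of y = 3x
    print(single_device_step(0.5, data))
    print(data_parallel_step(0.5, data, num_replicas=4))
```

Real implementations overlap the communication with the backward pass; this sketch performs the steps sequentially for clarity.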

Amazon DL1 Instance

Gaudi-based Amazon EC2 DL1 Training Instances

Habana Gaudi-based Amazon EC2 DL1 Training Instances feature up to 8 Gaudi accelerators and deliver up to 40% better price performance than current generation GPU-based EC2 instances. These new high-efficiency instances feature 32GB of high bandwidth memory (HBM) per accelerator, 768 GiB of system memory, custom 2nd Generation Intel® Xeon® Scalable Processors, and 4 TB of local NVMe storage. In addition, DL1 instances provide 400 Gbps of networking throughput and feature all-to-all connectivity within DL1 servers with Gaudi’s native integration of ten 100 Gigabit ports of RDMA over Converged Ethernet.

To get started, here are two simple steps:

Set up and launch a Habana base AMI on AWS. This lets you launch an EC2 DL1 instance (dl1.24xlarge, with 8 Habana accelerator cards).

Hint: For users new to AWS, remember to allocate a significant amount of disk space to your Habana DL1 instance; the default disk space is 8GB, which is insufficient for most scenarios. Also spend some time getting familiar with Docker.
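A quick way to confirm that the instance has enough room before pulling images and datasets is to check free disk space from Python. This is a minimal sketch using only the standard library; the 50 GiB threshold is an arbitrary assumption for illustration, not a figure from Habana or AWS.

```python
import shutil


def free_disk_gib(path="."):
    """Free space, in GiB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 2**30


def has_enough_space(path=".", required_gib=50.0):
    """True if at least `required_gib` GiB are free (threshold is an assumption)."""
    return free_disk_gib(path) >= required_gib


if __name__ == "__main__":
    print(f"Free space: {free_disk_gib():.1f} GiB")
    if not has_enough_space():
        print("Consider resizing the volume before pulling images or datasets.")
```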

Pull and run the Habana Docker image. While the DL1 instance provides access to the Habana Gaudi accelerator cards, the docker images provided by Habana set up the right software. Select the Tensorflow docker image, which is what we will be using.

Follow instructions on the installer page to run the container. The rest of the steps must be run in the docker container.

Training an Object Detection Model on Habana Gaudi with Tensorflow

Habana Labs provides reference implementations of many computer vision tasks in both the Pytorch and Tensorflow frameworks. These references are valuable because they already take care of the hard work required to ensure that most of the ML computations (Ops) in both frameworks run efficiently on the HPU cards. For this experiment, we will use the RetinaNet Tensorflow implementation.

In the Habana Tensorflow container, clone the RetinaNet model and install its requirements.

Also, navigate to the /official/vision/beta folder. This folder contains the primary train.py script which we will reference later. We will also download the dataset to this directory and assume this as the location of the dataset.

Download the SignverOD Dataset from Kaggle

We will download the SignverOD dataset, which is made freely available on Kaggle. Note that you will need a Kaggle account to use the API. Alternatively, you can manually download the dataset and copy it to your Habana docker container.

bash

pip install kaggle

Set up your Kaggle username and token in the environment:

bash

export KAGGLE_USERNAME=yourusername
export KAGGLE_KEY=xxxxxxxxxxxxxx
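Because a missing credential only surfaces once the download starts, it can help to verify both variables first. The check below is a small sketch; the function name is hypothetical, and it only tests that the two environment variables shown above are non-empty, not that the token is valid.

```python
import os

REQUIRED_VARS = ("KAGGLE_USERNAME", "KAGGLE_KEY")


def missing_kaggle_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_kaggle_credentials()
    if missing:
        print("Missing Kaggle credentials:", ", ".join(missing))
    else:
        print("Kaggle credentials found in the environment.")
```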

Download the dataset:

bash

kaggle datasets download -d victordibia/signverod
unzip signverod.zip

This will download the SignverOD dataset, which contains a tfrecords folder with train and eval data shards.
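Before training, it is worth confirming that the shards landed where the training command expects them. The sketch below lists files matching the `tfrecords/train*` and `tfrecords/eval*` patterns; the helper name is hypothetical and the exact shard file names are not assumed, only their prefixes.

```python
from pathlib import Path


def find_shards(data_dir, split):
    """Return sorted paths under `data_dir` whose names start with `split`."""
    return sorted(str(p) for p in Path(data_dir).glob(f"{split}*"))


if __name__ == "__main__":
    for split in ("train", "eval"):
        shards = find_shards("tfrecords", split)
        print(f"{split}: {len(shards)} shard(s)")
```

An empty result for either split usually means the archive was unzipped in a different directory from the one the training script runs in.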

Train RetinaNet on a Single Gaudi Card

The following command trains the RetinaNet model on a single Gaudi card:

bash

python3 train.py --experiment=retinanet_resnetfpn_coco --model_dir=output --mode=train --config_file=configs/experiments/retinanet/config_beta_retinanet_1_hpu_batch_8.yaml --params_override="{task: {init_checkpoint: gs://cloud-tpu-checkpoints/vision-2.0/resnet50_imagenet/ckpt-28080, train_data:{input_path: tfrecords/train*}, validation_data: {input_path: tfrecords/eval*} }}"
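The command above packs a nested override string into one long line, which is easy to mistype. As an optional convenience, the sketch below assembles the same arguments in Python so the pieces can be edited separately. The helper names are hypothetical; every flag and value is copied from the command above.

```python
import shlex

INIT_CHECKPOINT = "gs://cloud-tpu-checkpoints/vision-2.0/resnet50_imagenet/ckpt-28080"


def build_params_override(train_glob, eval_glob, init_checkpoint=INIT_CHECKPOINT):
    """Recreate the --params_override value used in the command above."""
    return (
        "{task: {init_checkpoint: " + init_checkpoint + ", "
        "train_data:{input_path: " + train_glob + "}, "
        "validation_data: {input_path: " + eval_glob + "} }}"
    )


def build_train_command(config_file, model_dir, mode="train"):
    """Return the training command as an argument list (no shell quoting needed)."""
    return [
        "python3", "train.py",
        "--experiment=retinanet_resnetfpn_coco",
        f"--model_dir={model_dir}",
        f"--mode={mode}",
        f"--config_file={config_file}",
        "--params_override=" + build_params_override("tfrecords/train*", "tfrecords/eval*"),
    ]


if __name__ == "__main__":
    cmd = build_train_command(
        "configs/experiments/retinanet/config_beta_retinanet_1_hpu_batch_8.yaml",
        "output",
    )
    print(shlex.join(cmd))  # shell-safe form, ready to copy and paste
```

Passing the list to `subprocess.run(cmd, check=True)` from the same folder as train.py would launch training without any manual quoting.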

Train RetinaNet on 8 Gaudi Cards

The following command trains the RetinaNet model on all 8 Gaudi cards available in the DL1 instance:

bash

mpirun --allow-run-as-root --tag-output --merge-stderr-to-stdout --output-filename /root/tmp/retinanet_log --bind-to core --map-by socket:PE=4 -np 8 python3 train.py --experiment=retinanet_resnetfpn_coco --model_dir="$HOME/tmp/retina_model" --mode=train_and_eval --config_file=configs/experiments/retinanet/config_beta_retinanet_8_hpu_batch_64.yaml --params_override="{task: {init_checkpoint: gs://cloud-tpu-checkpoints/vision-2.0/resnet50_imagenet/ckpt-28080, train_data:{input_path: tfrecords/train*}, validation_data: {input_path: tfrecords/eval*} }}"
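The two config file names (`..._1_hpu_batch_8` and `..._8_hpu_batch_64`) suggest that the per-card batch size stays at 8 while the global batch grows with the number of cards. That reading is an assumption based on the file names alone, not on the config contents. Under that assumption, the relationship can be sketched as follows; the dataset size is a hypothetical number used only to show how the step count shrinks.

```python
import math


def global_batch_size(per_card_batch, num_cards):
    """Examples consumed per training step across all cards."""
    return per_card_batch * num_cards


def steps_per_epoch(num_examples, per_card_batch, num_cards):
    """Steps needed to see every example once (last partial batch counted)."""
    return math.ceil(num_examples / global_batch_size(per_card_batch, num_cards))


if __name__ == "__main__":
    num_examples = 1000  # hypothetical dataset size, for illustration only
    for cards in (1, 8):
        print(
            f"{cards} card(s): global batch {global_batch_size(8, cards)}, "
            f"{steps_per_epoch(num_examples, 8, cards)} steps per epoch"
        )
```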

The resulting trained model and training params (e.g., for visualization in Tensorboard) are written to the directory passed as --model_dir, and can then be copied, exported as a SavedModel and used with the SignVer library.

Conclusion

In my experience, it was relatively easy to train an object detection model using my own custom data and a reference model provided by Habana. I also experimented with training a custom Tensorflow 2.0 Keras model (a convolutional autoencoder), which was fairly easy: all that needs to be done is to import and call load_habana_module(), and all supported Ops are run on a single HPU card.

Single and Distributed Training via tf.distribute.Strategy

python

from habana_frameworks.tensorflow import load_habana_module
load_habana_module()

Using multiple cards simultaneously is more involved and requires some additional work. If you are using Tensorflow, the tf.distribute HPUStrategy appears to offer the best developer experience so far.

python

from habana_frameworks.tensorflow.distribute import HPUStrategy

strategy = HPUStrategy()

# For use with Keras
with strategy.scope():
    model = ...
    model.compile(...)

model.fit(...)

Thoughts on Improving the Developer Experience

IMO, one way to further improve the developer experience is to explore an SDK design (and corresponding examples) that lets developers prototype their training scripts locally on a CPU without errors, with the same code automatically working in an environment with Gaudi cards. E.g., the habana_frameworks library could be installable on a CPU-only machine and fall back to CPU abstractions when an HPU is unavailable. I understand that this is non-trivial, but with the right set of checks, error messages and examples, it could be a game changer, especially for onboarding new developers. This design would give developers the freedom to carefully port and test their code locally (minimizing costs!) and then run training on the DL1 instance.
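Until something like this exists in the SDK, a user-side approximation is to guard the import. The sketch below is my own workaround, not a Habana-provided pattern: it tries to import `habana_frameworks.tensorflow` and call `load_habana_module()` (the same call shown earlier), and otherwise leaves the script on the default CPU path. It does not handle the case where the package is installed but no HPU is present.

```python
import importlib


def try_load_habana():
    """Load the Habana TensorFlow module if it is installed.

    Returns True when the module was found and loaded, False when the
    package is not installed (e.g., on a local CPU-only machine).
    """
    try:
        module = importlib.import_module("habana_frameworks.tensorflow")
    except ImportError:
        return False
    # Assumption: an installed package implies HPU hardware is available.
    module.load_habana_module()
    return True


if __name__ == "__main__":
    device = "HPU" if try_load_habana() else "CPU"
    print(f"Training will run on: {device}")
```

With this guard at the top of a training script, the same file can be smoke-tested locally and then run unchanged inside the Habana container.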

The space of hardware accelerators for machine learning is growing rapidly. IMO developer SDKs with good abstractions and good developer UX will definitely shape the future of this space.
