Overcoming the Deployment Challenges of H100 GPUs in Azure Kubernetes

Dec 21, 2023 • 5 min

In the ever-evolving world of cloud computing, NVIDIA H100 GPUs have made a name for themselves with their stellar performance, driving significant strides in AI and intensive computing. Now that most cloud providers, including Azure, offer them, we got the chance to integrate them into an Azure Kubernetes Service (AKS) cluster for a heavyweight player in Generative AI (LLMs).

Our original deployment plan, which had worked with the A100 generation, never produced a functional cluster. We hit a wall when launching instances of type Standard_ND96isr_H100_v5 (each packing 8 NVIDIA H100 GPUs) on Ubuntu 22.04, within a Kubernetes 1.27.7 cluster.

So, let’s dive into our journey!

The Challenge

We leveraged the Nodepool feature to scale the Kubernetes cluster capacity as needed. AKS Nodepools rely on Virtual Machine Scale Sets (VMSS), which take care of adding and removing the underlying instances.

Despite our setup, we never managed to get an instance in a proper state, let alone recognized as a node in the Kubernetes API. The process of spinning up a new virtual machine with H100 GPUs simply never concluded successfully.

Digging deeper, we discovered that a file was missing during the GPU initialization: /usr/share/nvidia/nvswitch/dgxh100_hgxh100_topology. This missing file caused the systemd service nvidia-fabricmanager to fail, preventing the complete initialization of the instance. Consequently, it never got recognized as a Node in the Kubernetes API.
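
For the record, this is roughly how the failure shows up on the instance itself, assuming you can reach it over SSH while it is stuck (commands given purely for illustration):

# The fabric manager unit fails at startup because its topology file is absent
systemctl status nvidia-fabricmanager.service
journalctl -u nvidia-fabricmanager.service --no-pager | tail -n 20

# The file it is looking for is simply not shipped in the stock AKS image
ls -l /usr/share/nvidia/nvswitch/dgxh100_hgxh100_topology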

Further analysis revealed that the images prepared by Azure for AKS include a stripped-down installation of the GPU drivers (version 525.85.12 at the time of writing). This makes sense considering the massive size of a complete NVIDIA installation, and not everyone uses GPUs on AKS.
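
If you want to double-check which driver your image ships, you can run something like this directly on the instance (purely illustrative):

# Driver version as reported by the NVIDIA kernel module
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# NVIDIA packages actually present on the node
dpkg -l | grep -i nvidia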

Unfortunately, where A100s thrived, H100s failed!

The Quick Hack

To swiftly address the issue, we tried a manual intervention right after an instance started… by simply inserting the missing file!

Spoiler alert: the rest of the initialization went smoothly, and H100 Kubernetes nodes finally appeared.

Here’s how we did it:

# Work from a throwaway directory
TMP_NVIDIA=$(mktemp -d)
(
  cd "$TMP_NVIDIA"
  # Fetch the nvidia-fabricmanager package (a close driver version, see below)
  curl -s -o nvidia-fabricmanager.deb http://azure.archive.ubuntu.com/ubuntu/pool/multiverse/f/fabric-manager-525/nvidia-fabricmanager-525_525.147.05-0ubuntu0.22.04.1_amd64.deb
  # Pull the data archive out of the .deb...
  ar -x nvidia-fabricmanager.deb data.tar.zst
  # ...and unpack only the missing topology file, straight into place
  tar xf data.tar.zst -C / ./usr/share/nvidia/nvswitch/dgxh100_hgxh100_topology
)
rm -Rf "$TMP_NVIDIA"

Note that we extracted the file from a different driver version (525.147.05 instead of the installed 525.85.12). We had no other choice, since the Ubuntu package registries reachable from the machine didn't offer an exact match.

But rest assured, we verified (through checksum) that the file was identical across versions.
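
For the curious, the check boils down to something like this, assuming you have a reference copy of the file coming from a machine with the matching driver (paths are illustrative):

# Identical digests => the topology file is byte-for-byte the same across versions
sha256sum /usr/share/nvidia/nvswitch/dgxh100_hgxh100_topology \
          /path/to/reference/dgxh100_hgxh100_topology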

Non-Automatable

While it would be easy to adapt an image or customize its startup using Custom Script Extensions, sadly, AKS does not give you full control. The customization options are quite limited.

Azure’s Suggested Solution

So, we reached out to our contact at Azure for a solution.

The strategy we adopted came from a somewhat obscure article about the A100, which nonetheless shed light on a few H100 specifics. It became clear we had to override the default behavior and install the NVIDIA driver later on.

Disabling Default Installation

Firstly, we needed to stop the default behavior of installing drivers during instance startup.

This was done using the SkipGPUDriverInstall tag.

az aks nodepool add \
  --resource-group <AKS_RG> --cluster-name <AKS_NAME> \
  --name <NODEPOOL_NAME> --node-count 1 \
  --os-sku Ubuntu --node-osdisk-size 128 \
  --node-vm-size Standard_ND96isr_H100_v5 \
  --tags SkipGPUDriverInstall=true

NVIDIA’s gpu-operator

NVIDIA's gpu-operator is a Kubernetes operator, one among many, designed to handle the whole GPU software stack on a cluster's nodes.

In short, once a new node is added to Kubernetes, the operator sequentially:

  • Detects potential GPUs installed on the node
  • Checks for the presence of NVIDIA drivers
  • Installs a recent version if not present
  • Sets up a bunch of configurations

All this through multiple reconciliation loops and node labeling.
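
For reference, the installation usually goes through the official Helm chart; here is a minimal sketch (the namespace and release name are our own choices, not requirements of the chart):

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Let the operator detect the GPUs and install the NVIDIA driver itself
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set driver.enabled=true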

I must admit, this system is fascinating, even if it looks quite complicated at first glance.

Well, at this stage … it seems to work!

Version Incompatibility

But where are my 8 GPUs per instance???

status:
  allocatable:
    nvidia.com/gpu: "5"
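
A quick way to see what each node actually offers to the scheduler (a jsonpath one-liner, purely illustrative):

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'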

Despite numerous tests, nvidia-smi erratically displayed 5, 6, or 7 GPUs, but never a total of 8.

The culprit was the operator itself, which in version 23.9.0 installed driver 535.104.12. Enforcing version 535.129.03 did not work either. Instead of spending time we didn't have on further investigation, we switched back to gpu-operator 23.6.1, forcing driver version 535.86.10.
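
In Helm terms, pinning both versions looks roughly like this (a sketch; depending on the repository index, the chart version may need its "v" prefix):

helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --version v23.6.1 \
  --set driver.version=535.86.10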

And then, everything was fine.

Was it?

Boot Time

Using the gpu-operator results in a significant delay before a Kubernetes node becomes useful. Some tasks take a substantial amount of time (in descending order):

  • Driver compilation
  • Driver download
  • Container image downloads (there are quite a few)
  • Hardware detection on nodes
  • Reconfiguring containerd to support GPUs

This delay is unfortunate given the per-minute cost of such hardware. Even more so when considering the cluster autoscaler, which doesn't react well to pods not being scheduled quickly on the new node. Specifically, the nvidia.com/gpu resources associated with the node only appear once the whole process is complete, after a good 6 minutes!
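
If you want to put a number on it yourself, a small polling loop does the trick (illustrative; <NODE_NAME> is a placeholder):

NODE=<NODE_NAME>
start=$(date +%s)
# Poll until the node advertises nvidia.com/gpu among its allocatable resources
until [ -n "$(kubectl get node "$NODE" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}' 2>/dev/null)" ]; do
  sleep 10
done
echo "nvidia.com/gpu appeared after $(( $(date +%s) - start ))s"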

Ah, autoscaling, a whole topic in itself… deserving its own blog post :)

Precompiled Drivers

In some cases, the gpu-operator can use precompiled drivers available for your instance type to speed up the process. The list of available images is published by NVIDIA, but unfortunately you can only choose the driver branch (major version), e.g., 535-5.15.0-1053-azure-ubuntu22.04.
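
Enabling them is again a matter of Helm values; a sketch assuming the driver.usePrecompiled switch of recent gpu-operator releases:

helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set driver.usePrecompiled=true \
  --set driver.version=535   # driver branch only, no exact version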

Conclusion

In conclusion, the situation is rather tricky at the moment.

We hope Azure’s teams will integrate a corrected version (not necessarily a more recent one) of the drivers into their base image. That would enable quick deployment of H100 GPU nodes in an AKS cluster, without the need for a complicated setup. Or perhaps they’ll allow finer-grained customization of the instances launched by AKS.

I’d like to thank Aurélien, Thibault, and Romain, my “Monkey” colleagues at Enix. They rolled up their sleeves like never before to meet tight deadlines.

Also, a shoutout to our Azure contact, who’ll recognize himself, for his valuable insights during the problematic phase.

Terraform Bonus

As a side note:

We use Terraform to set up and maintain the AKS cluster. We particularly wanted to use the GPU preview images available via custom headers. Unfortunately, this isn't possible with Terraform, as the issue actually lies on the Azure API side!
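
For reference, here is roughly what that route looks like with the az CLI; the UseGPUDedicatedVHD header comes from Azure's GPU VHD preview, so treat it as an assumption that may have evolved since:

az aks nodepool add \
  --resource-group <AKS_RG> --cluster-name <AKS_NAME> \
  --name <NODEPOOL_NAME> --node-count 1 \
  --node-vm-size Standard_ND96isr_H100_v5 \
  --aks-custom-headers UseGPUDedicatedVHD=true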

Alright, that’s all for now, ciao!

Update

We received feedback from Microsoft: an AKS image patch is scheduled for May this year (2024).


Do not miss our latest DevOps and Cloud Native blog posts! Follow Enix on Twitter!