For a container to see a GPU, it must be started by the NVIDIA container runtime, which hooks into container startup to mount the driver libraries and device files. You select that runtime either by making it the node default or, more explicitly, via a RuntimeClass that pods reference by name. Understanding RuntimeClass matters because mixed clusters often want the NVIDIA runtime only for GPU pods, and because "the GPU isn't injected" frequently traces back to the pod starting under the wrong runtime. Runtime selection is the layer between the device plugin advertising a GPU and the container actually receiving it.
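A RuntimeClass named nvidia points at the NVIDIA container runtime handler. Pods that set runtimeClassName: nvidia are started with that runtime, which injects the GPU; pods that omit it are started with the node's default runtime and get no GPU injection unless that default is itself the NVIDIA runtime.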
Use these three prompts in order. Each builds on the one before.
In one paragraph, what is a RuntimeClass and why do GPU pods sometimes need runtimeClassName: nvidia?
Walk me through how the NVIDIA container runtime hooks into container startup to mount driver libraries and device nodes into the container.
In a mixed cluster where only some nodes should use the NVIDIA runtime, compare making it the node default vs requiring runtimeClassName per pod, and the failure modes of each.
# Define a RuntimeClass that maps to the NVIDIA runtime handler.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia # matches the containerd runtime handler name
---
# A pod that explicitly opts into the NVIDIA runtime.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-with-runtimeclass
spec:
  runtimeClassName: nvidia
  containers:
    - name: app
      image: nvcr.io/nvidia/cuda:12.5.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
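
For the mixed-cluster case, where only GPU pods should use the NVIDIA runtime, a RuntimeClass can also carry scheduling constraints so that pods referencing it are steered onto GPU nodes. The following is a minimal sketch, not a definitive setup: the RuntimeClass name, the node label, and the taint key are all assumptions and would need to match whatever your GPU nodes actually carry.

```yaml
# Sketch: a RuntimeClass that also steers its pods onto GPU nodes.
# The name, node label, and taint below are assumptions for illustration.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia-gpu-nodes             # hypothetical name
handler: nvidia                      # same handler as the demo above
scheduling:
  nodeSelector:
    nvidia.com/gpu.present: "true"   # assumed label present on GPU nodes
  tolerations:
    - key: nvidia.com/gpu            # assumed taint applied to GPU nodes
      operator: Exists
      effect: NoSchedule
```

A pod that sets runtimeClassName: nvidia-gpu-nodes would have this node selector merged into its own at admission and the tolerations added, so it should only be scheduled where the nvidia handler is expected to exist. Without such constraints, a pod that references the RuntimeClass can land on a node whose container runtime has no matching handler and fail to start.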