Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.
All the infrastructure exists so that a workload can ask for a GPU, and the way it asks is a single line under resources. But GPUs behave differently from CPU and memory in one critical way: by default they are neither compressible nor fractional. You request whole GPUs, and each GPU you are granted is yours exclusively, not shared with other containers. Getting the request syntax right, and understanding that a request of 1 means one entire physical GPU, is the difference between a pod that schedules and one that sits Pending or, worse, silently runs on CPU. These are the five most-used lines of YAML in the entire course.
A GPU request goes under resources.limits with the key nvidia.com/gpu. The pod below asks for one GPU and runs a CUDA workload; the scheduler will only place it on a node with a free GPU.
Use these three in order. Each builds on the one before.
In one paragraph, how does a pod request a GPU in Kubernetes, and what does requesting '1' actually grant?
Walk me through what the scheduler does with a pod that has nvidia.com/gpu: 1 — how it finds a node and reserves the device.
Given that GPUs are scheduled as whole, exclusive devices, what scheduling and utilization problems does that create for small workloads, and which later techniques (MIG, time-slicing) address them?
# gpu-pod.yml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vectoradd
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0
      resources:
        limits:
          nvidia.com/gpu: 1  # request exactly one whole GPU
  # No tolerations here yet; on a tainted GPU node this pod would stay Pending
  # until you add the toleration covered in Module 3.
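
The manifest above sets only limits, which is enough: for an extended resource such as nvidia.com/gpu, Kubernetes uses the limit as the request when requests is omitted. The variant below is a minimal sketch of the explicit form, for cases where you prefer to state both values. The file name and pod name are hypothetical and chosen only for illustration; everything else mirrors the original pod.

```yaml
# gpu-pod-explicit.yml  (hypothetical file name, for illustration only)
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd-explicit  # hypothetical pod name
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vectoradd
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0
      resources:
        requests:
          nvidia.com/gpu: 1  # optional; if present, it must equal the limit
        limits:
          nvidia.com/gpu: 1  # whole integers only; a fractional value such as 0.5 is rejected
```

Setting requests without limits is not accepted for GPUs, and the two values cannot differ, so the limits-only form in gpu-pod.yml is usually the simpler choice.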