Kubernetes v1.36 - Mutable Pod Resources for Suspended Jobs

Guille Vigil
Platform Engineer

Batch and ML workloads often need resources that aren't known at Job creation time. The optimal allocation depends on cluster capacity, queue priorities, and hardware availability, all of which can change between submission and execution.

Kubernetes v1.36 promotes to beta a feature that addresses this: mutable pod resources for suspended Jobs.

Before this change, resource requirements in a Job's pod template were immutable. If a queue controller determined that a suspended Job should run with different resources, the only option was to delete and recreate the Job, which discarded its metadata, status, and history.
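To make the difference concrete, here is a minimal sketch of the patch body a queue controller might send instead of deleting and recreating the Job. It only builds the payload; it does not talk to a cluster. The helper name `build_resource_patch`, the container name `trainer`, and the resource values are all hypothetical, and whether the API server accepts the update is assumed to depend on the Job being suspended and the feature being enabled on the cluster.

```python
import json


def build_resource_patch(container_name, requests, limits=None):
    """Build a strategic-merge-patch body that sets one container's
    resources in a Job's pod template.

    Hypothetical helper for illustration; not part of any Kubernetes client.
    """
    resources = {"requests": dict(requests)}
    if limits is not None:
        resources["limits"] = dict(limits)
    return {
        "spec": {
            "template": {
                "spec": {
                    # Strategic merge matches containers by their "name" key,
                    # so only the named container's resources are changed.
                    "containers": [
                        {"name": container_name, "resources": resources}
                    ]
                }
            }
        }
    }


# Example values are assumptions: a suspended training Job whose queue
# controller has decided on a 4-GPU allocation.
patch = build_resource_patch(
    "trainer",
    requests={"cpu": "8", "memory": "32Gi", "nvidia.com/gpu": "4"},
    limits={"nvidia.com/gpu": "4"},
)
print(json.dumps(patch, indent=2))
```

The printed JSON could then be applied to the suspended Job, for example as the `--patch` argument to `kubectl patch job`, before the controller sets `spec.suspend` to `false`. Because the Job object is updated in place, its metadata, status, and history stay intact.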