AWS – Amazon SageMaker HyperPod announces health monitoring agent support for Slurm clusters
Today, Amazon SageMaker HyperPod announces the general availability of the health monitoring agent for Slurm clusters. SageMaker HyperPod helps you provision resilient clusters for running machine learning (ML) workloads and developing state-of-the-art models such as large language models (LLMs), diffusion models, and foundation models (FMs). The health monitoring agent performs passive, background health checks of instances to identify problems in key areas without impact on application behavior or performance, flags failures instantly, and replaces any unhealthy instances to keep your training jobs running smoothly.
The agent runs continuously on all GPU- or Trainium-based nodes in your HyperPod cluster, watching for hardware issues such as unresponsive GPUs or rising NVLink error counters. When it detects a fault, it marks the node as unhealthy and automatically reboots it or replaces it with a healthy node, keeping your jobs running without manual intervention. The agent also coordinates failure handling with the job auto-resume functionality available on Slurm clusters. For example, jobs with auto-resume enabled continue from the last saved checkpoint once the agent has replaced the affected nodes. This hands-free recovery, already available on HyperPod clusters orchestrated with Amazon EKS, now gives Slurm clusters the same resilient environment, helping teams train large models for weeks without disruption and reclaim time and costs that would otherwise be lost to mid-run failures. In addition, customers can now reboot their nodes with a simple command to clear intermittent issues, such as a GPU driver that needs a reset.
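The announcement does not name the reboot command. As a minimal sketch, the Python helper below assembles the Slurm `scontrol` invocation that the HyperPod documentation is understood to describe for a manual reboot. The `state=fail` plus `reason=Action:Reboot` convention and the helper names `build_reboot_command` and `as_shell` are assumptions for illustration; confirm the exact syntax in the current HyperPod documentation before running it on a cluster.

```python
import shlex


def build_reboot_command(node_name):
    """Return the argv list for a manual node reboot request.

    Assumption: HyperPod acts on nodes that are put into the Slurm "fail"
    state with the reason string "Action:Reboot". Verify this against the
    HyperPod documentation for your cluster version.
    """
    return [
        "scontrol",
        "update",
        f"node={node_name}",
        "state=fail",
        "reason=Action:Reboot",
    ]


def as_shell(argv):
    """Render the argv list as a single shell-safe command line."""
    return shlex.join(argv)


# Example (hypothetical node name); pass the list to subprocess.run on the
# Slurm controller node to actually issue the request:
#   print(as_shell(build_reboot_command("ip-10-1-2-3")))
```

Building the command as a list rather than a string keeps the node name from being interpreted by a shell if it is later passed to `subprocess.run`.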
The health monitoring agent for Slurm is available in all AWS Regions where HyperPod is generally available. The agent is enabled automatically on all newly created Slurm clusters; to enable it on an existing cluster, upgrade to the latest HyperPod AMI by calling the UpdateClusterSoftware API. To learn more, visit the Amazon SageMaker HyperPod documentation.
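For an existing cluster, the UpdateClusterSoftware call could look roughly like the following sketch, which uses the boto3 SageMaker client's `update_cluster_software` method. The helper name `upgrade_hyperpod_software`, the cluster name, and the Region are hypothetical placeholders, and the client is passed in so the function can be exercised without AWS credentials.

```python
def upgrade_hyperpod_software(sagemaker_client, cluster_name):
    """Request a HyperPod software (AMI) update for an existing cluster.

    sagemaker_client is expected to be a boto3 SageMaker client, for example
    boto3.client("sagemaker"). Returns the ARN of the cluster being updated.
    """
    response = sagemaker_client.update_cluster_software(ClusterName=cluster_name)
    return response["ClusterArn"]


# Example usage (hypothetical cluster name and Region; requires boto3 and
# AWS credentials that are allowed to call sagemaker:UpdateClusterSoftware):
#
#   import boto3
#   client = boto3.client("sagemaker", region_name="us-west-2")
#   print(upgrade_hyperpod_software(client, "my-slurm-cluster"))
```

The update runs asynchronously, so the returned ARN only confirms that the request was accepted; the cluster status can then be checked with the DescribeCluster API.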