AWS – Amazon EKS support in Amazon SageMaker HyperPod to scale foundation model development
We are excited to announce the general availability of Amazon EKS support in SageMaker HyperPod, which enables customers to run and manage their Kubernetes workloads on SageMaker HyperPod, purpose-built infrastructure for foundation model (FM) development that reduces time to train models by up to 40%.
Many customers use Kubernetes to orchestrate their ML workflows because of its portability, scalability, and rich ecosystem of tools. These customers want to keep using the familiar Kubernetes interface, but they also want an automated way to manage hardware failures. EKS support in HyperPod combines the self-healing, performant clusters of SageMaker HyperPod with the containerization capabilities of Amazon EKS, a managed Kubernetes service. With this launch, customers can run deep health checks during cluster creation to reduce failures during training. Further, HyperPod automatically replaces faulty nodes and resumes training from your last checkpoint on both AWS Trainium and NVIDIA GPUs at a scale of more than a thousand accelerators. Customers have the flexibility to use either the new HyperPod CLI or their preferred tools to submit, manage, and monitor workloads. The persistent cluster environment offers SSM access and the ability to customize the cluster. EKS-orchestrated HyperPod clusters also integrate with CloudWatch Container Insights to provide out-of-the-box observability by auto-discovering HyperPod node health status and visualizing it in curated dashboards.
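Resuming from the last checkpoint only helps if the training script saves checkpoints regularly and looks for one at startup. The following is a minimal sketch of that pattern using only the Python standard library. It does not use any HyperPod or EKS API; the function names (`save_checkpoint`, `load_latest_checkpoint`, `train`), the JSON file layout, and the `fail_at` parameter that simulates a node failure are all hypothetical and chosen for illustration.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def save_checkpoint(ckpt_dir: Path, step: int, state: dict) -> Path:
    """Write a checkpoint atomically so a crash never leaves a partial file."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    final_path = ckpt_dir / f"step_{step:08d}.json"
    fd, tmp_name = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp_name, final_path)  # atomic rename within one filesystem
    return final_path


def load_latest_checkpoint(ckpt_dir: Path):
    """Return (step, state) from the newest checkpoint, or (0, None) if none."""
    if not ckpt_dir.is_dir():
        return 0, None
    # Zero-padded step numbers make lexicographic order match numeric order.
    candidates = sorted(ckpt_dir.glob("step_*.json"))
    if not candidates:
        return 0, None
    with open(candidates[-1]) as f:
        payload = json.load(f)
    return payload["step"], payload["state"]


def train(ckpt_dir: Path, total_steps: int, save_every: int,
          fail_at: Optional[int] = None) -> dict:
    """Toy training loop that resumes from the latest checkpoint if present."""
    start_step, state = load_latest_checkpoint(ckpt_dir)
    if state is None:
        state = {"loss": 1.0}
    for step in range(start_step + 1, total_steps + 1):
        if fail_at is not None and step == fail_at:
            # Hypothetical stand-in for a hardware fault on the node.
            raise RuntimeError("simulated node failure")
        state["loss"] *= 0.9  # placeholder for a real optimizer step
        if step % save_every == 0:
            save_checkpoint(ckpt_dir, step, state)
    return {"resumed_from": start_step, "final_step": total_steps, "state": state}
```

In this sketch, a run that fails at step 7 with `save_every=5` leaves a checkpoint at step 5, and the next call to `train` on the same directory continues from step 6. In a real job the checkpoint directory would need to live on storage that survives node replacement, and the state would be model and optimizer weights rather than a single number.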
This release is generally available in all AWS Regions where SageMaker HyperPod is available, except Europe (London).
To learn more, see the following resources: Webpage, AWS News Blog, Documentation, GitHub repository.