GCP – Train AI for less: Improve ML Goodput with elastic training and optimized checkpointing
Want to save money on large AI training? For a typical PyTorch LLM training workload that spans thousands of accelerators for several weeks, a 1% improvement in ML Goodput can translate to more than a million dollars in cost savings1. That makes improving ML Goodput an important goal for model training, both for efficiency and for model iteration velocity.
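For a rough sense of where that estimate comes from, here is a back-of-the-envelope calculation using the assumptions in the footnote at the end of this post. The per-GPU-hour rate is a placeholder for illustration only, not actual A3 Ultra pricing.

```python
# Back-of-the-envelope illustration of the footnoted estimate.
# The hourly rate is a placeholder, not actual A3 Ultra pricing.
gpus = 20_000
weeks = 8
hourly_rate_usd = 4.0                      # hypothetical $/GPU-hour

gpu_hours = gpus * weeks * 7 * 24          # ~26.9M GPU-hours for the full run
savings_per_percent = 0.01 * gpu_hours * hourly_rate_usd
print(f"1% ML Goodput improvement ~ ${savings_per_percent:,.0f}")  # ~ $1,075,200
```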
However, there are several challenges to improving ML Goodput today: frequent interruptions that force restarts from the latest checkpoint, slow inline checkpointing that stalls training, and limited observability that makes failures hard to detect. These issues add significantly to time-to-market (TTM) and cost-to-train. Several industry publications have articulated these issues, e.g., this arXiv paper.
Improving ML Goodput
To improve ML Goodput, you need to minimize the impact of disruptive events on the progress of the training workload. To resume a job quickly, you can automatically scale it down or swap out failed resources for spares from standby capacity. At Google Cloud, we call this elastic training. Further, you can reduce workload interruptions during checkpointing and speed up checkpoint restores after failures by loading from the nearest available storage location. We call these capabilities asynchronous checkpointing and multi-tier checkpointing.
The following figure illustrates how these techniques provide an end-to-end remediation workflow to improve ML Goodput for training. It depicts an example workload of nine nodes with three-way data parallelism (DP) and three-way pipeline parallelism (PP), along with the remediation actions taken based on failures and available spare capacity.
You can customize the remediation policy for your specific workload, for example choosing between hot-swap and scale-down remediation strategies, or configuring the checkpointing frequency. A supervisor process receives failure, degradation, and straggler signals from a diagnostic service and uses the policy to manage these events. For correctable errors, the supervisor might request an in-job restart, potentially restoring from a local checkpoint. For uncorrectable hardware failures, a hot swap can replace the faulty node, potentially restoring from a peer checkpoint. If no spare resources are available, the system can scale down. These mechanisms make training more resilient and adaptable to resource changes. When a replacement node becomes available, training scales up automatically to maximize GPU utilization. During scale-down and scale-up, user-defined callbacks help adjust hyperparameters such as learning rate and batch size. You can set remediation policies using a Python script.
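As a hedged sketch of what such a policy might look like in a Python script, consider the configuration below. The class and field names are illustrative, not the actual interface of the ML Goodput optimization recipe.

```python
# Illustrative sketch only: the class and field names are hypothetical,
# not the actual API of the ML Goodput optimization recipe.
from dataclasses import dataclass

@dataclass
class RemediationPolicy:
    allow_in_job_restart: bool = True   # in-place restart for correctable errors
    prefer_hot_swap: bool = True        # replace faulty nodes from spare capacity first
    allow_scale_down: bool = True       # shrink the data-parallel dimension if no spares exist
    min_dp_replicas: int = 2            # never shrink below this many data-parallel replicas
    checkpoint_every_n_steps: int = 50  # asynchronous checkpoint frequency

policy = RemediationPolicy(checkpoint_every_n_steps=100)
```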
Let’s take a deeper look at the key techniques you can use when optimizing ML Goodput.
Elastic training
Elastic training enhances the resiliency of LLM training by enabling failure sensing and mitigation capabilities for workloads. This allows jobs to automatically continue with remediation strategies including GPU reset, node hot swap, and scaling down the data-parallel dimension of a workload to avoid using faulty nodes, thereby reducing job interruption time and improving ML Goodput. Furthermore, elastic training enables automatic scaling up of data-parallel replicas when replacement nodes become available, maximizing training throughput.
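When the number of data-parallel replicas changes during scale-down or scale-up, hyperparameters usually need to be adjusted. Below is a minimal sketch of such a user-defined callback; the function name and signature are hypothetical, and linear learning-rate scaling is just one common heuristic.

```python
# Hypothetical callback: the name and signature are illustrative, not the
# recipe's actual callback interface.
def on_dp_resize(new_dp: int, old_dp: int, lr: float, global_batch: int):
    """Adjust hyperparameters when the data-parallel size changes.

    Keeps the per-replica batch size fixed and applies the linear
    learning-rate scaling heuristic to the new global batch size.
    """
    per_replica_batch = global_batch // old_dp
    new_global_batch = per_replica_batch * new_dp
    new_lr = lr * new_global_batch / global_batch
    return new_lr, new_global_batch

# Example: scaling down from 3 to 2 data-parallel replicas.
print(on_dp_resize(new_dp=2, old_dp=3, lr=3e-4, global_batch=768))  # (~2e-4, 512)
```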
Watch this short video to see elastic training techniques in action:
Optimized checkpointing
Suboptimal checkpointing can add unnecessary overhead during training and cause significant loss of progress when interruptions occur and training must restore from an older checkpoint. You can substantially reduce these impacts by defining a dedicated asynchronous checkpointing process and optimizing it to quickly offload the training state from GPU high-bandwidth memory to host memory. Tuning the checkpoint frequency, based on factors such as the job interruption rate and the asynchronous overhead, is vital: the best interval may range from several hours to mere minutes, depending on the workload and cluster size. An optimal checkpoint frequency minimizes both checkpoint overhead during normal training and lost computation during unexpected interruptions.
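One common heuristic for picking a starting interval is the Young/Daly approximation, interval ≈ √(2 × checkpoint cost × mean time between failures). The sketch below applies it with made-up numbers; it is a generic rule of thumb, not the tuning logic used by the recipe.

```python
import math

def approx_checkpoint_interval(mtbf_hours: float, overhead_minutes: float) -> float:
    """Young/Daly approximation: interval ~= sqrt(2 * checkpoint_cost * MTBF)."""
    return math.sqrt(2.0 * (overhead_minutes / 60.0) * mtbf_hours)

# An interruption roughly every 5 hours with ~30 seconds of async overhead
# suggests checkpointing about every 17 minutes.
interval_hours = approx_checkpoint_interval(mtbf_hours=5.0, overhead_minutes=0.5)
print(f"checkpoint every ~{interval_hours * 60:.0f} minutes")
```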
A robust way to meet the demands of frequent checkpointing is to leverage three levels of storage: local node storage, e.g., local SSD; peer node storage in the same cluster; and Google Cloud Storage. This multi-tiered checkpointing approach automatically replicates data across these storage tiers during save and restore operations via the host network interface or NCCL (the NVIDIA Collective Communications Library), allowing the system to use the fastest accessible storage option. By combining asynchronous checkpointing with a multi-tier storage strategy, you can achieve quicker recovery times and more resilient training workflows while maintaining high productivity and minimizing the loss of computational progress.
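To make the tier ordering concrete, here is a hedged sketch of how a restore path might prefer the fastest available copy. The directory layout and helper function are hypothetical; the recipe's multi-tier checkpointing manages replication and lookup for you.

```python
from pathlib import Path
from typing import Optional

# Hypothetical tier layout: local SSD first, then a peer replica, then Cloud
# Storage (e.g., mounted via Cloud Storage FUSE). The real implementation
# handles this ordering automatically.
TIERS = [
    Path("/mnt/local_ssd/checkpoints"),
    Path("/mnt/peer_replica/checkpoints"),
    Path("/gcs/my-bucket/checkpoints"),
]

def find_fastest_checkpoint(step: int) -> Optional[Path]:
    """Return the checkpoint for `step` from the fastest tier that holds it."""
    for tier in TIERS:
        candidate = tier / f"step_{step}"
        if candidate.exists():
            return candidate
    return None
```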
Watch this short video to see optimized checkpointing techniques in action:
These ML Goodput improvement techniques leverage the NVIDIA Resiliency Extension, which provides failure signaling and in-job restart capabilities, as well as recent improvements to PyTorch’s distributed checkpointing that support several of the checkpoint-related optimizations described above. These capabilities are integrated with Google Kubernetes Engine (GKE) and the NVIDIA NeMo training framework, pre-packaged into a container image, and available with an ML Goodput optimization recipe for easy deployment.
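On the checkpointing side, recent PyTorch releases expose asynchronous distributed checkpointing directly through torch.distributed.checkpoint.async_save. The minimal single-process sketch below assumes PyTorch 2.4 or newer and a writable /tmp path; in a real job, the process group comes from the launcher and the checkpoint path would point at local SSD.

```python
import os
import torch
import torch.distributed as dist
import torch.distributed.checkpoint as dcp

# Single-process setup so the sketch runs standalone; a real job gets its
# process group from the launcher (e.g., torchrun).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# async_save stages the state dict to host memory, then writes it to storage in
# the background, so the training loop only blocks for the staging copy.
state = {"model": model.state_dict(), "optim": optimizer.state_dict()}
future = dcp.async_save(state, checkpoint_id="/tmp/ckpt/step_1000")

# ... training continues here ...

future.result()  # ensure the previous write finished before the next checkpoint
dist.destroy_process_group()
```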
Elastic training in action
In a recent internal case study with 1,024 A3 Mega GPU-accelerated instances (built on NVIDIA Hopper), workload ML Goodput improved from 80%+ to 90%+ using a combination of these techniques. While not every workload will benefit in the same way, the table below shows the specific metric improvements and the ML Goodput contribution of each technique.
Example: the case study experiment used an A3 Mega cluster with 1,024 GPUs running ~40-hour jobs with ~5 simulated interruptions per day
Conclusion
In summary, elastic training and optimized checkpointing, along with easy deployment options, are key strategies for maximizing ML Goodput for large PyTorch training workloads. As the case study above shows, they can deliver meaningful ML Goodput improvements and significant efficiency savings. These capabilities are customizable and composable through a Python script. If you’re running PyTorch GPU training workloads on Google Cloud today, we encourage you to try out our ML Goodput optimization recipe, which provides a starting point with recommended configurations for elastic training and checkpointing. We hope you have fun building and share your feedback!
Various teams and individuals within Google Cloud contributed to this effort. Special thanks to Jingxin Ye, Nicolas Grande, Gerson Kroiz, and Slava Kovalevskyi, as well as our collaborative partners Jarek Kazmierczak, David Soto, Dmitry Kakurin, Matthew Cary, Nilay Goyal, and Parmita Mehta for their immense contributions to developing all of the components that made this project a success.
1. Assuming A3 Ultra pricing for 20,000 GPUs with jobs spanning 8 weeks or longer