AWS – Amazon S3 Connector for PyTorch now supports Distributed Checkpoint
Amazon S3 Connector for PyTorch now supports Distributed Checkpoint (DCP), reducing the time it takes to write checkpoints to Amazon S3. DCP is a PyTorch feature for saving and loading machine learning (ML) models from multiple training processes in parallel. PyTorch is an open source ML framework used to build and train ML models.
Distributed training jobs often run for several hours or even days, and checkpoints are written frequently to improve fault tolerance. For example, jobs training large foundation models often run for several days and generate checkpoints that are hundreds of gigabytes in size. Using DCP with Amazon S3 Connector for PyTorch helps you reduce the time to write these large checkpoints to Amazon S3, which keeps your compute resources utilized and ultimately lowers compute cost.
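As a rough illustration of what this looks like in a training script, the sketch below saves and restores a model's state through DCP using the connector's S3 storage writer and reader. It assumes the `s3torchconnector` and `torch` packages are installed and that AWS credentials are configured. The `checkpoint_uri` helper, its `step-N` naming scheme, and the bucket, prefix, and region values are hypothetical and used only for illustration. Treat it as a minimal sketch and check the project's documentation for the current API.

```python
def checkpoint_uri(bucket: str, prefix: str, step: int) -> str:
    """Build an S3 prefix for one checkpoint.

    Hypothetical helper: the "step-N" layout is an assumption, not
    something the connector requires.
    """
    return f"s3://{bucket}/{prefix.strip('/')}/step-{step}/"


def save_checkpoint(model, uri: str, region: str) -> None:
    """Write the model's state dict to S3 through DCP."""
    # Imported lazily so the URI helper works without these packages.
    import torch.distributed.checkpoint as dcp
    from s3torchconnector.dcp import S3StorageWriter

    writer = S3StorageWriter(region=region, path=uri)
    dcp.save(state_dict=model.state_dict(), storage_writer=writer)


def load_checkpoint(model, uri: str, region: str) -> None:
    """Read a DCP checkpoint from S3 back into the model."""
    import torch.distributed.checkpoint as dcp
    from s3torchconnector.dcp import S3StorageReader

    # dcp.load fills the provided state dict in place.
    state_dict = model.state_dict()
    reader = S3StorageReader(region=region, path=uri)
    dcp.load(state_dict=state_dict, storage_reader=reader)
    model.load_state_dict(state_dict)
```

In a training loop you might call `save_checkpoint(model, checkpoint_uri("my-bucket", "runs/exp1", step), "us-east-1")` every few hundred steps, where the bucket, prefix, and region are placeholders for your own values.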
Amazon S3 Connector for PyTorch is an open source project. To get started, visit the GitHub page.