“Multislice training has been a game-changer. It’s made it easy to scale our ML workloads beyond a single densely-interconnected slice using data-center networking. JAX XLA made it easy to set up and delivered high performance out-of-the-box.”—Myle Ott, Co-founder, Character AI
Multislice supports the JAX and PyTorch frameworks. For fast out-of-the-box performance, in addition to compiler support for all models, we provide MaxText and PAX for LLMs: open-sourced, well-tested examples written in pure Python and JAX that can be used as starter code. PAX is a framework for training large-scale models that allows advanced, fully configurable experimentation and parallelization, and has demonstrated industry-leading MFU rates. MaxText is a more minimal framework intended for forking and adaptation. The only code change compared to single-slice code is the extra sharding dimension for DCN parallelism, sketched below.
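To make that concrete, here is a minimal sketch of what the extra DCN dimension can look like in JAX. It assumes a simple data-parallel-across-slices setup; the slice count, axis names ("data", "model"), and batch sharding are illustrative, not the exact MaxText or PAX configuration.

```python
import jax
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Illustrative layout: 2 slices connected over DCN, with each slice's chips
# connected over fast in-slice ICI links.
num_slices = 2
chips_per_slice = jax.device_count() // num_slices

# Arrange all devices into a (dcn, ici) grid: the outer 'data' axis spans
# slices over the data center network, the inner 'model' axis stays inside
# a slice.
devices = np.asarray(jax.devices()).reshape(num_slices, chips_per_slice)
mesh = Mesh(devices, axis_names=("data", "model"))

# Single-slice code typically shards only over in-slice axes; the Multislice
# change is the extra leading 'data' dimension in the mesh and in the
# PartitionSpec used for the global batch.
batch_sharding = NamedSharding(mesh, P("data", None))
```

JAX's jax.experimental.mesh_utils also offers a create_hybrid_device_mesh helper for building this kind of ICI-plus-DCN mesh directly.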
High performance networking
Multislice supports AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter collective communication operations over Google’s Jupiter data center network. As reported in August 2022, Jupiter reduces flow completion time by 10%, improves throughput by 30%, uses 40% less power, incurs 30% lower capex, and delivers 50x less downtime than previous generations of the Google data center network.3
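As a rough illustration of how those collectives surface in user code, the sketch below performs an AllReduce across slices with jax.lax.psum under shard_map. The mesh layout, sizes, and axis names follow the earlier sketch and are assumptions, not a prescribed pattern.

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, PartitionSpec as P
from jax.experimental.shard_map import shard_map

# Hybrid mesh as before: 'data' spans slices over the DCN, 'model' spans
# chips within a slice over ICI. Sizes are illustrative.
num_slices = 2
devices = np.asarray(jax.devices()).reshape(num_slices, -1)
mesh = Mesh(devices, axis_names=("data", "model"))

def cross_slice_allreduce(partial):
    # AllReduce (sum) over the cross-slice 'data' axis; XLA lowers this
    # psum to a collective that runs over the data center network.
    return jax.lax.psum(partial, axis_name="data")

# Each slice contributes one row of partial results; the reduced result is
# replicated back to every slice.
partials = jnp.ones((num_slices, 4))
reduced = shard_map(cross_slice_allreduce, mesh,
                    in_specs=P("data"), out_specs=P())(partials)
```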
Easy to manage
There are two options for managing Multislice jobs: the Compute Engine Queued Resources CLI and API, or Google Kubernetes Engine (GKE).
Special options allow one-step creation and deletion of the entire collection of slices, and fast recovery means jobs restart quickly even when individual slices are interrupted.
Reliable and fault tolerant
Your model training jobs restart automatically from the previous checkpoint even if individual slices fail. Using Multislice with GKE further improves the failure-recovery experience: a single field change in the YAML file enables automatic retry when errors are encountered.
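The restart path itself is a standard resume-from-latest-checkpoint loop. Below is a minimal sketch using Orbax (the checkpointing library MaxText builds on); the checkpoint directory, state structure, and step counts are placeholders, not a specific production setup.

```python
import jax.numpy as jnp
import orbax.checkpoint as ocp

# Hypothetical checkpoint location and train state; after a slice failure
# the rescheduled job simply re-enters this loop.
CKPT_DIR = "gs://my-bucket/multislice-run/checkpoints"   # placeholder path
manager = ocp.CheckpointManager(CKPT_DIR, ocp.PyTreeCheckpointer())

state = {"step": 0, "params": jnp.zeros((1024,))}        # illustrative state
latest = manager.latest_step()
if latest is not None:
    # On restart (e.g. after GKE reschedules the job), resume from the most
    # recent checkpoint instead of step 0.
    state = manager.restore(latest)

for step in range(int(state["step"]), 10_000):
    # ... run one training step, updating `state` ...
    if step % 1_000 == 0:
        manager.save(step, state)
```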
“Google Cloud’s TPU Multislice provided significant productivity and efficiency gains for us right out-of-the-box, enabling us to scale our language model training reliably. We recommend Multislice to anyone building large generative language AI models.”—Emad Mostaque, CEO, Stability AI
Get started
Multislice was designed to enable efficient large-scale AI model training. To scale AI workloads, hardware and software must work in concert. We have kept AI development productivity top of mind and are excited for you to try Multislice in preview on both Cloud TPU v4 and the newly announced Cloud TPU v5e.
Please contact your Google Cloud account representative to learn more and try Cloud TPU with Multislice using PAX and MaxText.
1. Google internal data as of August 2023
2. Google internal data as of August 2023
3. Google internal data as of August 2023