GCP – Announcing new capabilities in Vertex AI Training for large-scale training
Building and scaling generative AI models demands enormous resources, but this process can get tedious. Developers wrestle with managing job queues, provisioning clusters, and resolving dependencies just to ensure consistent results. This infrastructure overhead, along with the difficulty of discovering the optimal training recipe and navigating the endless maze of hyperparameter and model architecture choices, slows the path to production-grade model training.
Today, we’re announcing expanded capabilities in Vertex AI Training that simplify and accelerate the path to developing large, highly differentiated models.
Our new managed training features, aimed at developers training with hundreds to thousands of AI accelerators, build on the best of Google Cloud’s AI infrastructure offerings, including Cluster Director for a fully managed and resilient Slurm environment, and add sophisticated management tools. These include pre-built data science tooling and optimized recipes integrated with frameworks like NVIDIA NeMo for specialized, massive-scale model building.
Built for customization and scale
Vertex AI Training delivers choice across the full spectrum of model customization. This range extends from cost-effective, lightweight tuning methods like LoRA for rapid behavioral refinement of models like Gemini, all the way to large-scale training of open-source or custom-built models on clusters for full domain specialization.
Vertex AI Training’s capabilities are organized around three areas:
1. Flexible, self-healing infrastructure
With Vertex AI Training, you can create a production-ready environment in minutes. By leveraging the included Cluster Director capabilities, customers benefit from a fully managed and resilient Slurm environment that simplifies large-scale training.
Automated resiliency features proactively check for and avoid stragglers, swiftly restart or replace faulty nodes, and utilize performance-optimized checkpointing functionality to maximize cluster uptime.
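The resumption pattern these resiliency features rely on can be illustrated with a minimal sketch. This is not the Vertex AI checkpointing API; it is a generic, self-contained example (all names are illustrative) of how periodic, atomic checkpoints let a restarted node pick up from the last saved step instead of losing the whole run:

```python
import json
import os
import tempfile

def save_checkpoint(path, step, state):
    # Write atomically: dump to a temp file, then rename over the target,
    # so a crash mid-write never leaves a corrupt checkpoint behind.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path))
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    # Resume from the last saved step, or start fresh if none exists.
    if not os.path.exists(path):
        return 0, {}
    with open(path) as f:
        ckpt = json.load(f)
    return ckpt["step"], ckpt["state"]

def train(path, total_steps, ckpt_every=10):
    # Resumes automatically: a replacement node calling train() with the
    # same checkpoint path continues from the last checkpoint, not step 0.
    step, state = load_checkpoint(path)
    while step < total_steps:
        step += 1
        state["loss"] = 1.0 / step  # stand-in for a real training step
        if step % ckpt_every == 0:
            save_checkpoint(path, step, state)
    return step, state
```

In a real large-scale job, the checkpoint interval trades off wasted recomputation after a failure against checkpoint-write overhead; managed checkpointing services optimize exactly that trade-off.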
To achieve optimal cost efficiency, you can provision Google Cloud capacity using our Dynamic Workload Scheduler (DWS). Calendar Mode provides fixed, future-dated reservations (up to 90 days), similar to a scheduled booking. Flex-start provides flexible, on-demand capacity requests (up to 7 days) that are fulfilled as soon as all requested resources become simultaneously available.
2. Comprehensive data science tooling
Our comprehensive data science tooling removes much of the guesswork from complex model development. It includes capabilities such as hyperparameter tuning (which automatically finds the best model settings), data optimization, and advanced model evaluation – all designed to ensure your specialized models are production-ready faster.
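To make the idea concrete, hyperparameter tuning at its simplest samples candidate configurations and keeps the one with the best validation score. The sketch below is a generic random-search loop with a toy objective, not the Vertex AI tuning API; every name in it is illustrative:

```python
import random

def random_search(objective, space, n_trials=20, seed=0):
    # Sample each hyperparameter uniformly from its (low, high) range and
    # keep the configuration with the lowest objective value.
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()}
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy objective: a quadratic bowl standing in for validation loss,
# minimized at lr=0.01, dropout=0.1.
def toy_loss(p):
    return (p["lr"] - 0.01) ** 2 + (p["dropout"] - 0.1) ** 2

space = {"lr": (1e-4, 0.1), "dropout": (0.0, 0.5)}
best, score = random_search(toy_loss, space)
```

Managed tuning services replace the random sampling here with smarter strategies (e.g. Bayesian optimization) and run trials in parallel, but the contract is the same: a search space in, a best-found configuration out.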
3. Integrated recipes and frameworks
Maximize training efficiency out-of-the-box with our curated, optimized recipes for the full model development lifecycle, from pre-training and continued pre-training to supervised fine-tuning (SFT) and Direct Preference Optimization (DPO). We also provide seamless integration of standardized frameworks like NVIDIA NeMo and NeMo-RL.
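For readers unfamiliar with DPO, the per-example objective such recipes optimize can be sketched directly. This is the standard published DPO loss, not code from any specific recipe; the function and argument names are illustrative:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # How much more the policy prefers the chosen response over the
    # rejected one, relative to a frozen reference model.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # Standard DPO objective: -log sigmoid(beta * margin). The loss shrinks
    # as the policy's preference for the chosen response grows.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

Averaged over a dataset of (prompt, chosen, rejected) triples, minimizing this loss aligns the model with human preferences without training a separate reward model, which is what makes DPO a popular fine-tuning stage after SFT.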
How customers are seeing impact with Vertex AI Training
Salesforce: The Salesforce AI Research team leveraged Vertex AI Training to expand the capabilities of their large action models. By fine-tuning these models for their unique business operations, Salesforce’s Gemini models now outperform industry-leading LLMs on key CRM benchmarks. This allows customers to automate complex, multi-step business processes more accurately and reliably, providing a dependable foundation for building AI agents.
“In the enterprise environment, it’s imperative for AI agents to be highly capable and highly consistent, especially for critical use cases. Together with Google Cloud, we are setting a new standard for building the future of what’s possible in the agentic enterprise down to the model level.” – Silvio Savarese, Chief Scientist at Salesforce
AI Singapore (AISG): AISG utilized Vertex AI Training’s managed training capabilities on reserved clusters to launch their 27-billion parameter flagship model. This extensive specialization project demanded peak infrastructure reliability and performance tuning to achieve precise language and contextual customization for diverse Southeast Asian markets.
“AI Singapore recently launched SEA-LION v4, an open source foundational model incorporating Southeast Asian contexts and languages. Vertex AI and its managed training clusters were instrumental in our development of SEA-LION v4. Vertex AI delivered a stable, resilient environment for our large-scale training workloads that was easy to set up and use. Its optimized training recipes helped increase training throughput performance by nearly 30%.” – William Tjhi, Head of Applied Research, AI Products Pillar, AI Singapore
Looking for more control?
For customers seeking maximum flexibility and control, our AI-optimized infrastructure is available via Google Compute Engine or through Google Kubernetes Engine, both of which include Cluster Director to provision and manage highly scalable AI training accelerators and clusters. Cluster Director provides the deep control over hardware, network optimization, capacity management, and operational efficiency that these advanced users demand.
Elevate your models today
Vertex AI Training provides the full range of approaches, the world-class infrastructure, and the expertise to make your AI your most powerful competitive asset. Interested customers should contact their Google Cloud sales representative to gain access and learn more about how Vertex AI Training can help deliver their unique business advantage.
