Anyscale powers AI compute for any workload using Google Compute Engine
Over the past decade, AI has evolved at a breakneck pace, turning from a futuristic dream into a tool now accessible to everyone. One of the technologies that opened up this new era of AI was Ray.
As the open-source AI Compute Engine, Ray has made it easier for developers to scale the most complex workloads such as multimodal data processing, model training, and inference across traditional and generative AI. Developed by Robert Nishihara, Philipp Moritz, and Ion Stoica at UC Berkeley’s RISELab in 2016, Ray powers AI and machine learning workloads and platforms for companies such as Netflix, Uber, RunwayML, and OpenAI.
In conversations across our community, we often see how solutions evolve incrementally to meet business demands — starting from data analytics with frameworks like Spark on traditional infrastructure (CPUs), progressing into machine learning and deep learning with GPUs and frameworks like PyTorch and TensorFlow, and now rapidly expanding into generative AI and the introduction of TPUs.
This incremental, bottom-up approach has led organizations into the AI Complexity Wall: fragmented infrastructure, countless accelerators, proliferating models, dozens of different frameworks, complex multimodal data processing, and a need to scale. Without a unified and optimized infrastructure, complexity quickly spirals into excessive cloud spending, resource inefficiencies, and productivity bottlenecks.
To solve this, Ion and his two Ph.D. students founded Anyscale, building upon Ray to offer a secure, scalable, cost-efficient, reliable, and optimized Unified AI Platform. Anyscale simplifies AI complexity, deployable in your environment or hosted by us, empowering teams—from a single laptop to thousands of GPUs—to accelerate AI model training and deployment.
Our motto is clear: “Any accelerator, any stack, any data, any model, any scale.”
And the results are powerful: Canva boosted machine utilization to almost 100% for distributed model training and lowered cloud costs by 50%; RunwayML is using massive video datasets for their multimodal foundation model Gen3-Alpha; Recursion processes 180 million images 7X faster; and Attentive lowered costs a staggering 99% while using 12X more data to create better models for their AI-powered marketing platform.
To deliver these results, we needed extremely agile, efficient, cost-effective, and performant infrastructure — something that’s become even clearer in the opening weeks of this year amid some of the recent breakthroughs gen AI providers have unlocked.
We knew early on that the shape of infrastructure would be crucial, and that we needed flexibility from our cloud provider on this front. To meet our customers’ needs, we needed a cloud provider who could meet our own complex and demanding requirements.
And, as a compute orchestration platform that works across most of the major cloud services, we have some perspective on what makes each one unique.
Flexible compute options for efficiency
Organizations are undertaking increasingly sophisticated AI projects, like orchestrating multiple LLMs (and non-LLMs) across diverse frameworks to deliver compound AI and agentic applications. AI workloads now span data processing, training, tuning, inference, and serving, with models of widely varying sizes and widely varying end-user application requirements.
To deliver these AI applications in production, our customers need to be as efficient as possible in their GPU utilization and cloud expenditures while meeting these requirements. While every cloud provider handles infrastructure a little bit differently, one of the things that makes Google Cloud different is how it handles the shape of infrastructure. That means hundreds of different instance types, each offering flexibility to choose from a wide range of GPUs and TPUs in differing amounts.
With Anyscale, customers can unlock this flexibility to optimize and run each model and workload on the most efficient hardware for the task. A single cluster can be made up of hundreds of machines, some with TPUs, some GPUs, to meet specific application requirements while lowering costs.
Plus, Anyscale supports leveraging Spot, on-demand, or fixed Capacity Reservations as it runs AI workloads, optimizing for price, availability, and efficiency. It also dynamically updates clusters to optimize for utilization: if you have two nodes each running at around 40%, Anyscale will consolidate the work and scale down with zero interruption.
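To make that concrete, here is a minimal Ray sketch of a pipeline whose stages land on different hardware in the same cluster. The preprocess, embed, and train functions are hypothetical stand-ins, and the "TPU" custom resource name and amounts are assumptions for illustration, not a prescribed Anyscale or Google Cloud configuration.

```python
import ray

ray.init()  # connect to a local or Anyscale-managed Ray cluster

# CPU-only preprocessing step.
@ray.remote(num_cpus=4)
def preprocess(batch):
    return [x.lower() for x in batch]

# GPU-backed embedding / inference step.
@ray.remote(num_gpus=1)
def embed(batch):
    return [hash(x) % 1000 for x in batch]

# Step scheduled onto TPU machines, assuming those nodes advertise a
# custom "TPU" resource; the name and amount are illustrative.
@ray.remote(resources={"TPU": 4})
def train(shards):
    return sum(len(s) for s in shards)

batches = [["Alpha", "Beta"], ["Gamma", "Delta"]]
embedded = [embed.remote(preprocess.remote(b)) for b in batches]
print(ray.get(train.remote(ray.get(embedded))))
```

Because each task declares the resources it needs, the scheduler places it on whichever machine in the mixed cluster satisfies that request, which is what lets one cluster combine GPU, TPU, and CPU-only nodes.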
Leading performance
Performant infrastructure can be the difference between getting to production or not. Working with Compute Engine, we can launch clusters of hundreds of nodes in less than 60 seconds. That fast launch time coupled with Ray’s ability to auto-scale has a huge impact on deploying AI applications. Imagine being able to launch hundreds of training or tuning jobs in parallel, each with thousands of machines using spot, and scaling back down to zero within seconds.
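As a rough sketch of that fan-out pattern with plain Ray tasks (the fine_tune function, the job count, and the retry setting below are hypothetical placeholders, not a prescribed Anyscale setup):

```python
import ray

ray.init()  # connect to a local or Anyscale-managed Ray cluster

# Hypothetical stand-in for a real training or tuning run; max_retries
# lets Ray re-run a task automatically if its node disappears, for
# example on a Spot preemption.
@ray.remote(num_gpus=1, max_retries=3)
def fine_tune(job_id: int) -> float:
    return 1.0 / (job_id + 1)  # placeholder validation metric

# Fan out hundreds of jobs; the autoscaler adds nodes while work is
# pending and scales back down once the results are collected.
refs = [fine_tune.remote(i) for i in range(200)]
best = max(ray.get(refs))
print(f"best score: {best:.3f}")
```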
We have one customer who saves over 18,000 GPU hours of compute PER MONTH thanks to how quickly clusters launch and scale.
Or imagine an online serving application that suddenly gets a large influx of traffic thanks to a new feature launch or the enthusiasm of a group of influential devs, forcing you to scale up quickly even with large LLMs. With Anyscale, you don't have to overprovision compute or miss SLAs while waiting for machines to launch.
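One way to express that kind of elasticity in code is Ray Serve's autoscaling configuration, sketched below. The replica bounds, target load, and GPU request are illustrative values only, and the echo response stands in for real model inference.

```python
from ray import serve

@serve.deployment(
    ray_actor_options={"num_gpus": 1},
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 50,
        # Recent Ray releases; older versions use
        # "target_num_ongoing_requests_per_replica".
        "target_ongoing_requests": 5,
    },
)
class LLMEndpoint:
    def __init__(self):
        self.ready = True  # a real replica would load model weights here

    async def __call__(self, request):
        prompt = (await request.json()).get("prompt", "")
        return {"completion": f"echo: {prompt}"}

# Deploy the endpoint; replicas scale up under load and back down as
# traffic subsides, within the configured bounds.
serve.run(LLMEndpoint.bind(), name="llm", route_prefix="/generate")
```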
Run anywhere, your way
The biggest consideration we have when we look to deploy Anyscale for a customer is where they will be running their workloads. Ultimately, we’re a distributed computing platform, and we enable our customers to scale AI workloads anywhere. But those workloads will always be heavily dependent on the data itself. One of our key value propositions is that we can unlock compute anywhere, whether it’s a hosted environment, public cloud, or private cloud.
This gives customers the freedom to run their models the way they need, with fewer limitations, without having to think about how it all works. That goes back to our core mission: playing a big, if quiet, role in supporting our customers as they grow, innovate, and refine their technologies.
Many customers take advantage of Cloud Storage and BigQuery to maintain all the data they're using to train their models. So, for our customers, it made sense for us to prioritize Google Cloud as a first-class provider. It makes it easier to integrate into their environments and run AI workloads close to their data.
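As a minimal sketch of what reading that data looks like from Ray, the snippet below streams Parquet files straight out of Cloud Storage with Ray Data (recent Ray releases also include a BigQuery datasource); the bucket path and the "value" column are hypothetical placeholders.

```python
import ray

# Read training data directly from Cloud Storage. Ray Data shards the
# read across the cluster so compute stays close to the data.
ds = ray.data.read_parquet("gs://example-bucket/training-data/")

def normalize(batch):
    # Simple CPU-side transform applied before data reaches the trainers.
    batch["value"] = batch["value"] / batch["value"].max()
    return batch

ds = ds.map_batches(normalize)
print(ds.schema())
```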
Since Google Cloud is a global provider with a common set of APIs, we can also deploy in all commercial regions to help support security or privacy requirements for customers. That’s helped us rapidly expand to new markets.
While we started with our Compute Engine stack, we needed to complete the vision of “any stack” to support the robust ML ecosystem that has developed around Kubernetes.
Thanks to our work with the Google Kubernetes Engine (GKE) team, we recently delivered the Anyscale Operator for K8s. Users can deploy Anyscale on their existing Kubernetes clusters, supercharging their AI workloads with the best performance, scalability, cost efficiency, and reliability. Together, RayTurbo (our hyperoptimized version of Ray) and GKE form the Distributed OS for AI.
Launching AI at massive scale
The compute requirements to train state-of-the-art models have grown 5x EVERY YEAR! This has been driven by scaling laws, which simply say that more data and more compute lead to better models.
More compute obviously means higher costs…and the cost of training has increased more than an order of magnitude every 2 years.
This scaling applies not only to training, but also to inference. OpenAI recently released o1, an advanced reasoning model whose inference-time context can be orders of magnitude larger than before; it can take tens of seconds to generate a single answer. This is the beginning of a new scaling era for model inference.
And finally, the era of multimodal data is here. Multimodal data like text, audio, images, and video can comprise as much as 80% of an organization's data and is inherently much larger than structured data.
AI today NEEDS scale.
Anyscale is built on GCE and GKE, meaning customers can scale from a single node and GPU to data center scale. We have one customer that processes 10 million videos a day over 10,000 GPUs, all powered by Google Cloud.
Lessons learned from scaling AI
The pace of AI innovation is staggering. We want to make sure that Ray continues to be the de facto standard for AI/ML workloads, and that Anyscale becomes known as the most performant, secure, and reliable platform to deliver Ray.
We have learned quite a few lessons about scaling AI over the past few years, including the following:
- Reliability needs to be consistent at any scale. As scale increases, so does the probability of hardware failure. This forced us to double down on functionality for handling memory limits, monitoring and observability, built-in fault tolerance, retry logic, and checkpointing.
- Scaling is not just about nodes; it's about scaling observability, too. The volume of logs generated by 5,000+ node clusters is extreme, and building tooling that works in that environment is often harder than scaling the cluster itself.
- Speed matters. With large-scale clusters, it's important to move data quickly, or move processing closer to the data, and to get compute running fast so it doesn't sit idle. Anyscale has focused on every layer of the stack to enable the fastest speeds possible.
- Scaling during development avoids a lot of pain. Developers mostly work on single nodes or local laptops. At Anyscale, we've built developer workspaces that can scale, allowing better testing in a distributed environment and the ability to clone the production environment and mirror it exactly to troubleshoot.
- Developer velocity determines project success. Machine learning is, quite literally, a process of trial and error. Developers need to move quickly through their experiments, and platforms shouldn't stand in the way of results. Ray's developer-friendly interface and libraries make it easy to unlock traditionally complicated workloads, while Anyscale lets developers work directly against autoscaling Ray clusters without worrying about the underlying infrastructure and provides the observability tools to resolve issues quickly.
- Performance and efficiency define the bottom line. Compute for AI is expensive, so you need to take full advantage of the resources on the best price-performance hardware available. Ray is inherently designed to support heterogeneous clusters and fractional resource allocations so you can right-size your workloads (a minimal example follows this list). Anyscale takes it a step further by continuously landing workloads on the most cost-efficient machines across existing reservations, Spot, and on-demand capacity.
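As a minimal illustration of that fractional resource allocation, the sketch below packs four small model replicas onto a single GPU instead of reserving one whole device per replica; SmallModel, its predict logic, and the 0.25-GPU share are hypothetical placeholders.

```python
import ray

ray.init()  # connect to a local or Anyscale-managed Ray cluster

# Each actor requests a quarter of a GPU, so four replicas share one
# device; Ray enforces the accounting, the application enforces memory.
@ray.remote(num_gpus=0.25)
class SmallModel:
    def __init__(self, name: str):
        self.name = name  # a real replica would load model weights here

    def predict(self, x: float) -> float:
        return x * 2.0  # placeholder for real inference

replicas = [SmallModel.remote(f"replica-{i}") for i in range(4)]
print(ray.get([r.predict.remote(1.5) for r in replicas]))
```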
While we continue to work toward enabling more powerful AI technologies, we remain focused on enabling developers to solve their unique AI challenges with performant, reliable, and cost efficient infrastructure. As the rules and capabilities of AI are changing constantly, Anyscale and Google will be ready for whatever comes next.
Read More for the details.