From LLMs to image generation: Accelerate inference workloads with AI Hypercomputer
From retail to gaming, from code generation to customer care, more and more organizations are running LLM-based applications, with 78% of organizations now developing or deploying them in production. As the number of generative AI applications and the volume of users scale, the need for performant, scalable, and easy-to-use inference technologies is critical. At Google Cloud, we’re paving the way for this next phase of AI’s rapid evolution with our AI Hypercomputer.
At Google Cloud Next 25, we shared many updates to AI Hypercomputer’s inference capabilities, unveiling Ironwood, our newest Tensor Processing Unit (TPU) designed specifically for inference, coupled with software enhancements such as simple and performant inference using vLLM on TPU and the latest GKE inference capabilities — GKE Inference Gateway and GKE Inference Quickstart.
With AI Hypercomputer, we also continue to push the envelope for performance with optimized software, backed by strong benchmarks:
- Google’s JetStream inference engine incorporates new performance optimizations, integrating Pathways for ultra-low-latency multi-host, disaggregated serving.
- MaxDiffusion, our reference implementation of latent diffusion models, delivers standout performance on TPUs for compute-heavy image generation workloads, and now supports Flux, one of the largest text-to-image generation models to date.
- The latest performance results from MLPerf™ Inference v5.0 demonstrate the power and versatility of Google Cloud’s A3 Ultra (NVIDIA H200) and A4 (NVIDIA HGX B200) VMs for inference.
Optimizing performance for JetStream: Google’s JAX inference engine
To maximize performance and reduce inference costs, we are excited to offer more choice when serving LLMs on TPU: we have further enhanced JetStream and brought TPU support to vLLM, a widely adopted, fast, and efficient library for serving LLMs. With both vLLM on TPU and JetStream, we deliver standout price-performance with low-latency, high-throughput inference, along with community support through open-source contributions and from Google AI experts.
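As a rough illustration of the vLLM path, the sketch below uses vLLM’s standard offline Python API. The model name, sequence length, and sampling settings are placeholders; on a Cloud TPU VM, vLLM needs to be installed with its TPU backend, after which device placement is handled by the runtime.

```python
# Minimal vLLM offline-inference sketch (placeholder model and settings).
# Assumes vLLM is installed with TPU support on a Cloud TPU VM; the same
# code runs unchanged on GPU hosts.
from vllm import LLM, SamplingParams

prompts = ["Summarize the benefits of disaggregated LLM serving in one sentence."]
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)

# Placeholder checkpoint; swap in the model you actually serve.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf", max_model_len=1024)

outputs = llm.generate(prompts, sampling_params)
for out in outputs:
    print(out.outputs[0].text)
```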
JetStream is Google’s open-source, throughput- and memory-optimized inference engine, purpose-built for TPUs and based on the same inference stack used to serve Gemini models. Since we announced JetStream last April, we have invested significantly in further improving its performance across a wide range of open models. When using JetStream, our sixth-generation Trillium TPU now delivers 2.9x higher throughput for Llama2-70B and 2.8x higher throughput for Mixtral 8x7B compared to TPU v5e (using our reference implementation, MaxText).
Figure 1: JetStream throughput (output tokens / second). Google internal data. Measured using Llama2-70B (MaxText) on Cloud TPU v5e-8 and Trillium 8-chips and Mixtral 8x7B (MaxText) on Cloud TPU v5e-4 and Trillium 4-chips. Maximum input length: 1024, maximum output length: 1024. As of April 2025.
Available for the first time for Google Cloud customers, Google’s Pathways runtime is now integrated into JetStream, enabling multi-host inference and disaggregated serving — two important features as model sizes grow exponentially and generative AI demands evolve.
Multi-host inference using Pathways distributes a model across multiple accelerator hosts during serving. This enables inference of large models that don’t fit on a single host. With multi-host inference, JetStream achieves 1,703 tokens/s on Llama3.1-405B on Trillium, which translates to three times more inference per dollar compared to TPU v5e.
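The core idea behind multi-host serving is sharding model weights across accelerators so no single host needs to hold the full checkpoint. The snippet below is a minimal JAX illustration of that idea, not JetStream or Pathways code: the shapes are toy placeholders, and in a real multi-host deployment each host runs the same program over its slice of the device mesh.

```python
# Toy JAX sketch of weight sharding across accelerators (placeholder shapes).
import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Build a 1-D mesh over all visible devices; in a multi-host JAX program
# this mesh spans every host's chips.
devices = mesh_utils.create_device_mesh((jax.device_count(),))
mesh = Mesh(devices, axis_names=("model",))

# Shard a weight matrix along the "model" axis so each device holds a slice.
weights = jnp.zeros((4096, 4096))
sharded_weights = jax.device_put(weights, NamedSharding(mesh, P(None, "model")))

@jax.jit
def forward(x, w):
    # XLA inserts any needed cross-device communication automatically.
    return x @ w

y = forward(jnp.ones((16, 4096)), sharded_weights)
print(y.shape, sharded_weights.sharding)
```

MaxText and JetStream express the same idea with named shardings over TPU meshes; Pathways adds the runtime that lets a single controller drive devices spanning multiple hosts.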
In addition, Pathways enables disaggregated serving, which lets workloads scale the prefill and decode stages of LLM inference independently and dynamically. This improves resource utilization and can boost performance and efficiency, especially for large models. For Llama2-70B on Trillium, multi-host disaggregated serving performs seven times better on prefill (time-to-first-token, TTFT) and nearly three times better on token generation (time-per-output-token, TPOT) than interleaving the prefill and decode stages on the same server.
Figure 2: Measured using Llama2-70B (MaxText) on Cloud TPU Trillium 16-chips (8 chips allocated for prefill server, 8 chips allocated for decode server). Measured using the OpenOrca dataset. Maximum input length: 1024, maximum output length: 1024. As of April 2025.
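To make the prefill/decode split concrete, here is a toy, framework-free sketch of the request flow. Every class and field name is invented for illustration; this is not the JetStream or Pathways API, just the shape of the handoff between a compute-bound prefill pool and a memory-bandwidth-bound decode pool.

```python
# Toy illustration of disaggregated serving: prefill and decode run on
# separate pools that scale independently. Names are hypothetical.
from dataclasses import dataclass

@dataclass
class PrefillResult:
    kv_cache: list        # per-layer KV cache produced from the prompt
    first_token: str      # token emitted at the end of prefill (drives TTFT)

class PrefillServer:
    """Compute-bound: processes the full prompt once per request."""
    def process(self, prompt: str) -> PrefillResult:
        kv_cache = [f"kv[{tok}]" for tok in prompt.split()]
        return PrefillResult(kv_cache=kv_cache, first_token="<tok0>")

class DecodeServer:
    """Memory-bandwidth-bound: emits one token per step from the cache."""
    def generate(self, result: PrefillResult, max_new_tokens: int) -> list:
        tokens = [result.first_token]
        for step in range(1, max_new_tokens):
            result.kv_cache.append(f"kv[step{step}]")  # cache grows per step
            tokens.append(f"<tok{step}>")
        return tokens

# The prefill pool can be sized for TTFT and the decode pool for TPOT,
# instead of interleaving both phases on the same servers.
prefill_pool, decode_pool = PrefillServer(), DecodeServer()
handoff = prefill_pool.process("Explain disaggregated serving in one line")
print(decode_pool.generate(handoff, max_new_tokens=4))
```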
Customers like Osmos are using TPUs to maximize cost-efficiency for inference at scale:
“Osmos is building the world’s first AI Data Engineer. This requires us to deploy AI technologies at the cutting edge of what is possible today. We are excited to continue our journey building on Google TPUs as our AI infrastructure for training and inference. We have vLLM and JetStream in scaled production deployment on Trillium and are able to achieve industry leading performance at over 3500 tokens/sec per v6e node for long sequence inference for 70B class models. This gives us industry leading tokens/sec/$, comparable to not just other hardware infrastructure, but also fully managed inference services. The availability of TPUs and the ease of deployment on AI Hypercomputer lets us build out an Enterprise software offering with confidence.” – Kirat Pandya, CEO, Osmos
MaxDiffusion: High-performance diffusion model inference
Beyond LLMs, Trillium demonstrates standout performance on compute-heavy workloads like image generation. MaxDiffusion provides a collection of reference implementations for various latent diffusion models. In addition to Stable Diffusion inference, we have expanded MaxDiffusion to support Flux; with 12 billion parameters, Flux is one of the largest open-source text-to-image models to date.
As demonstrated in MLPerf 5.0, Trillium now delivers a 3.5x improvement in queries/second on Stable Diffusion XL (SDXL) relative to the last performance round of its predecessor, TPU v5e, and a further 12% throughput gain over our MLPerf 4.1 submission.
Figure 3: MaxDiffusion throughput (images per second). Google internal data. Measured using the SDXL model on Cloud TPU v5e-4 and Trillium 4-chip. Resolution: 1024×1024, batch size per device: 16, decode steps: 20. As of April 2025.
With this throughput, MaxDiffusion delivers a cost-efficient solution: the cost to generate 1,000 images is as low as 22 cents on Trillium, 35% less than on TPU v5e.
Figure 4: Diffusion cost to generate 1000 images. Google internal data. Measured using the SDXL model on Cloud TPU v5e-4 and Cloud TPU Trillium 4-chip. Resolution: 1024×1024, batch size per device: 2, decode steps: 4. Cost is based on the 3Y CUD prices for Cloud TPU v5e-4 and Cloud TPU Trillium 4-chip in the US. As of April 2025.
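The cost figure follows from simple arithmetic on throughput and the hourly accelerator price. The numbers below are placeholders, not published prices or benchmark results; substitute the 3-year CUD price for your region and your measured SDXL throughput to reproduce the calculation behind Figure 4.

```python
# Back-of-the-envelope cost model: dollars per 1,000 generated images.
# Both inputs are placeholders for illustration only.
tpu_hourly_price_usd = 10.00   # placeholder: 4-chip slice, 3Y CUD price
images_per_second = 12.0       # placeholder: measured SDXL throughput

images_per_hour = images_per_second * 3600
cost_per_1000_images = tpu_hourly_price_usd / images_per_hour * 1000
print(f"${cost_per_1000_images:.2f} per 1,000 images")
```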
A3 Ultra and A4 VMs: MLPerf 5.0 Inference results
For MLPerf™ Inference v5.0, we submitted 15 results, including our first submission with A3 Ultra (NVIDIA H200) and A4 (NVIDIA HGX B200) VMs. The A3 Ultra VM is powered by eight NVIDIA H200 Tensor Core GPUs and offers 3.2 Tbps of GPU-to-GPU non-blocking network bandwidth and twice the high bandwidth memory (HBM) compared to A3 Mega with NVIDIA H100 GPUs. Google Cloud’s A3 Ultra demonstrated highly competitive performance, achieving results comparable to NVIDIA’s peak GPU submissions across LLMs, MoE, image, and recommendation models.
Google Cloud was the only cloud provider to submit results on NVIDIA HGX B200 GPUs, demonstrating excellent performance of A4 VM for serving LLMs including Llama 3.1 405B (a new benchmark introduced in MLPerf 5.0). A3 Ultra and A4 VMs both deliver powerful inference performance, a testament to our deep partnership with NVIDIA to provide infrastructure for the most demanding AI workloads.
Customers like JetBrains are using Google Cloud GPU instances to accelerate their inference workloads:
“We’ve been using A3 Mega VMs with NVIDIA H100 Tensor Core GPUs on Google Cloud to run LLM inference across multiple regions. Now, we’re excited to start using A4 VMs powered by NVIDIA HGX B200 GPUs, which we expect will further reduce latency and enhance the responsiveness of AI in JetBrains IDEs.” – Vladislav Tankov, Director of AI, JetBrains
AI Hypercomputer is powering the age of AI inference
Google’s innovations in AI inference, spanning hardware advancements in Google Cloud TPUs and NVIDIA GPUs and software such as JetStream, MaxText, and MaxDiffusion, bring together integrated frameworks and accelerators to enable the next wave of AI breakthroughs. Learn more about using AI Hypercomputer for inference, then check out these JetStream and MaxDiffusion recipes to get started today.