Amazon DynamoDB now supports a new warm throughput value and the ability to easily pre-warm DynamoDB tables and indexes. The warm throughput value provides visibility into the number of read and write operations your DynamoDB tables can readily handle, while pre-warming lets you proactively increase the value to meet future traffic demands.
DynamoDB automatically scales to support workloads of virtually any size. However, when you have peak events like product launches or shopping events, request rates can surge 10x or even 100x in a short period of time. You can now check your tables’ warm throughput value to assess if your table can handle large traffic spikes for peak events. If you expect an upcoming peak event to exceed the current warm throughput value for a given table, you can pre-warm that table in advance of the peak event to ensure it scales instantly to meet demand.
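For example, a minimal sketch using the AWS CLI (the table name is a placeholder, and the exact WarmThroughput field names should be verified against your CLI version):

```bash
# Inspect the current warm throughput value for a table (field names follow the
# DescribeTable/UpdateTable WarmThroughput shape; verify against your CLI version).
aws dynamodb describe-table \
  --table-name OrdersTable \
  --query "Table.WarmThroughput"

# Pre-warm the table ahead of a peak event by raising the warm throughput value.
aws dynamodb update-table \
  --table-name OrdersTable \
  --warm-throughput ReadUnitsPerSecond=150000,WriteUnitsPerSecond=50000
```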
Warm throughput values are available for all provisioned and on-demand tables and indexes at no cost. Pre-warming your table’s throughput incurs a charge. See Amazon DynamoDB Pricing page for pricing details. This capability is now available in all AWS commercial Regions. See the Developer Guide to learn more.
Amazon SageMaker Model Registry now supports tracking machine learning (ML) model lineage, enabling you to automatically capture and retain information about the steps of an ML workflow, from data preparation and training to model registration and deployment.
Customers use Amazon SageMaker Model Registry as a purpose-built metadata store to manage the entire lifecycle of ML models. With this launch, data scientists and ML engineers can now easily capture and view the model lineage details such as datasets, training jobs, and deployment endpoints in Model Registry. When they register a model, Model Registry begins tracking the lineage of the model from development to deployment. This creates an audit trail that enables traceability and reproducibility, providing visibility across the model lifecycle to improve model governance.
Managing applications across multiple Kubernetes clusters is complex, especially when those clusters span different environments or even cloud providers. One powerful and secure solution combines Google Kubernetes Engine (GKE) fleets and Argo CD, a declarative, GitOps continuous delivery tool for Kubernetes. The solution is further enhanced with Connect Gateway and Workload Identity.
This blog post guides you in setting up a robust, team-centric multi-cluster infrastructure with these offerings. We use a sample GKE fleet with application clusters for your workloads and a control cluster to host Argo CD. To streamline authentication and enhance security, we leverage Connect Gateway and Workload Identity, enabling Argo CD to securely manage clusters without the burden of maintaining cumbersome Kubernetes service accounts.
On top of this, we incorporate GKE Enterprise Teams to manage access and resources, helping to ensure that each team has the right permissions and namespaces within this secure framework.
Finally, we introduce the fleet-argocd-plugin, a custom Argo CD generator designed to simplify cluster management within this sophisticated setup. This plugin automatically imports your GKE Fleet cluster list into Argo CD and maintains synchronized cluster information, making it easier for platform admins to manage resources and for application teams to focus on deployments.
Follow along as we:
Create a GKE fleet with application and control clusters
Deploy Argo CD on the control cluster, configured to use Connect Gateway and Workload Identity
Configure GKE Enterprise Teams for granular access control
Install and leverage the fleet-argocd-plugin to manage your secure, multi-cluster fleet with team awareness
By the end, you’ll have a powerful and automated multi-cluster system using GKE Fleets, Argo CD, Connect Gateway, Workload Identity, and Teams, ready to support your organization’s diverse needs and security requirements. Let’s dive in!
Set up multi-cluster infrastructure with GKE fleet and Argo CD
Setting up a sample GKE fleet is a straightforward process:
1. Enable the required APIs in the desired Google Cloud Project. We use this project as the fleet host project.
a. The gcloud CLI must be installed, and you must be authenticated via gcloud auth login; a sketch of this setup follows below.
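A rough sketch of enabling the APIs and creating the application clusters (the project ID is a placeholder and the API list is indicative; adjust for the features you enable):

```bash
export FLEET_PROJECT_ID=my-fleet-host-project   # placeholder fleet host project ID
gcloud config set project $FLEET_PROJECT_ID

# Enable the APIs used by GKE, fleets, and Connect Gateway (indicative list).
gcloud services enable \
  container.googleapis.com \
  gkehub.googleapis.com \
  connectgateway.googleapis.com \
  cloudresourcemanager.googleapis.com

# Create two application clusters and register them to the fleet at creation time.
gcloud container clusters create app-cluster-1 --enable-fleet --region=us-central1
gcloud container clusters create app-cluster-2 --enable-fleet --region=us-central1
```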
```bash
# Create a frontend team.
gcloud container fleet scopes create frontend

# Add your application clusters to the frontend team.
gcloud container fleet memberships bindings create app-cluster-1-b \
  --membership app-cluster-1 \
  --scope frontend \
  --location us-central1

gcloud container fleet memberships bindings create app-cluster-2-b \
  --membership app-cluster-2 \
  --scope frontend \
  --location us-central1

# Create a fleet namespace for webserver.
gcloud container fleet scopes namespaces create webserver --scope=frontend

# [Optional] Verify your fleet team setup.
# Check member clusters in your fleet.
gcloud container fleet memberships list
# Verify member clusters have been added to the right team (`scope`).
gcloud container fleet memberships bindings list --membership=app-cluster-1 --location=us-central1
gcloud container fleet memberships bindings list --membership=app-cluster-2 --location=us-central1
```
4. Now, set up Argo CD and deploy it to the control cluster. Create a new GKE cluster to serve as your control cluster and enable Workload Identity on it, as sketched below.
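A hedged sketch of this step (the cluster name and region are placeholders; the Argo CD install follows the project's standard manifest-based installation):

```bash
# Create the control cluster with Workload Identity enabled (names are placeholders).
gcloud container clusters create argocd-control-cluster \
  --region=us-central1 \
  --enable-fleet \
  --workload-pool=$FLEET_PROJECT_ID.svc.id.goog

# Install Argo CD on the control cluster using the upstream manifests.
gcloud container clusters get-credentials argocd-control-cluster --region=us-central1
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
```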
5. Install the Argo CD CLI to interact with the Argo CD API server. Version 2.8.0 or higher is required. Detailed installation instructions can be found via the CLI installation documentation.
Now you’ve got your GKE fleet up and running, and you’ve installed Argo CD on the control cluster. In Argo CD, application clusters are registered with the control cluster by storing their credentials (like API server address and authentication details) as Kubernetes Secrets within the Argo CD namespace. We’ve got a way to make this whole process a lot easier!
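For context, manually registering an application cluster means maintaining a Secret like the following for every cluster (a simplified sketch with placeholder endpoint and credentials, following Argo CD's declarative cluster-secret format):

```bash
# Hypothetical example of the per-cluster Secret Argo CD expects when clusters
# are registered by hand; the fleet-argocd-plugin generates and syncs these for you.
kubectl apply -n argocd -f - <<'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: app-cluster-1
  labels:
    argocd.argoproj.io/secret-type: cluster
type: Opaque
stringData:
  name: app-cluster-1
  server: https://1.2.3.4   # placeholder API server address
  config: |
    {
      "bearerToken": "<redacted-service-account-token>",
      "tlsClientConfig": {
        "caData": "<base64-encoded-ca-certificate>"
      }
    }
EOF
```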
8. To make sure the fleet-argocd-plugin works as it should, give it the right permissions for fleet management.
a. Create an IAM service account in your Argo CD control cluster and grant it the appropriate permissions. The setup follows the official onboarding guide of GKE Workload Identity Federation.
```bash
gcloud iam service-accounts create argocd-fleet-admin \
  --project=$FLEET_PROJECT_ID

gcloud projects add-iam-policy-binding $FLEET_PROJECT_ID \
  --member "serviceAccount:argocd-fleet-admin@$FLEET_PROJECT_ID.iam.gserviceaccount.com" \
  --role "roles/container.developer"

gcloud projects add-iam-policy-binding $FLEET_PROJECT_ID \
  --member "serviceAccount:argocd-fleet-admin@$FLEET_PROJECT_ID.iam.gserviceaccount.com" \
  --role "roles/gkehub.gatewayEditor"

gcloud projects add-iam-policy-binding $FLEET_PROJECT_ID \
  --member "serviceAccount:argocd-fleet-admin@$FLEET_PROJECT_ID.iam.gserviceaccount.com" \
  --role "roles/gkehub.viewer"

# Allow the Argo CD application controller and fleet-argocd-plugin to impersonate this IAM service account.
gcloud iam service-accounts add-iam-policy-binding argocd-fleet-admin@$FLEET_PROJECT_ID.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:$FLEET_PROJECT_ID.svc.id.goog[argocd/argocd-application-controller]"
gcloud iam service-accounts add-iam-policy-binding argocd-fleet-admin@$FLEET_PROJECT_ID.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:$FLEET_PROJECT_ID.svc.id.goog[argocd/argocd-fleet-sync]"

# Annotate the Kubernetes ServiceAccount so that GKE sees the link between the service accounts.
kubectl annotate serviceaccount argocd-application-controller \
  --namespace argocd \
  iam.gke.io/gcp-service-account=argocd-fleet-admin@$FLEET_PROJECT_ID.iam.gserviceaccount.com
```
b. You also need to allow the Google Compute Engine default service account to access images from your Artifact Registry repository, as sketched below.
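A minimal sketch, assuming your node pools run as the Compute Engine default service account and the images live in Artifact Registry in the fleet host project:

```bash
# Grant the Compute Engine default service account read access to Artifact Registry.
# FLEET_PROJECT_NUMBER is the fleet host project's number; adjust if you use a custom node service account.
gcloud projects add-iam-policy-binding $FLEET_PROJECT_ID \
  --member "serviceAccount:$FLEET_PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
  --role "roles/artifactregistry.reader"
```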
Let’s do a quick check to make sure the GKE fleet and Argo CD are playing nicely together. You should see that the secrets for your application clusters have been automatically generated.
```bash
kubectl get secret -n argocd

# Example Output:                          TYPE     DATA   AGE
# app-cluster-1.us-central1.141594892609   Opaque   3      64m
# app-cluster-2.us-central1.141594892609   Opaque   3      64m
```
Demo 1: Automatic fleet management in Argo CD
Okay, let’s see how this works! We’ll use the guestbook example app. First, we deploy it to the clusters that the frontend team uses. You should then see the guestbook app running on your application clusters, and you won’t have to deal with any cluster secrets manually!
```bash
export TEAM_ID=frontend
envsubst '$FLEET_PROJECT_NUMBER $TEAM_ID' < applicationset-demo.yaml | kubectl apply -f - -n argocd

kubectl config set-context --current --namespace=argocd
argocd app list -o name
# Example Output:
# argocd/app-cluster-1.us-central1.141594892609-webserver
# argocd/app-cluster-2.us-central1.141594892609-webserver
```
Demo 2: Evolving your fleet is easy with fleet-argocd-plugin
Suppose you decide to add another cluster to the frontend team. Create a new GKE cluster and assign it to the frontend team. Then, check to see if your guestbook app has been deployed on the new cluster.
```bash
gcloud container clusters create app-cluster-3 --enable-fleet --region=us-central1
gcloud container fleet memberships bindings create app-cluster-3-b \
  --membership app-cluster-3 \
  --scope frontend \
  --location us-central1

argocd app list -o name
# Example Output: a new app shows up!
# argocd/app-cluster-1.us-central1.141594892609-webserver
# argocd/app-cluster-2.us-central1.141594892609-webserver
# argocd/app-cluster-3.us-central1.141594892609-webserver
```
Closing thoughts
In this blog post, we’ve shown you how to combine the power of GKE fleets, Argo CD, Connect Gateway, Workload Identity, and GKE Enterprise Teams to create a robust and automated multi-cluster platform. By leveraging these tools, you can streamline your Kubernetes operations, enhance security, and empower your teams to efficiently manage and deploy applications across your fleet.
However, this is just the beginning! There’s much more to explore in the world of multi-cluster Kubernetes. Here are some next steps to further enhance your setup:
Deep dive into GKE Enterprise Teams: Explore the advanced features of GKE Enterprise Teams to fine-tune access control, resource allocation, and namespace management for your teams. Learn more in the official documentation.
Secure your clusters with Connect Gateway: Delve deeper into Connect Gateway and Workload Identity to understand how they simplify and secure authentication to your clusters, eliminating the need for VPNs or complex network configurations. Check out this blog post for a detailed guide.
Master advanced deployment strategies: Explore advanced deployment strategies with Argo CD, such as blue/green deployments, canary releases, and automated rollouts, to achieve zero-downtime deployments and minimize risk during updates. This blog post provides a great starting point.
As you continue your journey with multi-cluster Kubernetes, remember that GKE fleets and Argo CD provide a solid foundation for building a scalable, secure, and efficient platform. Embrace the power of automation, GitOps principles, and team-based management to unlock the full potential of your Kubernetes infrastructure.
As AI models increase in sophistication, there’s increasingly large model data needed to serve them. Loading the models and weights along with necessary frameworks to serve them for inference can add seconds or even minutes of scaling delay, impacting both costs and the end-user’s experience.
For example, inference servers such as Triton, Text Generation Inference (TGI), or vLLM are packaged as containers that are often over 10GB in size; this can make them slow to download, and extend pod startup times in Kubernetes. Then, once the inference pod starts, it needs to load model weights, which can be hundreds of GBs in size, further adding to the data loading problem.
This blog explores techniques to accelerate data loading for both inference serving containers and downloading models + weights, so you can accelerate the overall time to load your AI/ML inference workload on Google Kubernetes Engine (GKE).
1. Accelerating container load times using secondary boot disks to cache container images with your inference engine and applicable libraries directly on the GKE node.
The image above shows a secondary boot disk (1) that stores the container image ahead of time, avoiding the image download process during pod/container startup. And for AI/ML inference workloads with demanding speed and scale requirements, Cloud Storage Fuse (2) and Hyperdisk ML (3) are options for connecting the pod to model and weight data stored in Cloud Storage or on a network-attached disk. Let's look at each of these approaches in more detail below.
Accelerating container load times with secondary boot disks
GKE lets you pre-cache your container image into a secondary boot disk that is attached to your node at creation time. The benefit of loading your containers this way is that you skip the image download step and can begin launching your containers immediately, which drastically improves startup time. The diagram below shows container image download times grow linearly with container image size. Those times are then compared with using a cached version of the container image that is pre-loaded on the node.
Caching a 16GB container image ahead of time on a secondary boot disk has shown reductions in load time of up to 29x when compared with downloading the container image from a container registry. Additionally, this approach lets you benefit from the acceleration independent of container size, allowing for large container images to be loaded predictably fast!
To use secondary boot disks, first create a disk containing all your images, create a disk image from it, and then specify that disk image as a secondary boot disk when creating your GKE node pools, as sketched below. For more, see the documentation.
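A hedged sketch of the node-pool step (cluster, pool, and disk image names are placeholders; verify the secondary boot disk flag syntax against your gcloud version):

```bash
# Create a node pool whose nodes attach a secondary boot disk pre-loaded with the
# inference container image, used as a container image cache.
gcloud container node-pools create inference-pool \
  --cluster=my-inference-cluster \
  --region=us-central1 \
  --secondary-boot-disk=disk-image=projects/$PROJECT_ID/global/images/inference-image-cache,mode=CONTAINER_IMAGE_CACHE
```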
Accelerating model weights load times
Many ML frameworks output their checkpoints (snapshots of model weights) to object storage such as Google Cloud Storage, a common choice for long-term storage. Using Cloud Storage as the source of truth, there are two main products to retrieve your data at the GKE-pod level: Cloud Storage Fuse and Hyperdisk ML (HdML).
When selecting one product or the other there are two main considerations:
Performance – how quickly can the data be loaded by the GKE node
Operational simplicity – how easy is it to update this data
Cloud Storage Fuse provides a direct link to Cloud Storage for model weights that reside in object storage buckets. Additionally there is a caching mechanism for files that need to be read multiple times to prevent additional downloads from the source bucket (which adds latency). Cloud Storage Fuse is appealing because there are no pre-hydration operational activities for a pod to do to download new files in a given bucket. It’s important to note that if you switch buckets that the pod is connected to, you will need to restart the pod with an updated Cloud Storage Fuse configuration. To further improve performance, you can enable parallel downloads, which spawns multiple workers to download a model, significantly improving model pull performance.
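A hedged sketch of mounting a model bucket with the Cloud Storage FUSE CSI driver and parallel downloads enabled (the bucket, image, and service account names are placeholders, and the exact mount-option keys are assumptions to verify against the GKE documentation):

```bash
# Hypothetical pod spec mounting model weights from Cloud Storage via the
# Cloud Storage FUSE CSI driver, with the file cache and parallel downloads enabled.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: inference-server
  annotations:
    gke-gcsfuse/volume: "true"   # injects the gcsfuse sidecar
spec:
  serviceAccountName: inference-ksa   # assumed KSA linked to a GCP SA via Workload Identity
  containers:
  - name: server
    image: us-docker.pkg.dev/my-project/inference/vllm:latest   # placeholder image
    volumeMounts:
    - name: model-weights
      mountPath: /models
      readOnly: true
  volumes:
  - name: model-weights
    csi:
      driver: gcsfuse.csi.storage.gke.io
      readOnly: true
      volumeAttributes:
        bucketName: my-model-weights-bucket   # placeholder bucket
        mountOptions: "implicit-dirs,file-cache:enable-parallel-downloads:true"
EOF
```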
Hyperdisk ML gives you better performance and scalability than downloading files directly to the pod from Cloud Storage or another online location. Additionally, you can attach up to 2,500 nodes to a single Hyperdisk ML instance, with aggregate bandwidth of up to 1.2 TiB/sec. This makes it a strong choice for inference workloads that span many nodes and where the same data is downloaded repeatedly in a read-only fashion. To use Hyperdisk ML, load your data on the Hyperdisk ML disk prior to using it, and again upon each update. Note that this adds operational overhead if your data changes frequently.
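A minimal sketch of the pre-hydration step, assuming the hyperdisk-ml disk type and placeholder names, sizes, and zones; the GKE StorageClass and read-only-many PersistentVolume wiring is omitted here:

```bash
# Create a Hyperdisk ML volume sized for the model weights (name, size, and zone are placeholders).
gcloud compute disks create model-weights-disk \
  --type=hyperdisk-ml \
  --size=1TB \
  --zone=us-central1-a

# Attach the disk to a temporary loader VM or job, copy the weights onto it once, then detach.
# GKE pods then consume the disk read-only (many readers) through a PersistentVolume and PVC,
# and the data must be re-loaded this way whenever it changes.
```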
Which model-and-weight loading product you use depends on your use case. Beyond raw throughput, the comparison also covers availability and how data updates are handled. For Hyperdisk ML, for example:
Availability: Zonal. Data can be made regional with an automated GKE clone feature to make data available across zones.
Updating data: Create a new persistent volume, load the new data, and redeploy pods whose PVC references the new volume.
As you can see there are other considerations besides throughput to take into account when architecting a performant model loading strategy.
Conclusion
Loading large AI models, weights, and container images onto GKE can delay workload startup times. By combining the methods described above (a secondary boot disk for container images, and Hyperdisk ML or Cloud Storage Fuse for models and weights), you can accelerate data load times for your AI/ML inference applications.
AWS Control Tower customers can now use the ResetEnabledControl API to programmatically resolve control drift or re-deploy a control to its intended configuration. Control drift occurs when an AWS Control Tower managed control is modified outside of AWS Control Tower governance. Resolving drift helps you adhere to your governance and compliance requirements. You can use this API with all AWS Control Tower optional controls except service control policy (SCP)-based preventive controls. AWS Control Tower APIs enhance the end-to-end developer experience by enabling automation for integrated workflows and managing workloads at scale.
Below is the list of AWS Control Tower control APIs that are now supported in the regions where AWS Control Tower is available. Please visit the AWS Control Tower API reference for more information.
AWS Control Tower control APIs: EnableControl, DisableControl, GetControlOperation, GetEnabledControl, ListEnabledControls, UpdateEnabledControl, TagResource, UntagResource, ListTagsForResource, and ResetEnabledControl.
To learn more, visit the AWS Control Tower homepage. For more information about the AWS Regions where AWS Control Tower is available, see the AWS Region table.
Starting today, Amazon Elastic Compute Cloud (Amazon EC2) M2 Mac instances are generally available (GA) in the AWS Canada (Central) Region. This marks the first time we are introducing Mac instances to an AWS Canadian Region, providing customers with even greater global access to Apple silicon hardware. Customers can now run their macOS workloads in the AWS Canada (Central) Region to satisfy data residency requirements, benefit from improved latency to end users, and integrate with their pre-existing AWS environment configurations within this Region.
M2 Mac instances deliver up to 10% faster performance than M1 Mac instances when building and testing applications for Apple platforms such as iOS, macOS, iPadOS, tvOS, watchOS, visionOS, and Safari. M2 Mac instances are powered by the AWS Nitro System and are built on Apple M2 Mac mini computers featuring an 8-core CPU, a 10-core GPU, 24 GiB of memory, and a 16-core Apple Neural Engine.
With this expansion, EC2 M2 Mac instances are available across the US East (N. Virginia, Ohio), US West (Oregon), Europe (Frankfurt), Asia Pacific (Sydney), and Canada (Central) Regions. To learn more or get started, see Amazon EC2 Mac Instances or visit the EC2 Mac documentation reference.
AWS Directory Service for Microsoft Active Directory, also known as AWS Managed Microsoft AD, and AD Connector are now available in the AWS Asia Pacific (Malaysia) Region.
Built on actual Microsoft Active Directory (AD), AWS Managed Microsoft AD enables you to migrate AD-aware applications while reducing the work of managing AD infrastructure in the AWS Cloud. You can use your Microsoft AD credentials to connect to AWS applications such as Amazon Relational Database Service (RDS) for SQL Server, Amazon RDS for PostgreSQL, and Amazon RDS for Oracle. You can keep your identities in your existing Microsoft AD or create and manage identities in your AWS managed directory.
AD Connector is a proxy that enables AWS applications to use your existing on-premises AD identities without requiring AD infrastructure in the AWS Cloud. You can also use AD Connector to join Amazon EC2 instances to your on-premises AD domain and manage these instances using your existing group policies.
You can now use Amazon Timestream for InfluxDB in the Amazon Web Services China (Beijing) Region, operated by Sinnet, and the Amazon Web Services China (Ningxia) Region, operated by NWCD. Timestream for InfluxDB makes it easy for application developers and DevOps teams to run fully managed InfluxDB databases on Amazon Web Services for real-time time-series applications using open-source APIs.
Timestream for InfluxDB offers the full feature set available in the InfluxDB 2.7 release of the open-source version, and adds deployment options with Multi-AZ high availability and enhanced durability. For high availability, Timestream for InfluxDB allows you to automatically create a primary database instance and synchronously replicate the data to an instance in a different Availability Zone. When it detects a failure, Timestream for InfluxDB automatically fails over to a standby instance without manual intervention.
With the latest release, customers can use Amazon Timestream for InfluxDB in the following regions: US East (Ohio), US East (N. Virginia), US West (Oregon), Canada (Central), Asia Pacific (Mumbai), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Asia Pacific (Jakarta), Europe (Paris), Europe (Frankfurt), Europe (Ireland), Europe (London), Europe (Milan), Europe (Stockholm), Europe (Spain), Middle East (UAE), Amazon Web Services China (Beijing) Region, operated by Sinnet, and Amazon Web Services China (Ningxia) Region, operated by NWCD. To get started with Amazon Timestream, visit our product page.
As generative AI evolves, we’re beginning to see the transformative potential it is having across industries and our lives. And as large language models (LLMs) increase in size — current models are reaching hundreds of billions of parameters, and the most advanced ones are approaching 2 trillion — the need for computational power will only intensify. In fact, training these large models on modern accelerators already requires clusters that exceed 10,000 nodes.
With support for 15,000-node clusters — the world’s largest — Google Kubernetes Engine (GKE) has the capacity to handle these demanding training workloads. Today, in anticipation of even larger models, we are introducing support for 65,000-node clusters.
With support for up to 65,000 nodes, we believe GKE offers more than 10X larger scale than the other two largest public cloud providers.
Unmatched scale for training or inference
Scaling to 65,000 nodes provides much-needed capacity to the world’s most resource-hungry AI workloads. Combined with innovations in accelerator computing power, this will enable customers to reduce model training time or scale models to multi-trillion parameters or more. Each node is equipped with multiple accelerators (e.g., Cloud TPU v5e node with four chips), giving the ability to manage over 250,000 accelerators in one cluster.
To develop cutting-edge AI models, customers need to be able to allocate computing resources across diverse workloads. This includes not only model training but also serving, inference, conducting ad hoc research, and managing auxiliary tasks. Centralizing computing power within the smallest number of clusters provides customers the flexibility to quickly adapt to changes in demand from inference serving, research and training workloads.
With support for 65,000 nodes, GKE now allows running five jobs in a single cluster, each matching the scale of Google Cloud’s previous world record for the world’s largest training job for LLMs.
Customers on the cutting edge of AI welcome these developments. Anthropic is an AI safety and research company that’s working to build reliable, interpretable, and steerable AI systems, and is excited for GKE’s expanded scale.
“GKE’s new support for larger clusters provides the scale we need to accelerate our pace of AI innovation.” – James Bradbury, Head of Compute, Anthropic
Innovations under the hood
This achievement is made possible by a variety of enhancements. For one, we are transitioning GKE from etcd, the open-source distributed key-value store, to a new, more robust key-value store based on Spanner, Google's distributed database that delivers virtually unlimited scale. On top of the ability to support larger GKE clusters, this change will usher in new levels of reliability for GKE users, providing improved latency of cluster operations (e.g., cluster startup and upgrades) and a stateless cluster control plane. By implementing the etcd API for our Spanner-based storage, we help ensure backward compatibility and avoid having to make changes in core Kubernetes to adopt the new technology.
In addition, thanks to a major overhaul of the GKE infrastructure that manages the Kubernetes control plane, GKE now scales significantly faster, meeting the demands of your deployments with fewer delays. This enhanced cluster control plane delivers multiple benefits, including the ability to run high-volume operations with exceptional consistency. The control plane now automatically adjusts to these operations, while maintaining predictable operational latencies. This is particularly important for large and dynamic applications such as SaaS, disaster recovery and fallback, batch deployments, and testing environments, especially during periods of high churn.
We’re also constantly innovating on IaaS and GKE capabilities to make Google Cloud the best place to build your AI workloads. Recent innovations in this space include:
Secondary boot disk, which provides faster workload startups through container image caching
Custom compute classes, which offer greater control over compute resource allocation and scaling
Support for Trillium, our sixth-generation TPU, the most performant and most energy-efficient TPU to date
Support for A3 Ultra VM powered by NVIDIA H200 Tensor Core GPUs with our new Titanium ML network adapter, which delivers non-blocking 3.2 Tbps of GPU-to-GPU traffic with RDMA over Converged Ethernet (RoCE). A3 Ultra VMs will be available in preview next month.
A continued commitment to open source
Guided by Google’s long-standing and robust open-source culture, we make substantial contributions to the open-source community, including when it comes to scaling Kubernetes. With support for 65,000-node clusters, we made sure that all necessary optimizations and improvements for such scale are part of the core open-source Kubernetes.
Our investments to make Kubernetes the best foundation for AI platforms go beyond scalability. Here is a sampling of our contributions to the Kubernetes project over the past two years:
Incubated the K8S Batch Working Group to build a community around research, HPC and AI workloads, producing tools like Kueue.sh, which is becoming the de facto standard for job queueing on Kubernetes
Created the JobSet operator that is being integrated into the Kubeflow ecosystem to help run heterogeneous jobs (e.g., driver-executor)
For multihost inference use cases, created the Leader Worker Set controller
Published a highly optimized internal model server of JetStream
Incubated the Kubernetes Serving Working Group, which is driving multiple efforts including model metrics standardization, Serving Catalog and Inference Gateway
At Google Cloud, we’re dedicated to providing the best platform for running containerized workloads, consistently pushing the boundaries of innovation. These new advancements allow us to support the next generation of AI technologies. For more, listen to the Kubernetes podcast, where Maciek Rozacki and Wojtek Tyczynski join host Kaslin Fields to talk about GKE’s support for 65,000 nodes. You can also see a demo on 65,000 nodes on a single GKE cluster here.
Rapidly evolving generative AI models place unprecedented demands on the performance and efficiency of hardware accelerators. Last month, we launched our sixth-generation Tensor Processing Unit (TPU), Trillium, to address the demands of next-generation models. Trillium is purpose-built for performance at scale, from the chip to the system to our Google data center deployments, to power ultra-large scale training.
Today, we present our first MLPerf training benchmark results for Trillium. The MLPerf 4.1 training benchmarks show that Trillium delivers up to 1.8x better performance-per-dollar compared to prior-generation Cloud TPU v5p and an impressive 99% scaling efficiency (throughput).
In this blog, we offer a concise analysis of Trillium’s performance, demonstrating why it stands out as the most performant and cost-efficient TPU training system to date. We begin with a quick overview of system comparison metrics, starting with traditional scaling efficiency. We introduce convergence scaling efficiency as a crucial metric to consider in addition to scaling efficiency. We assess these two metrics along with performance per dollar and present a comparative view of Trillium against Cloud TPU v5p. We conclude with guidance that you can use to make an informed choice for your cloud accelerators.
Traditional performance metrics
Accelerator systems can be evaluated and compared across multiple dimensions, ranging from peak throughput to effective throughput to throughput scaling efficiency. Each of these metrics is a helpful indicator, but none takes convergence time into consideration.
Hardware specifications and peak performance
Traditionally, comparisons have focused on hardware specifications like peak throughput, memory bandwidth, and network connectivity. While these peak values establish theoretical boundaries, they are poor predictors of real-world performance, which depends heavily on architectural design and software implementation. Since modern ML workloads typically span hundreds or thousands of accelerators, the key metric is the effective throughput of an appropriately sized system for specific workloads.
Utilization performance
System performance can be quantified through utilization metrics like effective model FLOPS utilization (EMFU) and memory bandwidth utilization (MBU), which measure achieved throughput versus peak capacity. However, these hardware efficiency metrics don’t directly translate to business-value measures like training time or model quality.
Scaling efficiency and trade-offs
A system’s scalability is evaluated through both strong scaling (performance improvement with system size for fixed workloads) and weak scaling (efficiency when increasing both workload and system size proportionally). While both metrics are valuable indicators, the ultimate goal is to achieve high-quality models quickly, sometimes making it worthwhile to trade scaling efficiency for faster training time or better model convergence.
The need for convergence scaling efficiency
While hardware utilization and scaling metrics provide important system insights, convergence scaling efficiency focuses on the fundamental goal of training: reaching model convergence efficiently. Convergence refers to the point where a model’s output stops improving and the error rate becomes constant. Convergence scaling efficiency measures how effectively additional computing resources accelerate the training process to completion.
We define convergence scaling efficiency using two key measurements: the base case, where a cluster of N₀ accelerators achieves convergence in time T₀, and a scaled case with N₁ accelerators taking time T₁ to converge. The ratio of the speedup in convergence time to the increase in cluster size gives us:
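In formula form, using the definitions above:

$$\text{convergence scaling efficiency} = \frac{T_0 / T_1}{N_1 / N_0}$$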
A convergence scaling efficiency of 1 indicates that time-to-solution improves by the same ratio as the cluster size. It is therefore desirable to have convergence scaling efficiency as close to 1 as possible.
Now let's apply these concepts to understand our MLPerf submission for the GPT3-175b training task using Trillium and Cloud TPU v5p.
Trillium performance
We submitted GPT3-175b training results for four different Trillium configurations, and three different Cloud TPU v5p configurations. In the following analysis, we group the results by cluster sizes with the same total peak flops for comparison purposes. For example, the Cloud TPU v5p-4096 configuration is compared to 4xTrillium-256, and Cloud TPU v5p-8192 is compared with 8xTrillium-256, and so on.
All results presented in this analysis are based on MaxText, our high-performance reference implementation for Cloud TPUs and GPUs.
Weak scaling efficiency
For increasing cluster sizes with proportionately larger batch-sizes, both Trillium and TPU v5p deliver near linear scaling efficiency:
Figure-1: Weak scaling comparison for Trillium and Cloud TPU v5p. v5p-4096 and 4xTrillium-256 are considered as base for scaling factor measurement. n x Trillium-256 corresponds to n Trillium pods with 256 chips in one ICI domain. v5p-n corresponds to n/2 v5p chips in a single ICI domain.
Figure 1 demonstrates relative throughput scaling as cluster sizes increase from the base configuration. Trillium achieves 99% scaling efficiency even when operating across data-center networks using Cloud TPU multislice technology, outperforming the 94% scaling efficiency of Cloud TPU v5p cluster within a single ICI domain. For these comparisons, we used a base configuration of 1024 chips (4x Trillium-256 pods), establishing a consistent baseline with the smallest v5p submission (v5p-4096; 2048 chips). When measured against our smallest submitted configuration of 2x Trillium-256 pods, Trillium maintains a strong 97.6% scaling efficiency.
Convergence scaling efficiency
As stated above, weak scaling is useful but not a sufficient indicator of value, while convergence scaling efficiency brings time-to-solution into consideration.
Figure-2: Convergence scaling comparison for Trillium and Cloud TPU v5p.
For the largest cluster size, we observed comparable convergence scaling efficiency for Trillium and Cloud TPU v5p. In this example, a CSE of 0.8 means that for the rightmost configuration, the cluster size was 3x the (base) configuration, while the time to convergence improved by 2.4x with respect to the base configuration (2.4/3 = 0.8).
While the convergence scaling efficiency is comparable between Trillium and TPU v5p, where Trillium really shines is by delivering the convergence at a lower cost, which brings us to the last metric.
Cost-to-train
While weak scaling efficiency and convergence scaling efficiency indicate scaling properties of systems, we’ve yet to look at the most crucial metric: the cost to train.
Figure-3: Comparison of cost-to-train based on the wall-clock time and the on-demand list price for Cloud TPU v5p and Trillium.
Trillium lowers the cost to train by up to 1.8x (45% lower) compared to TPU v5p while delivering convergence to the same validation accuracy.
Making informed cloud accelerator choices
In this article, we explored the complexities of comparing accelerator systems, emphasizing the importance of looking beyond simple metrics to assess true performance and efficiency. We saw that while peak performance metrics provide a starting point, they often fall short in predicting real-world utility. Instead, metrics like Effective Model Flops Utilization (EMFU) and Memory Bandwidth Utilization (MBU) offer more meaningful insights into an accelerator’s capabilities.
We also highlighted the critical importance of scaling characteristics — both strong and weak scaling — in evaluating how systems perform as workloads and resources grow. However, the most objective measure we identified is the convergence scaling efficiency, which ensures that we’re comparing systems based on their ability to achieve the same end result, rather than just raw speed.
Applying these metrics to our benchmark submission with GPT3-175b training, we demonstrated that Trillium achieves comparable convergence scaling efficiency to Cloud TPU v5p while delivering up to 1.8x better performance per dollar, thereby lowering the cost-to-train. These results highlight the importance of evaluating accelerator systems through multiple dimensions of performance and efficiency.
For ML-accelerator evaluation, we recommend a comprehensive analysis combining resource utilization metrics (EMFU, MBU), scaling characteristics, and convergence scaling efficiency. This multi-faceted approach enables you to make data-driven decisions based on your specific workload requirements and scale.
Every November, we start sharing forward-looking insights on threats and other cybersecurity topics to help organizations and defenders prepare for the year ahead. The Cybersecurity Forecast 2025 report, available today, plays a big role in helping us accomplish this mission.
This year’s report draws on insights directly from Google Cloud’s security leaders, as well as dozens of analysts, researchers, responders, reverse engineers, and other experts on the frontlines of the latest and largest attacks.
Built on trends we are already seeing today, the Cybersecurity Forecast 2025 report provides a realistic outlook of what organizations can expect to face in the coming year. The report covers a lot of topics across all of cybersecurity, with a focus on various threats such as:
Attacker Use of Artificial Intelligence (AI): Threat actors will increasingly use AI for sophisticated phishing, vishing, and social engineering attacks. They will also leverage deepfakes for identity theft, fraud, and bypassing security measures.
AI for Information Operations (IO): IO actors will use AI to scale content creation, produce more persuasive content, and enhance inauthentic personas.
The Big Four: Russia, China, Iran, and North Korea will remain active, engaging in espionage operations, cyber crime, and information operations aligned with their geopolitical interests.
Ransomware and Multifaceted Extortion: Ransomware and multifaceted extortion will continue to be the most disruptive form of cyber crime, impacting various sectors and countries.
Infostealer Malware: Infostealer malware will continue to be a major threat, enabling data breaches and account compromises.
Democratization of Cyber Capabilities: Increased access to tools and services will lower barriers to entry for less-skilled actors.
Compromised Identities: Compromised identities in hybrid environments will pose significant risks.
Web3 and Crypto Heists: Web3 and cryptocurrency organizations will increasingly be targeted by attackers seeking to steal digital assets.
Faster Exploitation and More Vendors Targeted: The time to exploit vulnerabilities will continue to decrease, and the range of targeted vendors will expand.
Be Prepared for 2025
Read the Cybersecurity Forecast 2025 report for a more in-depth look at these and other threats, as well as other security topics such as post-quantum cryptography, and insights unique to the JAPAC and EMEA regions.
For an even deeper look at the threat landscape next year, register for our Cybersecurity Forecast 2025 webinar, which will be hosted once again by threat expert Andrew Kopcienski.
For even more insights, hear directly from our security leaders: Charles Carmakal, Sandra Joyce, Sunil Potti, and Phil Venables.
Amazon DynamoDB now enables customers to easily find frequently used tables in the DynamoDB console in the AWS GovCloud (US) Regions. Customers can favorite their tables in the console’s tables page for quicker table access.
Customers can click the favorites icon to view their favorited tables in the console’s tables page. With this update, customers have a faster and more efficient way to find and work with tables that they often monitor, manage, and explore.
Customers can start using favorite tables at no additional cost. Get started with creating a DynamoDB table from the AWS Management Console.
Today, AWS announced support for a new Apache Flink connector for Amazon DynamoDB. The new connector, contributed by AWS for the Apache Flink open source project, adds Amazon DynamoDB Streams as a new source for Apache Flink. You can now process DynamoDB streams events with Apache Flink, a popular framework and engine for processing and analyzing streaming data.
Amazon DynamoDB is a serverless, NoSQL database service that enables you to develop modern applications at any scale. DynamoDB Streams provides a time-ordered sequence of item-level changes (insert, update, and delete) in a DynamoDB table. With Amazon Managed Service for Apache Flink, you can transform and analyze DynamoDB streams data in real time using Apache Flink and integrate applications with other AWS services such as Amazon S3, Amazon OpenSearch, Amazon Managed Streaming for Apache Kafka, and more. Apache Flink connectors are software components that move data into and out of an Amazon Managed Service for Apache Flink application. You can use the new connector to read data from a DynamoDB stream starting with Apache Flink version 1.19. With Amazon Managed Service for Apache Flink there are no servers and clusters to manage, and there is no compute and storage infrastructure to set up.
The Apache Flink repo for AWS connectors can be found here. For detailed documentation and setup instructions, visit our Documentation Page.
Today, we are excited to announce that Amazon SageMaker Model Registry now supports custom machine learning (ML) model lifecycle stages. This capability further improves model governance by enabling data scientists and ML engineers to define and control the progression of their models across various stages, from development to production.
Customers use Amazon SageMaker Model Registry as a purpose-built metadata store to manage the entire lifecycle of ML models. With this launch, data scientists and ML engineers can now define custom stages such as development, testing, and production for ML models in the model registry. This makes it easy to track and manage models as they transition across different stages in the model lifecycle from training to inference. They can also track stage approval status such as Pending Approval, Approved, and Rejected to check when the model is ready to move to the next stage. These custom stages and approval status help data scientists and ML engineers define and enforce model approval workflows, ensuring that models meet specific criteria before advancing to the next stage. By implementing these custom stages and approval processes, customers can standardize their model governance practices across their organization, maintain better oversight of model progression, and ensure that only approved models reach production environments.
This capability is available in all AWS regions where Amazon SageMaker Model Registry is currently available except GovCloud regions. To learn more, see Staging Construct for your Model Lifecycle.
Starting today, customers can use Amazon Managed Service for Apache Flink in Asia Pacific (Kuala Lumpur) Region to build real-time stream processing applications.
Amazon Managed Service for Apache Flink makes it easier to transform and analyze streaming data in real time with Apache Flink. Apache Flink is an open source framework and engine for processing data streams. Amazon Managed Service for Apache Flink reduces the complexity of building and managing Apache Flink applications and integrates with Amazon Managed Streaming for Apache Kafka (Amazon MSK), Amazon Kinesis Data Streams, Amazon OpenSearch Service, Amazon DynamoDB streams, Amazon Simple Storage Service (Amazon S3), custom integrations, and more using built-in connectors.
For a list of the AWS Regions where Amazon Managed Service for Apache Flink is available, please see the AWS Region Table.
You can learn more about Amazon Managed Service for Apache Flink here.
Today, Amazon Web Services announces that Amazon Elastic Compute Cloud (Amazon EC2) Capacity Blocks for ML is available for P5 instances in two new regions: US West (Oregon) and Asia Pacific (Tokyo). You can use EC2 Capacity Blocks to reserve highly sought-after GPU instances in Amazon EC2 UltraClusters for a future date for the amount of time that you need to run your machine learning (ML) workloads.
EC2 Capacity Blocks enable you to reserve GPU capacity up to eight weeks in advance for durations up to 28 days in cluster sizes of one to 64 instances (512 GPUs), giving you the flexibility to run a broad range of ML workloads. They are ideal for short duration pre-training and fine-tuning workloads, rapid prototyping, and for handling surges in inference demand. EC2 Capacity Blocks deliver low-latency, high-throughput connectivity through colocation in Amazon EC2 UltraClusters.
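As a hedged sketch (instance counts, dates, and offering IDs are placeholders; check the exact CLI parameters against the EC2 documentation), finding and purchasing a Capacity Block looks roughly like this:

```bash
# Search for available Capacity Block offerings for 16 p5.48xlarge instances for 7 days (168 hours).
aws ec2 describe-capacity-block-offerings \
  --instance-type p5.48xlarge \
  --instance-count 16 \
  --capacity-duration-hours 168 \
  --start-date-range 2024-12-01T00:00:00Z \
  --end-date-range 2024-12-15T00:00:00Z \
  --region us-west-2

# Purchase one of the returned offerings by its ID (placeholder offering ID).
aws ec2 purchase-capacity-block \
  --capacity-block-offering-id cbr-0123456789abcdef0 \
  --instance-platform Linux/UNIX \
  --region us-west-2
```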
With this expansion, EC2 Capacity Blocks for ML are available for the following instance types and AWS Regions: P5 instances in US East (N. Virginia), US East (Ohio), US West (Oregon), and Asia Pacific (Tokyo); P5e instances in US East (Ohio); P4d instances in US East (Ohio) and US West (Oregon); Trn1 instances in Asia Pacific (Melbourne).
AWS CodeBuild now supports building Windows Docker images in reserved capacity fleets. AWS CodeBuild is a fully managed continuous integration service that compiles source code, runs tests, and produces software packages ready for deployment.
Additionally, you can bring in your own Amazon Machine Images (AMIs) in reserved capacity for Linux and Windows platforms. This enables you to customize your build environment including building and testing with different kernel modules, for more flexibility.
The feature is now available in US East (N. Virginia), US East (Ohio), US West (Oregon), South America (Sao Paulo), Asia Pacific (Singapore), Asia Pacific (Tokyo), Asia Pacific (Sydney), Asia Pacific (Mumbai), Europe (Ireland), and Europe (Frankfurt) where reserved capacity fleets are supported.
Today, AWS announces the availability of a new financing program supported by PNC Vendor Finance, enabling select customers in the United States (US) to finance AWS Marketplace software purchases directly from the AWS Billing and Cost Management console. For the first time, select US customers can apply for, utilize, and manage financing within the console for AWS Marketplace software purchases.
AWS Marketplace helps customers find, try, buy, and launch third-party software, while consolidating billing and management with AWS. With thousands of software products available in AWS Marketplace, this financing program enables you to buy the software you need to drive innovation. With financing amounts ranging from $10,000 – $100,000,000, subject to credit approval, you have more options to pay for your AWS Marketplace purchases. If approved, you can utilize financing for AWS Marketplace software purchases that have at least 12-month contracts. Financing can be applied to multiple purchases from multiple AWS Marketplace sellers. This financing program gives you the flexibility to better manage your cash flow by spreading payments over time, while only paying financing cost on what you use.
This new financing program supported by PNC Vendor Finance is available in the AWS Billing and Cost Management console for select AWS Marketplace customers in the US, excluding NV, NC, ND, TN, & VT.
To learn more about financing options for AWS Marketplace purchases and details about the financing program supported by PNC Vendor Finance, visit the AWS Marketplace financing page.
Today, Amazon announced the availability of detailed performance statistics for Amazon Elastic Block Store (EBS) volumes. This new capability provides you with real-time visibility into the performance of your EBS volumes, making it easier to monitor the health of your storage resources and take actions sooner.
With detailed performance statistics, you can access 11 metrics at up to a per-second granularity to monitor input/output (I/O) statistics of your EBS volumes, including driven I/O and I/O latency histograms. The granular visibility provided by these metrics helps you quickly identify and proactively troubleshoot application performance bottlenecks that may be caused by factors such as reaching an EBS volume’s provisioned IOPS or throughput limits, enabling you to enhance application performance and resiliency.
Detailed performance statistics for EBS volumes are available by default for all EBS volumes attached to a Nitro-based EC2 instance in all AWS Commercial, China, and the AWS GovCloud (US) Regions, at no additional charge.
To get started with EBS detailed performance statistics, please visit the documentation here to learn more about the available metrics and how to access them using NVMe-CLI tools.
Amazon Neptune Serverless is now available in the Europe (Paris), South America (Sao Paulo), Asia Pacific (Jakarta), Asia Pacific (Mumbai), Asia Pacific (Hong Kong), and Asia Pacific (Seoul) AWS Regions.
Amazon Neptune is a fast, reliable, and fully managed graph database service for building and running applications with highly connected datasets, such as knowledge graphs, fraud graphs, identity graphs, and security graphs. If you have unpredictable and variable workloads, Neptune Serverless automatically determines and provisions the compute and memory resources to run the graph database. Database capacity scales up and down based on the application’s changing requirements to maintain consistent performance, saving up to 90% in database costs compared to provisioning at peak capacity.
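A minimal sketch of creating a Neptune Serverless cluster in one of the newly supported Regions (identifiers and capacity bounds are placeholders; required networking parameters are omitted here, so check the Neptune documentation before running):

```bash
# Create a Neptune cluster with serverless scaling between 1 and 32 NCUs (Seoul Region shown).
aws neptune create-db-cluster \
  --db-cluster-identifier my-serverless-graph \
  --engine neptune \
  --serverless-v2-scaling-configuration MinCapacity=1,MaxCapacity=32 \
  --region ap-northeast-2

# Add a serverless instance to the cluster.
aws neptune create-db-instance \
  --db-instance-identifier my-serverless-graph-instance-1 \
  --db-cluster-identifier my-serverless-graph \
  --db-instance-class db.serverless \
  --engine neptune \
  --region ap-northeast-2
```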
With today’s launch, Neptune Serverless is available in 19 AWS Regions. For pricing and region availability, please visit the Neptune pricing page.