As open-source large language models (LLMs) become increasingly popular, developers are looking for better ways to access new models and deploy them on Cloud Run GPU. That’s why Cloud Run now offers fully managed NVIDIA GPUs, which removes the complexity of driver installations and library configurations. This means you’ll benefit from the same on-demand availability and effortless scalability that you love with Cloud Run’s CPU and memory, with the added power of NVIDIA GPUs. When your application is idle, your GPU-equipped instances automatically scale down to zero, optimizing your costs.
In this blog post, we'll guide you through deploying the Meta Llama 3.2 1B Instruction model on Cloud Run. We'll also share best practices to streamline your development process using local model testing with the Text Generation Inference (TGI) Docker image, making troubleshooting easy and boosting your productivity.
Why Cloud Run with GPU?
There are four critical reasons developers benefit from deploying open models on Cloud Run with GPU:
Fully managed: No need to worry about drivers, libraries, or infrastructure.
On-demand scaling: Scale up or down automatically based on demand.
Cost effective: Only pay for what you use, with automatic scaling down to zero when idle.
Performance: NVIDIA GPUs deliver optimized performance for Meta Llama 3.2.
Initial Setup
First, create a Hugging Face token.
Second, check that your Hugging Face token has permission to access and download the Llama 3.2 model weights. Keep your token handy for the next step.
Third, use Google Cloud's Secret Manager to store your Hugging Face token securely. In this example, we will be using Google user credentials. You may need to authenticate with the gcloud CLI, set a default project ID, enable the necessary APIs, and grant access to Secret Manager and Cloud Storage.
# Authenticate CLI
gcloud auth login

# Set default project
gcloud config set project <your_project_id>

# Create new secret key, remember to update <your_huggingface_token>
gcloud secrets create HF_TOKEN --replication-policy="automatic"
echo -n <your_huggingface_token> | gcloud secrets versions add HF_TOKEN --data-file=-

# Retrieve the key
HF_TOKEN=$(gcloud secrets versions access latest --secret="HF_TOKEN")
Local debugging
Install the huggingface_hub Python package (which provides the huggingface-cli tool) in your virtual environment.
Run huggingface-cli login to set up a Hugging Face credential.
Use the TGI Docker image to test your model locally. This allows you to iterate and debug your model locally before deploying it to Cloud Run.
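For a quick smoke test once the TGI container is running locally, you can send a request to its REST endpoint from Python. This is a minimal sketch that assumes the container's port is mapped to localhost:8080; adjust the URL, prompt, and parameters to your setup.

import requests

# Query the locally running TGI server's /generate endpoint.
response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "What is Cloud Run?",
        "parameters": {"max_new_tokens": 64},
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["generated_text"])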
Now, we will create a new Cloud Run service using the deployment script as follows. (Remember to update BUCKET_NAME.) You may also need to update the network and subnet names.
New solutions, old problems. Artificial intelligence (AI) and large language models (LLMs) are here to signal a new day in the cybersecurity world, but what does that mean for us—the attackers and defenders—and our battle to improve security through all the noise?
Data is everywhere. For most organizations, the access to security data is no longer the primary issue. Rather, it is the vast quantities of it, the noise in it, and the disjointed and spread-out nature of it. Understanding and making sense of it—THAT is the real challenge.
When we conduct adversarial emulation (red team) engagements, making sense of all the network, user, and domain data available to us is how we find the path forward. From a defensive perspective, efficiently finding the sharpest and most dangerous needles in the haystack—for example, easily accessible credentials on fileshares—is how we prioritize, improve, and defend.
How do you make sense of this vast amount of structured and unstructured data, and give yourself the advantage?
Data permeates the modern organization. This data can be challenging to parse, process, and understand from a security implication perspective, but AI might just change all that.
This blog post will focus on a number of case studies where we obtained data during our complex adversarial emulation engagements with our global clients, and how we innovated using AI and LLM systems to process this into structured data that could be used to better defend organizations. We will showcase the lessons learned and key takeaways for all organizations and highlight other problems that can be solved with this approach for both red and blue teams.
Approach
Data parsing and understanding is one of the biggest early benefits of AI. We have seen many situations where AI can help process data at a fast rate. Throughout this post, we use an LLM to process unstructured data, meaning that the data did not have a structure or format that we knew about before parsing the data.
If you want to try these examples out yourself, please make sure you use either a local model, or you have permission to send the data to an external service.
Getting Structured Data Out of an LLM
Step one is to get the data into a format we can use. If you have ever used an LLM, you will have noticed that it outputs prose or story-like text, especially in chat-based versions. For a lot of use cases this is fine; however, we want to analyze the data, so we need structured output. Thus, the first problem we have to solve is getting the LLM to output data in a format we can specify. The simple method is to ask the LLM to output the data in a machine-readable format like JSON, XML, or CSV. However, you will quickly notice that you have to be quite specific about the data format, and the LLM can easily output data in another format, ignoring your instructions.
Luckily for us, other people have encountered this problem and have solved it with something called Guardrails. One of the projects we have found is called guardrails-ai. It is a Python library that allows you to create guardrails—specific requirements—for a model based on Pydantic.
To illustrate, take a simple Python class from the documentation to validate a pet from the output of the LLM:
from pydantic import BaseModel, Field

class Pet(BaseModel):
    pet_type: str = Field(description="Species of pet")
    name: str = Field(description="a unique pet name")
You can use the following code from the Guardrails documentation to process the output of the LLM into a structured object:
from guardrails import Guard
import openai

prompt = """
What kind of pet should I get and what should I name it?
${gr.complete_json_suffix_v2}
"""

# Build a guard that constrains the LLM output to the Pet schema.
guard = Guard.from_pydantic(output_class=Pet, prompt=prompt)

raw_output, validated_output, *rest = guard(
    llm_api=openai.completions.create,
    engine="gpt-3.5-turbo-instruct"
)

# validated_output is a dict that conforms to the Pet model.
print(validated_output)
If we look at what this library generates under the hood for this prompt, we see that it adds a structured-output section with instructions for the LLM to output data in a specific way. This streamlines the way you can get structured data from an LLM.
Figure 1: The generated prompt from the Pydantic model
For each of the following use cases, we will show the Pydantic models we've created to process the output.
Red Team Use Cases
The next sections contain some use cases where we can use an LLM to get structured data out of data obtained. The use cases are divided into three categories of the attack lifecycle:
Initial Reconnaissance
Escalate Privileges
Internal Reconnaissance
Figure 2: Attack lifecycle
Initial Reconnaissance
Open Source Intelligence (OSINT) is an important part of red teaming. It includes gathering data about the target organization from news articles, social media, and corporate reports.
This information can then be used in other red team phases, such as phishing. For defenders, it helps to understand which parts of their organization are exposed to the internet, anticipating a possible future attack. In the next use case, we process social media information to extract job roles and other useful details.
Use Case 1: Social Media Job Functions Information
During OSINT, we often try to get information from employees about their function in their company. This helps with performing phishing attacks, as we do not want to target IT professionals, especially those that work in cybersecurity.
Social media sites allow their users to write about their job titles in a free format. This means that the information is unstructured and can be written in any language and any format.
We can try to extract the information from the title with simple matches; however, because the users can fill in anything and in any language, this problem can be better solved with an LLM.
Data Model
First, we create a Pydantic model for the Guardrail:
class RoleOutput(BaseModel):
    role: str = Field(description="Role being analyzed")
    it: bool = Field(description="The role is related to IT")
    cybersecurity: bool = Field(description="The role is related to CyberSecurity")
    experience_level: str = Field(
        description="Experience level of the role.",
    )
This model has two Boolean fields indicating whether the role is IT or cybersecurity related. Additionally, we would like to know the experience level of the role.
Prompt
Next, let’s create a prompt to instruct the LLM to extract the requested information from the role. This prompt is quite simple and just asks the LLM to fill in the data.
Given the following role, answer the following questions.
If the answer doesn't exist in the role, enter ``.
${role}
${gr.complete_xml_suffix_v2}
The last two lines are placeholders used by guardrails-ai.
Results
To test the models, we scraped the titles that employees use on social media; the dataset contained 235 entries. For testing, we used the gemini-1.0-pro model.
Gemini managed to parse 232 entries. The results are shown in Table 1.
                                            Not IT    IT    Cybersecurity
Gemini                                         183    49                5
Manual evaluation by a red team operator       185    47                5
False positive                                   1     3                0
Table 1: Results of Gemini parsing 232 job title entries
In the end, Gemini processed the roles roughly on par with a human. Most of the false positives were questionable, as it was not very clear whether the role was actually IT related. The experience-level field did not perform well: the model deemed the experience level "unknown" or "none" for most entries. To resolve this, the field was changed so that the experience level is a number from 1 to 10. Running the analysis again yielded better results for the experience level. The lowest experience levels (1-4) contained function titles like "intern," "specialist," or "assistant," which usually indicated that the person had been in that role for a shorter period of time. The updated data model is shown as follows:
class RoleOutput(BaseModel):
    role: str = Field(description="Role being analyzed")
    it: bool = Field(description="The role is related to IT")
    cybersecurity: bool = Field(description="The role is related to CyberSecurity")
    experience_level: int = Field(
        description="Estimate of the experience level of the role on a scale of 1-10. Where 1 is low experience and 10 is high.",
    )
This approach helped us sort through a large dataset of phishing targets by identifying employees who did not have IT or cybersecurity roles and ordering them by experience level (see the sketch below). This can speed up target selection for large organizations and may allow us to better emulate attackers by changing the prompts or selection criteria. Defending against this kind of data analysis is more difficult. In theory, you can instruct all your employees to include "Cybersecurity" in their role, but that does not scale well or solve the underlying phishing problem. In our experience, the best approach to phishing is to invest in phishing-resistant multifactor authentication (MFA) and application allowlisting. If applied well, these solutions can mitigate phishing attacks as an initial access vector.
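As a concrete illustration of that target-selection step, here is a minimal sketch. It assumes parsed_roles is a list of RoleOutput objects produced with the updated model shown above (integer experience_level); the function name is illustrative, not part of our tooling.

def select_phishing_targets(parsed_roles: list[RoleOutput]) -> list[RoleOutput]:
    # Drop IT and cybersecurity staff, then rank the remaining roles
    # by estimated experience level (lowest first).
    candidates = [r for r in parsed_roles if not r.it and not r.cybersecurity]
    return sorted(candidates, key=lambda r: r.experience_level)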
Escalate Privileges
Once attackers establish a foothold in an organization, one of their first acts is often to improve their level of access or control through privilege escalation. There are quite a few methods for this, ranging from local, system-based techniques to wider, domain-wide ones; some are based on exploits or misconfigurations, and others on finding sensitive information while searching through files.
Our focus will be on the final aspect, which aligns with our challenge of identifying the desired information within the vast amount of data, like finding a needle in a haystack.
Use Case 2: Credentials in Files
After gaining initial access to the target network, one of the more common enumeration methods employed by attackers is to perform share enumeration and try to locate interesting files. There are quite a few tools that can do this, such as Snaffler.
After you identify files that potentially contain credentials, you can go through them manually to find useful ones. However, if you do this in a large organization, there is a chance that you will have hundreds to thousands of hits. In that case, there are some tools that can help with finding and classifying credentials like TruffleHog and Nosey Parker. Additionally, the Python library detect-secrets can help with this task.
Most of these tools look for common patterns or file types that they understand. To cover unknown file types or credentials in emails or other formats, it might instead be valuable to use an LLM to analyze the files to find any unknown or unrecognized formats.
Technically, we can just run all tools and use a linear regression model to combine the results into one. An anonymized example of a file with a password that we encountered during our tests is shown as follows:
@Echo Off
Net Use /Del * /Yes
Set /p Path=<"path.txt"
Net Use %Path% Welcome01@ /User:CHAOS.LOCAL\WorkstationAdmin
If Not Exist "C:\Data" MKDIR "C:\Data"
Copy %Path%\. C:\Data
Timeout 02
Data Model
We used the following Python classes to instruct Gemini to retrieve credentials with an optional domain. One file can contain multiple credentials, so we use a list of credentials to instruct Gemini to optionally retrieve multiple credentials from one file.
class Credential(BaseModel):
    password: str = Field(description="Potential password of an account")
    username: str = Field(description="Potential username of an account")
    domain: Optional[str] = Field(
        description="Optional domain of an account", default=""
    )

class ListOfCredentials(BaseModel):
    credentials: list[Credential] = []
Prompt
In the prompt, we give some examples of what kind of systems we are looking for, and output into JSON once again:
Given the following file, check if there are credentials in the file.
Only include results if there is at least one username and password.
If the domain doesn't exist in the file, enter `` as a default value.
${file}
${gr.complete_xml_suffix_v2}
Results
We tested on 600 files, where 304 contain credentials and 296 do not. Testing occurred with the gemini-1.5 model. Each file took about five seconds to process.
To compare results with other tools, we also tested Nosey Parker and TruffleHog. Both are built to find credentials in files, including repositories, in a structured way; their typical use case covers known file formats and randomly structured files.
The results are summarized in Table 2.
Tool            True Negative    False Positive    False Negative    True Positive
Nosey Parker    284 (47%)        12 (2%)           136 (23%)         168 (28%)
TruffleHog      294 (49%)        2 (<1%)           180 (30%)         124 (21%)
Gemini          278 (46%)        18 (3%)           23 (4%)           281 (47%)
Table 2: Results of testing for credentials in files, where 304 contain them and 296 do not
In this context, the definitions of true negative, false positive, false negative, and true positive are as follows:
True Negative: A file does not contain any credentials, and the tool correctly indicates that there are no credentials.
False Positive: The tool incorrectly indicates that a file contains credentials when it does not.
False Negative: The tool incorrectly indicates that a file does not contain any credentials when it does.
True Positive: The tool correctly indicates that a file contains credentials.
In conclusion, Gemini finds the most files with credentials, at the cost of a slightly higher false positive rate. TruffleHog has the lowest false positive rate, but also finds the fewest true positives. This is to be expected, as a higher true positive rate is usually accompanied by a higher false positive rate. The current dataset has almost an equal number of files with and without credentials; in real-world scenarios this ratio can differ wildly, which means that the false positive rate remains important even though the percentages here are quite close.
To optimize this approach, you can use all three tools, combine the output signals to a single signal, and then sort the potential files based on this combined signal.
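One way to do that, as a minimal sketch: treat each tool's per-file verdict as a signal and rank files by a weighted sum. The weights and the findings mapping below are illustrative placeholders, not tuned values from our tests.

def combined_score(nosey_parker_hit: bool, trufflehog_hit: bool, gemini_hit: bool) -> float:
    # Illustrative weights; in practice they could be fit with a simple
    # (logistic) regression against labeled files.
    return 1.0 * nosey_parker_hit + 1.5 * trufflehog_hit + 1.0 * gemini_hit

def rank_files(findings: dict[str, tuple[bool, bool, bool]]) -> list[str]:
    # findings maps file path -> (nosey_parker_hit, trufflehog_hit, gemini_hit).
    return sorted(findings, key=lambda path: combined_score(*findings[path]), reverse=True)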
Defenders can, and should, use the same techniques previously described to enumerate the internal file shares and remove or limit access to files that contain credentials. Make sure to check what file shares each server and workstation exposes to the network, because in some cases file shares are exposed accidentally or were forgotten about.
Internal Reconnaissance
When attackers have gained a better position in the network, the next step in their playbooks is understanding the domain in which they have landed so they can construct a path to their ultimate goal. This could be full domain control or access to specific systems or users, depending on the threat actor’s mission. From a red team perspective, we need to be able to emulate this. From a defender’s perspective, we need to find these paths before the attackers exploit them.
The main tool that red teamers use to analyze Active Directory is BloodHound, which uses a graph database to find paths in the Active Directory. BloodHound is executed in two steps. First, an ingester retrieves the data from the target Active Directory. Second, this data is ingested and analyzed by BloodHound to find attack paths.
Some tools that can gather data to be used in BloodHound are:
SharpHound
BloodHound.py
RustHound
ADExplorer
BOFHound
SOAPHound
These tools gather data from the Active Directory and other systems and output it in a format that BloodHound can read. In theory, if we have all the information about the network in the graph, then we can just query the graph to figure out how to achieve our objective.
To improve the data in BloodHound, we have thought of additional use cases. Use Case 3 is about finding high-value systems. Discovering more hidden edges in BloodHound is part of Use Case 4 and Use Case 5.
Use Case 3: High-Value Target Detection in Active Directory
By default, BloodHound deems some groups and computers as high value. One of the main activities in internal reconnaissance is figuring out which systems in the client’s network are high-value targets. Some examples of systems that we are interested in, and that can lead to domain compromise, are:
Backup systems
SCCM
Certificate services
Exchange
WSUS systems
There are many ways to indicate which servers are used for a certain function, and it depends on how the IT administrators have configured it in their domain. There are some fields that may contain data in various forms to indicate what the system is used for. This is a prime example of unstructured data that might be analyzable with an LLM.
The following fields in the Active Directory might contain the relevant information:
Name
sAMAccountName
Description
distinguishedName
SPNs
Data Model
In the end, we would like to have a list of names of the systems the LLM has deemed high value. During development, we noticed that LLM results improved dramatically if you asked it to specify a reason. Thus, our Pydantic model looks like this:
class HighValueSystem(BaseModel):
    name: str = Field(description="Name of this system")
    reason: str = Field(description="Reason why this system is high value", default="")

class HighValueResults(BaseModel):
    systems: list[HighValueSystem] = Field(description="high value systems", default=[])
Prompt
In the prompt, we give some examples of what kind of systems we are looking for:
Given the data, identify which systems are high value targets,
look for: sccm servers, jump systems, certificate systems, backup
systems and other valuable systems. Use the first (name) field to
identify the systems.
Results
We tested this prompt on a dataset of 400 systems and executed it five times. All systems were sent in one query to the model. To accommodate this, we used the gemini-1.5 model because of its large context window. Here are some examples of reasons Gemini provided, and what we think each reason was based on:
Domain controller: Looks like this was based on the “OU=Domain Controllers” distinguishedname field of BloodHound
Jumpbox: Based on the “OU=Jumpboxes,OU=Bastion Servers” distinguishedname
Lansweeper: Based on the description field of the computer
Backup Server: Based on “OU=Backup Servers” distinguishedname
Some of the high-value targets are valid yet already known, like domain controllers. Others are good finds, like the jumpbox and backup servers. This method can process system names in other languages and more verbose descriptions of systems to determine which systems may be high value. Additionally, this method can be adapted to allow for a more specific query, for example one that suits a different client environment:
Given the data, identify which systems are related to
SWIFT. Use the first (name) field to identify the systems.
In this case, the LLM will look for SWIFT servers and may save you some time searching for it manually. This approach can potentially be even better when you combine this data with internal documentation to give you results, even if the Active Directory information is lacking any information about the usage of the system.
For defenders, there are some ways to deal with this situation:
Limit the amount of information in the Active Directory and put the system descriptions in your documentation instead of within the Active Directory
Limit the amount of information a regular user can retrieve from the Active Directory
Monitor LDAP queries to see if a large amount of data is being retrieved from LDAP
Use Case 4: User Clustering
After gaining an initial strong position and understanding the systems in the network, attackers will often need to find the right users to compromise to gain further privileges in the domain. For defenders, legacy user accounts and administrators with too many rights are a common security issue.
Administrators often have multiple user accounts: one for normal operations like reading email and using it on their workstations, and one or multiple administrator accounts. This separation is done to make it harder for attackers to compromise the administrator account.
There are some common flaws in the implementations that sometimes make it possible to bypass these separations. Most of the methods require the attacker to cluster the users together to see which accounts belong to the same employee. In many cases, this can be done by inspecting the Active Directory objects and searching for patterns in the display name, description, or other fields. To automate this, we tried to find these patterns with Gemini.
Data Model
For this use case, we would like the names of the accounts that Gemini clusters together. During initial testing, the results were quite random. However, after adding a "reason" field, the results improved dramatically. So we used the following Pydantic model:
class User(BaseModel):
    accounts: list[Account] = Field(
        description="accounts that probably belongs to this user", default=[]
    )
    reason: str = Field(
        description="Reason why these accounts belong to this user", default=""
    )

class UserAccountResults(BaseModel):
    users: list[User] = Field(description="users with multiple accounts", default=[])
Prompt
In the prompt, we describe the kind of account clusters we are looking for:
Given the data, cluster the accounts that belong to a single person
by checking for similarities in the name, displayname and sam.
Only include results that are likely to be the same user. Only include
results when there is a user with multiple accounts. It is possible
that a user has more than two accounts. Please specify a reason
why those accounts belong to the same user. Use the first (name)
field to identify the accounts.
Results
The test dataset had about 900 users. We manually determined that some users have two to four accounts with various permissions. Some of these accounts followed an obvious pattern, like "user@test.local" and "adm-user@test.local." Other accounts followed patterns where the admin account was based on the first couple of letters of the names. For example, a main account might be matthijs.gielen@test.local, while the admin account was named adm-magi@test.local. To keep track of those accounts, the description of the admin account contained text similar to "admin account of Matthijs Gielen."
With this prompt, Gemini managed to cluster 50 groups of accounts in our dataset. After manual verification, some of the results were discarded because the cluster contained only one account, leaving 43 correct clusters of accounts. We found the same correlations manually; however, where Gemini produced this information in a couple of minutes, analyzing and correlating all accounts by hand took quite a bit longer. This information was used in preparation for further attacks, as shown in the next use case.
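A minimal sketch of that filtering step, assuming results is the UserAccountResults object returned by the guard:

# Keep only clusters that actually group multiple accounts together.
clusters = [user for user in results.users if len(user.accounts) > 1]
print(f"{len(clusters)} clusters with more than one account")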
Use Case 5: Correlation Between Users and Their Machines
Knowing which users to target or defend is often not enough. We also need to find them within the network in order to compromise them. Domain administrators are (usually) physical people; they need somewhere to type in their commands and perform administrative actions. This means that we need to correlate which domain administrator is working from which workstation. This is called session information, and BloodHound uses this information in an edge called “HasSession.”
In the past, it was possible to get all session information with a regular user during red teaming.
Using the technique in Use Case 4, we can correlate the different user accounts that one employee may have. The next step is to figure out which workstation belongs to that employee. Then we can target that workstation, and from there, hopefully recover the passwords of their administrator accounts.
In this case, employees have corporate laptops, and the company needs to keep track of which laptop belongs to which employee. Often this information is stored in one of the fields of the computer object in the Active Directory. However, there are many ways to do this, and using Gemini to parse the unstructured data is one such example.
Data Model
This model is quite simple: we just want to correlate machines to their users, and we have Gemini give a reason to improve the output of the model. Because we send all users and all computers at once, we need a list of results.
class UserComputerCorrelation(BaseModel):
    user: str = Field(description="name of the user")
    computer: str = Field(description="name of the computer")
    reason: str = Field(
        description="Reason why these accounts belong to this user",
        default=""
    )

class CorrelationResults(BaseModel):
    results: list[UserComputerCorrelation] = Field(
        description="users and computers that correlate", default=[]
    )
Prompt
In the prompt, we describe the user-to-computer correlation we are looking for:
Given the two data sets, find the computer that correlates
to a user by checking for similarities in the name, displayname
and sam. Only include results that are likely to correspond.
Please specify a reason why that user and computer correlates.
Use the first (name) field to identify the users and computers.
Results
The dataset contains around 900 users and 400 computers. During the assignment, we determined that the administrators linked users to their machines via the description field of the computer, which roughly matched the display name of the user. Gemini picked up on this connection, correctly correlating around 120 users to their respective laptops (Figure 3).
Figure 3: Connections between user and laptop as correlated by Gemini
Gemini helped us to select an appropriate workstation, which enabled us to perform lateral movement to a workstation and obtain the password of an administrator, getting us closer to our goal.
To defend against these threats, it can be valuable to run tools like BloodHound in the network. As discussed, BloodHound might not find all the “hidden” edges in your network, but you can add these yourself to the graph. This will allow you to find more Active Directory-based attack paths that are possible in your network and mitigate these before an attacker has an opportunity to exploit those attack paths.
Conclusion
In this blog post, we looked at processing red team data using LLMs to aid in adversarial emulation or improving defenses. These use cases were related to processing human-generated, unstructured data. Table 3 summarizes the results.
Roles
  Accuracy of the results: High - there were a few false positives that fell in a gray area.
  Usefulness: High - especially when going through a large list of user roles, this approach provides fairly fast results.

Credentials in files
  Accuracy of the results: High - found more credentials than comparable tools. More testing should look into the false-positive rate in real scenarios.
  Usefulness: Medium - this approach finds a lot more results; however, processing with Gemini is a lot slower (five seconds per file) than many other alternatives.

High-value targets
  Accuracy of the results: Medium - not all results were new, nor were all of them high-value targets.
  Usefulness: Medium - some of the results were useful; however, all of them still require manual verification.

Account clustering
  Accuracy of the results: High - after discounting the clusters with only one account, the remaining ones were well clustered.
  Usefulness: High - clustering users is usually a tedious manual process. It gives fairly reliable results if you filter out the clusters with only one account.

Computer correlation
  Accuracy of the results: High - all results correctly correlated users to their computers.
  Usefulness: High - this approach produces accurate results, potentially providing insights into additional possible attack paths.
Table 3: The results of our experiments of data processing with Gemini
As the results show, using an LLM like Gemini can help in converting this type of data into structured data to aid attackers and defenders. However, keep in mind that LLMs are not a silver bullet and have limitations. For example, they can sometimes produce false positives or be slow to process large amounts of data.
There are quite a few use cases we have not covered in this blog post. Some other examples where you can use this approach are:
Correlating user groups to administrator privileges on workstations and servers
Summarizing internal website content or documentation to search for target systems
Ingesting documentation to generate password candidates for cracking passwords
The Future
These are just the initial steps that we on the Advanced Capabilities team of the Mandiant Red Team have explored so far in using LLMs for adversarial emulation and defense. As next steps, we know the models can be improved by testing variations of the prompts, and other data sources can be investigated to see if Gemini can help analyze them. We are also looking at using linear regression models, as well as clustering and pathfinding algorithms, to enable cybersecurity practitioners to quickly evaluate attack paths that may exist in a network.
Managing applications across multiple Kubernetes clusters is complex, especially when those clusters span different environments or even cloud providers. One powerful and secure solution combines Google Kubernetes Engine (GKE) fleets and Argo CD, a declarative, GitOps continuous delivery tool for Kubernetes. The solution is further enhanced with Connect Gateway and Workload Identity.
This blog post guides you in setting up a robust, team-centric multi-cluster infrastructure with these offerings. We use a sample GKE fleet with application clusters for your workloads and a control cluster to host Argo CD. To streamline authentication and enhance security, we leverage Connect Gateway and Workload Identity, enabling Argo CD to securely manage clusters without the need to manage cumbersome Kubernetes service accounts.
On top of this, we incorporate GKE Enterprise Teams to manage access and resources, helping to ensure that each team has the right permissions and namespaces within this secure framework.
Finally, we introduce the fleet-argocd-plugin, a custom Argo CD generator designed to simplify cluster management within this sophisticated setup. This plugin automatically imports your GKE Fleet cluster list into Argo CD and maintains synchronized cluster information, making it easier for platform admins to manage resources and for application teams to focus on deployments.
Follow along as we:
Create a GKE fleet with application and control clusters
Deploy Argo CD on the control cluster, configured to use Connect Gateway and Workload Identity
Configure GKE Enterprise Teams for granular access control
Install and leverage the fleet-argocd-plugin to manage your secure, multi-cluster fleet with team awareness
By the end, you’ll have a powerful and automated multi-cluster system using GKE Fleets, Argo CD, Connect Gateway, Workload Identity, and Teams, ready to support your organization’s diverse needs and security requirements. Let’s dive in!
Set up multi-cluster infrastructure with GKE fleet and Argo CD
Setting up a sample GKE fleet is a straightforward process:
1. Enable the required APIs in the desired Google Cloud Project. We use this project as the fleet host project.
a. gcloud SDK must be installed, and you must be authenticated via gcloud auth login.
# Create a frontend team.
gcloud container fleet scopes create frontend

# Add your application clusters to the frontend team.
gcloud container fleet memberships bindings create app-cluster-1-b \
    --membership app-cluster-1 \
    --scope frontend \
    --location us-central1

gcloud container fleet memberships bindings create app-cluster-2-b \
    --membership app-cluster-2 \
    --scope frontend \
    --location us-central1

# Create a fleet namespace for webserver.
gcloud container fleet scopes namespaces create webserver --scope=frontend

# [Optional] Verify your fleet team setup.
# Check member clusters in your fleet.
gcloud container fleet memberships list
# Verify member clusters have been added to the right team (`scope`).
gcloud container fleet memberships bindings list --membership=app-cluster-1 --location=us-central1
gcloud container fleet memberships bindings list --membership=app-cluster-2 --location=us-central1
4. Now, set up Argo CD and deploy it to the control cluster. Create a new GKE cluster as your control cluster and enable Workload Identity on it.
5. Install the Argo CD CLI to interact with the Argo CD API server. Version 2.8.0 or higher is required. Detailed installation instructions can be found via the CLI installation documentation.
Now you’ve got your GKE fleet up and running, and you’ve installed Argo CD on the control cluster. In Argo CD, application clusters are registered with the control cluster by storing their credentials (like API server address and authentication details) as Kubernetes Secrets within the Argo CD namespace. We’ve got a way to make this whole process a lot easier!
8. To make sure the fleet-argocd-plugin works as it should, give it the right permissions for fleet management.
a. Create an IAM service account in your Argo CD control cluster and grant it the appropriate permissions. The setup follows the official onboarding guide of GKE Workload Identity Federation.
gcloud iam service-accounts create argocd-fleet-admin \
    --project=$FLEET_PROJECT_ID

gcloud projects add-iam-policy-binding $FLEET_PROJECT_ID \
    --member "serviceAccount:argocd-fleet-admin@$FLEET_PROJECT_ID.iam.gserviceaccount.com" \
    --role "roles/container.developer"

gcloud projects add-iam-policy-binding $FLEET_PROJECT_ID \
    --member "serviceAccount:argocd-fleet-admin@$FLEET_PROJECT_ID.iam.gserviceaccount.com" \
    --role "roles/gkehub.gatewayEditor"

gcloud projects add-iam-policy-binding $FLEET_PROJECT_ID \
    --member "serviceAccount:argocd-fleet-admin@$FLEET_PROJECT_ID.iam.gserviceaccount.com" \
    --role "roles/gkehub.viewer"

# Allow ArgoCD application controller and fleet-argocd-plugin to impersonate this IAM service account.
gcloud iam service-accounts add-iam-policy-binding argocd-fleet-admin@$FLEET_PROJECT_ID.iam.gserviceaccount.com \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:$FLEET_PROJECT_ID.svc.id.goog[argocd/argocd-application-controller]"

gcloud iam service-accounts add-iam-policy-binding argocd-fleet-admin@$FLEET_PROJECT_ID.iam.gserviceaccount.com \
    --role roles/iam.workloadIdentityUser \
    --member "serviceAccount:$FLEET_PROJECT_ID.svc.id.goog[argocd/argocd-fleet-sync]"

# Annotate the Kubernetes ServiceAccount so that GKE sees the link between the service accounts.
kubectl annotate serviceaccount argocd-application-controller \
    --namespace argocd \
    iam.gke.io/gcp-service-account=argocd-fleet-admin@$FLEET_PROJECT_ID.iam.gserviceaccount.com
b. You also need to allow the Google Compute Engine service account to access images from your artifacts repository.
Let’s do a quick check to make sure the GKE fleet and Argo CD are playing nicely together. You should see that the secrets for your application clusters have been automatically generated.
kubectl get secret -n argocd

# Example Output:
# NAME                                     TYPE     DATA   AGE
# app-cluster-1.us-central1.141594892609   Opaque   3      64m
# app-cluster-2.us-central1.141594892609   Opaque   3      64m
Demo 1: Automatic fleet management in Argo CD
Okay, let’s see how this works! We’ll use the guestbook example app. First, we deploy it to the clusters that the frontend team uses. You should then see the guestbook app running on your application clusters, and you won’t have to deal with any cluster secrets manually!
export TEAM_ID=frontend
envsubst '$FLEET_PROJECT_NUMBER $TEAM_ID' < applicationset-demo.yaml | kubectl apply -f - -n argocd

kubectl config set-context --current --namespace=argocd
argocd app list -o name
# Example Output:
# argocd/app-cluster-1.us-central1.141594892609-webserver
# argocd/app-cluster-2.us-central1.141594892609-webserver
Demo 2: Evolving your fleet is easy with fleet-argocd-plugin
Suppose you decide to add another cluster to the frontend team. Create a new GKE cluster and assign it to the frontend team. Then, check to see if your guestbook app has been deployed on the new cluster.
gcloud container clusters create app-cluster-3 --enable-fleet --region=us-central1

gcloud container fleet memberships bindings create app-cluster-3-b \
    --membership app-cluster-3 \
    --scope frontend \
    --location us-central1

argocd app list -o name
# Example Output: a new app shows up!
# argocd/app-cluster-1.us-central1.141594892609-webserver
# argocd/app-cluster-2.us-central1.141594892609-webserver
# argocd/app-cluster-3.us-central1.141594892609-webserver
Closing thoughts
In this blog post, we’ve shown you how to combine the power of GKE fleets, Argo CD, Connect Gateway, Workload Identity, and GKE Enterprise Teams to create a robust and automated multi-cluster platform. By leveraging these tools, you can streamline your Kubernetes operations, enhance security, and empower your teams to efficiently manage and deploy applications across your fleet.
However, this is just the beginning! There’s much more to explore in the world of multi-cluster Kubernetes. Here are some next steps to further enhance your setup:
Deep dive into GKE Enterprise Teams: Explore the advanced features of GKE Enterprise Teams to fine-tune access control, resource allocation, and namespace management for your teams. Learn more in the official documentation.
Secure your clusters with Connect Gateway: Delve deeper into Connect Gateway and Workload Identity to understand how they simplify and secure authentication to your clusters, eliminating the need for VPNs or complex network configurations. Check out this blog post for a detailed guide.
Master advanced deployment strategies: Explore advanced deployment strategies with Argo CD, such as blue/green deployments, canary releases, and automated rollouts, to achieve zero-downtime deployments and minimize risk during updates. This blog post provides a great starting point.
As you continue your journey with multi-cluster Kubernetes, remember that GKE fleets and Argo CD provide a solid foundation for building a scalable, secure, and efficient platform. Embrace the power of automation, GitOps principles, and team-based management to unlock the full potential of your Kubernetes infrastructure.
As AI models increase in sophistication, there’s increasingly large model data needed to serve them. Loading the models and weights along with necessary frameworks to serve them for inference can add seconds or even minutes of scaling delay, impacting both costs and the end-user’s experience.
For example, inference servers such as Triton, Text Generation Inference (TGI), or vLLM are packaged as containers that are often over 10GB in size; this can make them slow to download, and extend pod startup times in Kubernetes. Then, once the inference pod starts, it needs to load model weights, which can be hundreds of GBs in size, further adding to the data loading problem.
This blog explores techniques to accelerate data loading for both inference serving containers and downloading models + weights, so you can accelerate the overall time to load your AI/ML inference workload on Google Kubernetes Engine (GKE).
1. Accelerating container load times, using secondary boot disks to cache container images with your inference engine and applicable libraries directly on the GKE node.
2. Accelerating model weight load times, using Cloud Storage FUSE to read model data directly from Cloud Storage buckets.
3. Accelerating model weight load times, using Hyperdisk ML to serve model data from a high-throughput, network-attached disk.
The image above shows a secondary boot disk (1) that stores the container image ahead of time, avoiding the image download process during pod/container startup. For AI/ML inference workloads with demanding speed and scale requirements, Cloud Storage FUSE (2) and Hyperdisk ML (3) are options to connect the pod to model and weight data stored in Cloud Storage or on a network-attached disk. Let's look at each of these approaches in more detail below.
Accelerating container load times with secondary boot disks
GKE lets you pre-cache your container image into a secondary boot disk that is attached to your node at creation time. The benefit of loading your containers this way is that you skip the image download step and can begin launching your containers immediately, which drastically improves startup time. The diagram below shows container image download times grow linearly with container image size. Those times are then compared with using a cached version of the container image that is pre-loaded on the node.
Caching a 16GB container image ahead of time on a secondary boot disk has shown reductions in load time of up to 29x when compared with downloading the container image from a container registry. Additionally, this approach lets you benefit from the acceleration independent of container size, allowing for large container images to be loaded predictably fast!
To use secondary boot disks, first create the disk with all your images, create an image out of the disk, and specify the disk image while creating your GKE node pools as a secondary boot disk. For more, see the documentation.
Accelerating model weights load times
Many ML frameworks output their checkpoints (snapshots of model weights) to object storage such as Google Cloud Storage, a common choice for long-term storage. Using Cloud Storage as the source of truth, there are two main products to retrieve your data at the GKE-pod level: Cloud Storage Fuse and Hyperdisk ML (HdML).
When selecting one product or the other there are two main considerations:
Performance – how quickly can the data be loaded by the GKE node
Operational simplicity – how easy is it to update this data
Cloud Storage Fuse provides a direct link to Cloud Storage for model weights that reside in object storage buckets. Additionally there is a caching mechanism for files that need to be read multiple times to prevent additional downloads from the source bucket (which adds latency). Cloud Storage Fuse is appealing because there are no pre-hydration operational activities for a pod to do to download new files in a given bucket. It’s important to note that if you switch buckets that the pod is connected to, you will need to restart the pod with an updated Cloud Storage Fuse configuration. To further improve performance, you can enable parallel downloads, which spawns multiple workers to download a model, significantly improving model pull performance.
Hyperdisk ML gives you better performance and scalability than downloading files directly to the pod from Cloud Storage or other online location. Additionally, you can attach up to 2500 nodes to a single Hyperdisk ML instance, with aggregate bandwidth up 1.2 TiB/sec. This makes it a strong choice for inference workloads that span many nodes and where the same data is downloaded repeatedly in a read-only fashion. To use Hyperdisk ML, load your data on the Hyperdisk ML disk prior to using it, and again upon each update. Note that this adds operational overhead if your data changes frequently.
Which model and weight loading product you use depends on your use case; the table below provides a more detailed comparison of each. For example, Hyperdisk ML volumes are zonal, although data can be made regional with an automated GKE clone feature to make it available across zones, and updating the data means creating a new persistent volume, loading the new data, and redeploying the pods whose PVC references the new volume.
As you can see there are other considerations besides throughput to take into account when architecting a performant model loading strategy.
Conclusion
Loading large AI models, weights, and container images onto GKE can delay workload startup times. By using a combination of the three methods described above (secondary boot disks for container images, and Hyperdisk ML or Cloud Storage FUSE for models and weights), you can accelerate data load times for your AI/ML inference applications.
As generative AI evolves, we’re beginning to see the transformative potential it is having across industries and our lives. And as large language models (LLMs) increase in size — current models are reaching hundreds of billions of parameters, and the most advanced ones are approaching 2 trillion — the need for computational power will only intensify. In fact, training these large models on modern accelerators already requires clusters that exceed 10,000 nodes.
With support for 15,000-node clusters — the world’s largest — Google Kubernetes Engine (GKE) has the capacity to handle these demanding training workloads. Today, in anticipation of even larger models, we are introducing support for 65,000-node clusters.
With support for up to 65,000 nodes, we believe GKE offers more than 10X larger scale than the other two largest public cloud providers.
Unmatched scale for training or inference
Scaling to 65,000 nodes provides much-needed capacity to the world’s most resource-hungry AI workloads. Combined with innovations in accelerator computing power, this will enable customers to reduce model training time or scale models to multi-trillion parameters or more. Each node is equipped with multiple accelerators (e.g., Cloud TPU v5e node with four chips), giving the ability to manage over 250,000 accelerators in one cluster.
To develop cutting-edge AI models, customers need to be able to allocate computing resources across diverse workloads. This includes not only model training but also serving, inference, conducting ad hoc research, and managing auxiliary tasks. Centralizing computing power within the smallest number of clusters provides customers the flexibility to quickly adapt to changes in demand from inference serving, research and training workloads.
With support for 65,000 nodes, GKE now allows running five jobs in a single cluster, each matching the scale of Google Cloud’s previous world record for the world’s largest training job for LLMs.
Customers on the cutting edge of AI welcome these developments. Anthropic is an AI safety and research company that’s working to build reliable, interpretable, and steerable AI systems, and is excited for GKE’s expanded scale.
“GKE’s new support for larger clusters provides the scale we need to accelerate our pace of AI innovation.” – James Bradbury, Head of Compute, Anthropic
Innovations under the hood
This achievement is made possible by a variety of enhancements. For one, we are transitioning GKE from etcd, the open-source distributed key-value store, to a new, more robust key-value store based on Spanner, Google's distributed database that delivers virtually unlimited scale. On top of the ability to support larger GKE clusters, this change will usher in new levels of reliability for GKE users, providing improved latency of cluster operations (e.g., cluster startup and upgrades) and a stateless cluster control plane. By implementing the etcd API for our Spanner-based storage, we help ensure backward compatibility and avoid having to make changes in core Kubernetes to adopt the new technology.
In addition, thanks to a major overhaul of the GKE infrastructure that manages the Kubernetes control plane, GKE now scales significantly faster, meeting the demands of your deployments with fewer delays. This enhanced cluster control plane delivers multiple benefits, including the ability to run high-volume operations with exceptional consistency. The control plane now automatically adjusts to these operations, while maintaining predictable operational latencies. This is particularly important for large and dynamic applications such as SaaS, disaster recovery and fallback, batch deployments, and testing environments, especially during periods of high churn.
We’re also constantly innovating on IaaS and GKE capabilities to make Google Cloud the best place to build your AI workloads. Recent innovations in this space include:
Secondary boot disk, which provides faster workload startups through container image caching
Custom compute classes, which offer greater control over compute resource allocation and scaling
Support for Trillium, our sixth-generation TPU, the most performant and most energy-efficient TPU to date
Support for A3 Ultra VM powered by NVIDIA H200 Tensor Core GPUs with our new Titanium ML network adapter, which delivers non-blocking 3.2 Tbps of GPU-to-GPU traffic with RDMA over Converged Ethernet (RoCE). A3 Ultra VMs will be available in preview next month.
A continued commitment to open source
Guided by Google’s long-standing and robust open-source culture, we make substantial contributions to the open-source community, including when it comes to scaling Kubernetes. With support for 65,000-node clusters, we made sure that all necessary optimizations and improvements for such scale are part of the core open-source Kubernetes.
Our investments to make Kubernetes the best foundation for AI platforms go beyond scalability. Here is a sampling of our contributions to the Kubernetes project over the past two years:
Incubated the K8S Batch Working Group to build a community around research, HPC and AI workloads, producing tools like Kueue.sh, which is becoming the de facto standard for job queueing on Kubernetes
Created the JobSet operator that is being integrated into the Kubeflow ecosystem to help run heterogeneous jobs (e.g., driver-executor)
For multihost inference use cases, created the Leader Worker Set controller
Published a highly optimized internal model server of JetStream
Incubated the Kubernetes Serving Working Group, which is driving multiple efforts including model metrics standardization, Serving Catalog and Inference Gateway
At Google Cloud, we’re dedicated to providing the best platform for running containerized workloads, consistently pushing the boundaries of innovation. These new advancements allow us to support the next generation of AI technologies. For more, listen to the Kubernetes podcast, where Maciek Rozacki and Wojtek Tyczynski join host Kaslin Fields to talk about GKE’s support for 65,000 nodes. You can also see a demo on 65,000 nodes on a single GKE cluster here.
Rapidly evolving generative AI models place unprecedented demands on the performance and efficiency of hardware accelerators. Last month, we launched our sixth-generation Tensor Processing Unit (TPU), Trillium, to address the demands of next-generation models. Trillium is purpose-built for performance at scale, from the chip to the system to our Google data center deployments, to power ultra-large scale training.
Today, we present our first MLPerf training benchmark results for Trillium. The MLPerf 4.1 training benchmarks show that Trillium delivers up to 1.8x better performance-per-dollar compared to prior-generation Cloud TPU v5p and an impressive 99% scaling efficiency (throughput).
In this blog, we offer a concise analysis of Trillium’s performance, demonstrating why it stands out as the most performant and cost-efficient TPU training system to date. We begin with a quick overview of system comparison metrics, starting with traditional scaling efficiency. We introduce convergence scaling efficiency as a crucial metric to consider in addition to scaling efficiency. We assess these two metrics along with performance per dollar and present a comparative view of Trillium against Cloud TPU v5p. We conclude with guidance that you can use to make an informed choice for your cloud accelerators.
Traditional performance metrics
Accelerator systems can be evaluated and compared across multiple dimensions, ranging from peak throughput, to effective throughput, to throughput scaling efficiency. Each of these metrics is a helpful indicator, but none of them takes convergence time into consideration.
Hardware specifications and peak performance
Traditionally, comparisons focused on hardware specifications like peak throughput, memory bandwidth, and network connectivity. While these peak values establish theoretical boundaries, they are poor predictors of real-world performance, which depends heavily on architectural design and software implementation. Since modern ML workloads typically span hundreds or thousands of accelerators, the key metric is the effective throughput of an appropriately sized system for specific workloads.
Utilization performance
System performance can be quantified through utilization metrics like effective model FLOPS utilization (EMFU) and memory bandwidth utilization (MBU), which measure achieved throughput versus peak capacity. However, these hardware efficiency metrics don’t directly translate to business-value measures like training time or model quality.
Scaling efficiency and trade-offs
A system’s scalability is evaluated through both strong scaling (performance improvement with system size for fixed workloads) and weak scaling (efficiency when increasing both workload and system size proportionally). While both metrics are valuable indicators, the ultimate goal is to achieve high-quality models quickly, sometimes making it worthwhile to trade scaling efficiency for faster training time or better model convergence.
The need for convergence scaling efficiency
While hardware utilization and scaling metrics provide important system insights, convergence scaling efficiency focuses on the fundamental goal of training: reaching model convergence efficiently. Convergence refers to the point where a model’s output stops improving and the error rate becomes constant. Convergence scaling efficiency measures how effectively additional computing resources accelerate the training process to completion.
We define convergence scaling efficiency using two key measurements: the base case, where a cluster of N₀ accelerators achieves convergence in time T₀, and a scaled case with N₁ accelerators taking time T₁ to converge. The ratio of the speedup in convergence time to the increase in cluster size gives us:
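Written out with the notation above, that ratio is:

convergence scaling efficiency = (T₀ / T₁) / (N₁ / N₀)

where T₀ / T₁ is the speedup in time-to-convergence and N₁ / N₀ is the growth in cluster size.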
A convergence scaling efficiency of 1 indicates that time-to-solution improves by the same ratio as the cluster size. It is therefore desirable to have convergence scaling efficiency as close to 1 as possible.
Now let’s apply these concepts to understand our ML Perf submission for GPT3-175b training task using Trillium and Cloud TPU v5p.
Trillium performance
We submitted GPT3-175b training results for four different Trillium configurations, and three different Cloud TPU v5p configurations. In the following analysis, we group the results by cluster sizes with the same total peak flops for comparison purposes. For example, the Cloud TPU v5p-4096 configuration is compared to 4xTrillium-256, and Cloud TPU v5p-8192 is compared with 8xTrillium-256, and so on.
All results presented in this analysis are based on MaxText, our high-performance reference implementation for Cloud TPUs and GPUs.
Weak scaling efficiency
For increasing cluster sizes with proportionately larger batch-sizes, both Trillium and TPU v5p deliver near linear scaling efficiency:
Figure-1: Weak scaling comparison for Trillium and Cloud TPU v5p. v5p-4096 and 4xTrillium-256 are considered as base for scaling factor measurement. n x Trillium-256 corresponds to n Trillium pods with 256 chips in one ICI domain. v5p-n corresponds to n/2 v5p chips in a single ICI domain.
Figure 1 demonstrates relative throughput scaling as cluster sizes increase from the base configuration. Trillium achieves 99% scaling efficiency even when operating across data-center networks using Cloud TPU multislice technology, outperforming the 94% scaling efficiency of the Cloud TPU v5p cluster within a single ICI domain. For these comparisons, we used a base configuration of 1024 chips (4x Trillium-256 pods), establishing a consistent baseline with the smallest v5p submission (v5p-4096; 2048 chips). When measured against our smallest submitted configuration of 2x Trillium-256 pods, Trillium maintains a strong 97.6% scaling efficiency.
Convergence scaling efficiency
As stated above, weak scaling is useful but not a sufficient indicator of value, while convergence scaling efficiency brings time-to-solution into consideration.
Figure-2: Convergence scaling comparison for Trillium and Cloud TPU v5p.
For the largest cluster size, we observed comparable convergence scaling efficiency for Trillium and Cloud TPU v5p. In this example, a convergence scaling efficiency of 0.8 means that for the rightmost configuration, the cluster size was 3x the base configuration, while the time to convergence improved by 2.4x with respect to the base configuration (2.4 / 3 = 0.8).
While the convergence scaling efficiency is comparable between Trillium and TPU v5p, where Trillium really shines is by delivering the convergence at a lower cost, which brings us to the last metric.
Cost-to-train
While weak scaling efficiency and convergence scaling efficiency indicate scaling properties of systems, we’ve yet to look at the most crucial metric: the cost to train.
Figure-3: Comparison of cost-to-train based on the wall-clock time and the on-demand list price for Cloud TPU v5p and Trillium.
Trillium lowers the cost to train by up to 1.8x (45% lower) compared to TPU v5p while delivering convergence to the same validation accuracy.
Making informed cloud accelerator choices
In this article, we explored the complexities of comparing accelerator systems, emphasizing the importance of looking beyond simple metrics to assess true performance and efficiency. We saw that while peak performance metrics provide a starting point, they often fall short in predicting real-world utility. Instead, metrics like Effective Model Flops Utilization (EMFU) and Memory Bandwidth Utilization (MBU) offer more meaningful insights into an accelerator’s capabilities.
We also highlighted the critical importance of scaling characteristics — both strong and weak scaling — in evaluating how systems perform as workloads and resources grow. However, the most objective measure we identified is the convergence scaling efficiency, which ensures that we’re comparing systems based on their ability to achieve the same end result, rather than just raw speed.
Applying these metrics to our benchmark submission with GPT3-175b training, we demonstrated that Trillium achieves comparable convergence scaling efficiency to Cloud TPU v5p while delivering up to 1.8x better performance per dollar, thereby lowering the cost-to-train. These results highlight the importance of evaluating accelerator systems through multiple dimensions of performance and efficiency.
For ML-accelerator evaluation, we recommend a comprehensive analysis combining resource utilization metrics (EMFU, MBU), scaling characteristics, and convergence scaling efficiency. This multi-faceted approach enables you to make data-driven decisions based on your specific workload requirements and scale.
Every November, we start sharing forward-looking insights on threats and other cybersecurity topics to help organizations and defenders prepare for the year ahead. The Cybersecurity Forecast 2025 report, available today, plays a big role in helping us accomplish this mission.
This year’s report draws on insights directly from Google Cloud’s security leaders, as well as dozens of analysts, researchers, responders, reverse engineers, and other experts on the frontlines of the latest and largest attacks.
Built on trends we are already seeing today, the Cybersecurity Forecast 2025 report provides a realistic outlook of what organizations can expect to face in the coming year. The report covers a lot of topics across all of cybersecurity, with a focus on various threats such as:
Attacker Use of Artificial Intelligence (AI): Threat actors will increasingly use AI for sophisticated phishing, vishing, and social engineering attacks. They will also leverage deepfakes for identity theft, fraud, and bypassing security measures.
AI for Information Operations (IO): IO actors will use AI to scale content creation, produce more persuasive content, and enhance inauthentic personas.
The Big Four: Russia, China, Iran, and North Korea will remain active, engaging in espionage operations, cyber crime, and information operations aligned with their geopolitical interests.
Ransomware and Multifaceted Extortion: Ransomware and multifaceted extortion will continue to be the most disruptive form of cyber crime, impacting various sectors and countries.
Infostealer Malware: Infostealer malware will continue to be a major threat, enabling data breaches and account compromises.
Democratization of Cyber Capabilities: Increased access to tools and services will lower barriers to entry for less-skilled actors.
Compromised Identities: Compromised identities in hybrid environments will pose significant risks.
Web3 and Crypto Heists: Web3 and cryptocurrency organizations will increasingly be targeted by attackers seeking to steal digital assets.
Faster Exploitation and More Vendors Targeted: The time to exploit vulnerabilities will continue to decrease, and the range of targeted vendors will expand.
Be Prepared for 2025
Read the Cybersecurity Forecast 2025 report for a more in-depth look at these and other threats, as well as other security topics such as post-quantum cryptography, and insights unique to the JAPAC and EMEA regions.
For an even deeper look at the threat landscape next year, register for our Cybersecurity Forecast 2025 webinar, which will be hosted once again by threat expert Andrew Kopcienski.
For even more insights, hear directly from our security leaders: Charles Carmakal, Sandra Joyce, Sunil Potti, and Phil Venables.
Have you heard of the monkey and the pedestal? Astro Teller, the head of Google’s X “moonshot factory,” likes to use this metaphor to describe tackling the biggest challenge first, despite being tempted by the endorphin boost of completing more familiar tasks.
It’s a challenge startups know well. When you’re re-inventing the industry standard, it’s all about failing fast. You’re looking for the quickest way to get to a “no” so you’re another step closer to reaching a “yes.” Every day you gain back from abandoning trivial features in favor of focusing on the biggest challenge becomes a day closer to your goal.
Fortunately, AI is not only playing an increasing role in the offerings of startups but also how they build those offerings, accelerating their execution and giving them new insights to act faster and iterate better.
What's the fastest way to get your product launched? Piecing together data across your front and back ends in yet another platform only creates latency and a poor user experience. Many of the successful, funded gen AI startups (more than 60% of whom are building on Google Cloud) are using Vertex AI as their development and production backbone to accelerate innovation. In this moment of rapid transformation, every day matters.
Our mission at Google Cloud is to support ambitious startups, like the three profiled below who are driving innovation in customer service, healthcare research, and identity verification. Abstrakt, NextNet, and Ferret are among the long list of startups using Google Cloud’s AI-optimized infrastructure and Vertex AI platform to accelerate their innovation.
NextNet
NextNet is a specialized search engine for life sciences and pharmaceutical researchers that uses AI to analyze vast amounts of biomedical data. Leveraging Google Cloud Vertex AI and Gemini, it identifies hidden relationships and patterns within scientific literature, allowing researchers to ask complex questions in plain language and receive accurate answers. This accelerates research and drives innovation in medicine by facilitating a deeper understanding of complex biomedical information.
Specifically, NextNet uses Gemini for natural language processing and knowledge extraction, outperforming other commercial AI models in this domain. It also utilizes Vertex AI and other managed services to efficiently develop SaaS offerings and scale its knowledge base.
“Gemini, as a production platform, has been incredibly useful and allowed us to evaluate scientific research with subtlety and clarity,” Steven Banerjee, the CEO of NextNet, said. “On our specific language tasks, Gemini has equaled or outperformed other commercial AI models. We are extracting scientific insights now that would not have been possible 12 or 18 months ago. And the iteration speed of Google’s generative models has meant that we are staying state of the art.”
Abstrakt
Abstrakt focuses on enhancing contact center customer experiences through the use of generative AI. They leverage Google Cloud’s robust infrastructure and the Vertex AI suite to transcribe calls in real-time while simultaneously evaluating sentiment.
Their mission is to empower teams to have more meaningful and effective conversations with customers in real time, helping both call center workers and their customers resolve issues faster, so even more can get the help they need. Abstrakt aims to achieve this by providing instantaneous guidance and insights during calls, transparent progress tracking, and AI-guided coaching, leading to continued improvement for workers and customers alike.
Ferret.ai
Ferret.ai is using AI to offer transparent insights about the backgrounds of people in your personal and professional network. In a world where reputational risks seem to be growing and rarely go away thanks to digital “receipts,” Ferret is using world-class global data alongside AI to provide a curated relationship intelligence and monitoring solution to help avoid risk and identify opportunities.
The unique platform built by Ferret.ai pieces together information and finds patterns, using generative AI to analyze each piece of information, verify the source, assess its credibility, and build the contextual understanding needed to identify sentiment. They also use pattern recognition to analyze vast datasets to uncover potential red flags or inconsistencies that could be missed by human analysts. This is valuable for investors, businesses, and individuals who want to avoid scams, make smart partnerships, and ensure their safety.
Faster innovation, faster time to market
These founders saw significant pain points and directed all of their resources to solving these problems for their customers. Deploying packaged back-end solutions, like Vertex AI’s unified development platform, benefited their speed to market. When Google Cloud takes care of model accuracy and performance, you’re freed up to own what you do best.
Your needs as a startup can evolve quickly based on the dynamics of the market. Importantly, our open ecosystem of models and APIs offers flexibility as you adapt and grow.
Go tackle your biggest challenges and let Google Cloud provide you with the most secure, fast, scalable platform so you can focus on the solutions that matter most to your users. For help getting started, you can apply for the Google Cloud for Startups Program or reach out to one of our startup specialists today.
At Google Cloud, we’re fortunate to partner with organizations that employ some of the world’s most talented and innovative professionals. Together, we’re reshaping industries, driving customer success, and pushing the boundaries of what’s possible. Our partners are more than collaborators — they’re the change-makers defining the future of business.
The Google Cloud Partner All-stars program celebrates these remarkable people. Each year, we recognize those who go above and beyond, leading with passion, innovation, and a commitment to excellence. These are the people driving our industry forward, and we’re thrilled to honor them for 2024.
2024 Spotlight: Artificial Intelligence
For 2024, we’re excited to introduce a new category that highlights the power and potential of Artificial Intelligence (AI). As AI redefines the business and technology landscape, we’re proud to recognize those who are not just using AI, but actively shaping its future.
The Artificial Intelligence category honors those visionary leaders spearheading AI initiatives with bold ideas, experimentation, and ethical stewardship. They’re bringing AI from concept to reality, unlocking new possibilities, and driving meaningful results for their clients. These Partner All-stars are building the future, one breakthrough at a time.
What sets Partner All-stars apart?
The following attributes define the standout qualities of a Partner All-star:
Artificial Intelligence
Provides a clear vision for AI’s transformative potential in the business
Champions AI initiatives by securing resources, driving adoption, and promoting collaboration
Leads experimentation with AI, generating innovative solutions and tangible results for clients
Demonstrates a commitment to ethical AI practices, ensuring responsible and fair use
Delivery excellence
Top-ranked individuals on Google Cloud’s Delivery Readiness Portal (DRP)
Demonstrates commitment to technical excellence by passing advanced delivery challenge labs and other advanced technical training
Demonstrates excellent knowledge and adoption of Google Cloud delivery enablement methodologies, assets, and offerings
Exhibits expertise through customer project and deployment experience
Marketing
Drives strategic programs and key events that address customer concerns and interests
Works across cross-functional teams to ensure the success of key campaigns and events
Takes a data-driven approach to marketing, investing time and resources in programs that drive the biggest impact
Always exploring areas of opportunity and improvement in order to uplevel future work
Sales
Consistently meets and exceeds sales goals and targets
Demonstrates commitment to the customer transformation journey
Aligns on shared goals to deliver amazing end-to-end customer experiences
Prioritizes long-term customer-relationship building over short-term selling
Solutions engineering
Delivers superior customer experiences by keeping professional skills up to date, earning at least one Google technical certification
Embraces customer challenges head-on, taking responsibility for end-to-end solutioning
Works with purpose, providing deliverables in a timely manner while never compromising quality
Works effectively across joint product areas, leveraging technology in new and innovative ways to address customer needs
Celebrating excellence in 2024
On behalf of the entire Google Cloud team, I want to extend our heartfelt congratulations to the 2024 Google Cloud Partner All-stars, who we have notified of this distinction. Their dedication, innovation, and leadership continue to inspire us and drive success for our customers.
Stay tuned as we celebrate this year’s Partner All-stars and join the conversation by following #PartnerAllstars across social media.
Leveraging first-party data, and data quality in general, are major priorities for online retailers. While first-party data certainly comes with challenges, it also offers a great opportunity to increase transparency, redefine customer interactions, and create more meaningful user experiences.
Here at PUMA, we’re already taking steps to seize the opportunities presented by signal loss as organizations embrace privacy-preserving technologies. Our motto “Forever.Faster.” isn’t just about athletic performance, it also describes our rapid response to market changes. In that aim, we’re partnering with Google Cloud to leverage the capabilities of machine learning (ML) for greater customer engagement via advanced audience segmentation.
Moving from manual segmentation to advanced audiences
In August 2022 we decided to test Google Cloud’s machine-learning capabilities to create advanced audiences based on high purchase propensity with different data sets in BigQuery. While Google Analytics offers predictive audiences, we used this pilot to build a custom ML model tailored to our specific needs, deepening our expertise and giving us more control over the underlying data. Designing our own machine learning model allows us to analyze and extract valuable insights from first-party data, enable predictive analytics, and attribute conversions and interactions to the right touchpoints.
The core products used in the process included Cloud Shell for framework setup, Instant BQML as the quick-start tool for audience configuration, CRMint for orchestration, and BigQuery for advanced analytics capabilities. The modeling and machine learning occur within BigQuery, while CRMint aids in data integration and audience creation within Google Analytics. When Google Analytics is linked to Google Ads, audience segments are shared automatically with Google Ads, where they can be activated in a number of strategic ways.
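For illustration, a purchase-propensity model of this kind can be trained and applied entirely within BigQuery using BigQuery ML from the command line; the project, dataset, table, and column names below are hypothetical placeholders, not PUMA's actual schema.

# Train a simple purchase-propensity model with BigQuery ML
# (project, dataset, table, and column names are illustrative placeholders).
bq query --use_legacy_sql=false '
CREATE OR REPLACE MODEL `my_project.audiences.purchase_propensity`
OPTIONS (model_type = "logistic_reg", input_label_cols = ["purchased"]) AS
SELECT device_category, traffic_source, session_count, pageviews_30d, purchased
FROM `my_project.analytics_export.training_sessions`'

# Score current visitors; high-propensity users can then be pushed to
# Google Analytics and Google Ads as an audience (for example, via CRMint).
bq query --use_legacy_sql=false '
SELECT user_pseudo_id, predicted_purchased
FROM ML.PREDICT(
  MODEL `my_project.audiences.purchase_propensity`,
  TABLE `my_project.analytics_export.current_sessions`)'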
The Google Cloud and gTech Ads teams worked closely with us throughout the set-up and deployment, which was fast and efficient. Generally speaking, we were impressed with the support we received throughout the process, which was highly collaborative from initiation to execution. The Google teams offered guidance and resources throughout, and their support enabled us to leverage the advanced analytics capabilities of BigQuery to build our own predictive audience model and identify the users most likely to make a purchase. We also appreciated the amount of available documentation, which made things much easier for our developers.
Engaging the right users with advanced analytics
This was one of the first ML marketing analytics use cases at PUMA, and it turned out to be a very positive experience. Within the first six months, the click-through rate (CTR) of our advanced audience segments was significantly higher than that of our website-visitor audiences or any other audience we used.
Among the 10 designated audiences, the top three showed a 149.8% increase in click-through rate compared to other audiences used for advertising. Additionally, we observed a 4.6% increase in conversion rate and a 6% increase in average order value (AOV) compared to the previous setup.
In addition to these results, which are helping us take steps towards increasing revenue, the new solutions are also enabling us to optimize and predict costs. Pricing is well structured, flexible, and transparent, and we can easily identify exactly where we’re spending money.
We’re looking forward to continuing to partner with Google Cloud as we work to adapt our advertising strategy to signal loss, which has been happening for years.
Our next step is to explore the development of advanced audiences using PUMA’s internal data, such as offline purchase information or other data not captured by Google Ads or Google Analytics. This opens up new opportunities to reach consumers we’re currently missing, while expanding the size of our audiences. At the same time, we’ll be scaling advanced audiences to all of our 20+ international entities.
We're also exploring server-side tagging using Tag Manager, and in one market we're experimenting with real-time reporting based on server-side data collection, with promising results so far.
We're looking to implement an event-driven architecture leveraging Google Cloud services, part of a broader strategy to reorganize and better structure our data-management processes so they can support and operationalize AI use cases for both our organization and our customers.
This project has opened our eyes to the possibilities of data-driven, machine learning automated audience creation. Added to this, the fact that it was so easy to deploy has bolstered our confidence when it comes to machine-learning projects in general. We look forward to a long-term partnership with Google Cloud and are excited to see where the future will take us.
In today’s data- and AI-driven world, organizations are grappling with an ever-growing volume of structured and unstructured data. This growth makes it increasingly challenging to locate the right data at the right time, and a significant portion of enterprise data remains undiscovered or underutilized — what’s often referred to as “dark data.” In fact, a staggering 66% of organizations report that at least half of their data falls into this category.
To address this challenge, today we’re announcing automatic discovery and cataloging of Google Cloud Storage data with Dataplex, part of BigQuery’s unified platform for intelligent data to AI governance. This powerful capability empowers organizations to:
Automatically discover valuable data assets residing within Cloud Storage, including structured and unstructured data such as documents, files, PDFs, images, and more.
Harvest and catalog metadata for your discovered assets by keeping schema definitions up-to-date with built-in compatibility checks and partition detection, as data evolves.
Enable analytics for data science and AI use cases at scale with auto-created BigLake, external or object tables, eliminating the need for data duplication or manually creating table definitions.
How it works
The automatic discovery and cataloging process in Dataplex is designed to be integrated and efficient, and performs the following steps:
Discovery scan: You configure a discovery scan using the BigQuery Studio UI or the gcloud CLI; the scan crawls your Cloud Storage bucket, which can contain up to millions of files, identifying and classifying data assets.
Metadata extraction: Relevant metadata, including schema definitions and partition information, is extracted from the discovered assets.
Creation of dataset and tables in BigQuery: A new dataset with numerous BigLake, external, or object tables (for unstructured data) is automatically created in BigQuery with accurate, up-to-date table definitions. For scheduled scans, these tables are updated as the data in the Cloud Storage bucket evolves.
Analytics and AI preparation: The published dataset and tables are available for analysis, processing, data science, and AI use cases in BigQuery, as well as open-source engines like Spark, Hive, and Pig.
Catalog integration: All BigLake tables are integrated into the Dataplex catalog, making them easily searchable and accessible.
Key benefits
Dataplex’s automatic discovery and cataloging feature offers a multitude of benefits for organizations:
Enhanced data visibility: Gain a clear understanding of your data and AI assets across Google Cloud, eliminating the guesswork and reducing the time spent searching for relevant information.
Reduced manual effort: Cut back on the toil and effort of creating table definitions manually by letting Dataplex scan the bucket and create numerous BigLake tables that correspond to your data in Cloud Storage.
Accelerated analytics and AI: Integrate the data that’s discovered into your analytics and AI workflows, unlocking valuable insights and driving informed decision-making.
Simplified data access: Provide authorized users with easy access to the data they need, while maintaining appropriate security and control measures.
Automatic discovery and cataloging in Dataplex marks a significant step forward in helping organizations unlock the full potential of their data. By eliminating the challenges associated with dark data and providing a comprehensive, searchable catalog of your Cloud Storage assets, Dataplex empowers you to make data-driven decisions with confidence.
We encourage you to explore this powerful new feature and experience the benefits firsthand. To learn more and get started, please visit the Dataplex documentation or contact our team for assistance.
At Google Cloud, we recognize that helping customers and government agencies keep tabs on vulnerabilities plays a critical role in securing consumers, enterprises, and software vendors.
We have seen the Common Vulnerabilities and Exposure (CVE) system evolve into an essential part of building trust across the IT ecosystem. CVEs can help users of software and services identify vulnerabilities that require action, and they have become a global, standardized tracking mechanism that includes information crucial to identifying and prioritizing each vulnerability.
As part of our continued commitment to security and transparency on vulnerabilities found in our products and services, effective today we will be issuing CVEs for critical Google Cloud vulnerabilities, even when we do not require customer action or patching.
To help users easily recognize that a Google Cloud vulnerability does not require customer action, we will annotate the CVE record with the “exclusively-hosted-service” tag. No action is required by customers in relation to this announcement at this time.
“Transparency and shared action, to learn from and mitigate whole classes of vulnerability, is a vital part of countering bad actors. We will continue to lead and innovate across the community of defenders,” said Phil Venables, CISO, Google Cloud.
Our commitment to vulnerability transparency
The Cyber Safety Review Board (CSRB) has found that a lack of a strong commitment to security creates preventable errors and serious breaches, a particular concern for major platform providers, who have a responsibility to advance security best practices. We can see why the CSRB emphasized best practices for cloud service providers in its report on Storm-0558, detailing how the APT group used forged authentication tokens to gain access to email accounts for around 25 organizations, including government agencies.
By partnering with the industry through programs including Cloud VRP, and driving visibility on vulnerabilities with CVEs, we believe we are advancing security best practices at scale. CVEs are publicly disclosed and can be used by anyone to track and identify vulnerabilities, which has helped our customers to understand their security posture better. Ultimately, issuing CVEs helps us build your trust in Google Cloud as a secure cloud partner for your enterprise and business needs.
As we noted in our Secure By Design paper, Google has a 20-year history of collaborating with external security researchers, whose independent work discovering vulnerabilities has been helpful to Google. Our vulnerability reporting process encourages direct engagement as part of our community-based approach to addressing security concerns.
This same community-focused journey took us down the path of launching our first CVE Numbering Authority in 2011. Since then, we’ve issued more than 8,000 CVEs across our consumer and enterprise products. We’ve since expanded our partnership with MITRE, and Google became one of their four Top-Level Roots in 2022.
Today’s announcement marks an important step Google Cloud is making to normalize a culture of transparency around security vulnerabilities, and aligns with our shared fate model, in which we work with our customers to continuously improve security.
While the Google Cloud VRP has a specific focus on strengthening Google Cloud products and services, and brings together our engineers with external security researchers to further the security posture for all our customers, CVEs enable us to help our customers and security researchers track publicly-known vulnerabilities.
Earlier this year, Google Cloud launched the highly anticipated C4 machine series, built on the latest Intel Xeon Scalable processors (5th Gen Emerald Rapids), setting a new industry-leading performance standard for both Google Compute Engine (GCE) and Google Kubernetes Engine (GKE) customers. C4 VMs deliver exceptional performance improvements and have been designed to handle your most performance-sensitive workloads, delivering up to a 25% price-performance improvement over the previous-generation general-purpose VMs, C3 and N2.
C4 VMs are already delivering impressive results for businesses. Companies like Verve, a creator of digital advertising solutions, are integrating C4 into their core infrastructure; in Verve's case, they're seeing remarkable results, with a 37% improvement in performance. For Verve, C4 isn't only about better performance: it's fueling their revenue growth.
Read on to discover how Verve leveraged C4 to achieve this success, including their evaluation process and the key metrics that demonstrate C4’s impact on their business.
Verve’s Challenge and Business Opportunity
Verve delivers digital ads across the internet with a platform that connects ad buyers to other ad-delivery platforms, as well as allows these advertisers to bid on ad space through a proprietary real-time auction platform. Real-time is the key here, and it’s also why C4 has made such a big impact on their business.
A marketplace for ad bidding is an incredibly latency and performance-sensitive workload. About 95% of the traffic hitting their marketplace, which runs on GKE, is not revenue generating, because the average ad fill-rate is only 5-7% of bids.
It takes a lot of cloud spend to fill bid requests that never generate revenue, and so any increase in performance or reduction in latency can have a tremendous impact on their business. In fact, the more performance Verve can get out of GKE, the more revenue they generate because the fill-rate for ads (successful bid/ask matching) grows exponentially.
Fast Facts on Verve and their ad-platform:
Verve’s GKE Architecture and C4 Evaluation Plan
Verve's marketplace ran on N2D machines leveraging an Envoy-based reverse proxy (Contour) for ingress and egress. Verve handles a high volume of traffic, with hundreds of millions of events daily (clicks, impressions, actions, in-app events, etc.).
This means they need to be able to scale their servers fast to handle traffic spikes and to control who has access to their servers and with which permissions. Verve has built its infrastructure on top of Kubernetes to allow elasticity and scalability, and they rely heavily on spot pricing to be cost effective.
To set up the benchmark, Verve ran a canary (one pod of the main application per node type) and measured two values: a performance metric exported from the application, the 99th-percentile vCPU time per ad request in milliseconds, and a spot-price metric given by the total compute price (vCPU + GB RAM).
Leveraging GKE Gateway to Save Costs and Improve Latency
Verve needs to scale their servers quickly to handle traffic spikes with the lowest latency possible, and for this they rely on GKE Gateway, which leverages Google's Envoy-based global load balancers. Their solution optimizes real-time bidding for ads, boosting revenue through improved response times and efficient cost management in a market where latency is correlated with bids and revenue, somewhat similar to high-frequency trading (HFT) in financial markets.
By migrating to GKE Gateway, Verve managed to improve its Total Cost of Ownership (TCO). Google only charges for traffic going through the Gateway, so Verve saw significant compute cost savings by not having to spin up GKE nodes for the proxies. Also, Verve saw a notable reduction in the burden of maintaining this GKE Gateway-based solution compared to an Ingress-based solution, which impacted their TCO. The cherry on top of it all is they saw improved latencies by 20-25% in the traffic itself and this generated 7.5% more revenue.
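As a rough sketch of this kind of setup (resource names and the backing Service are illustrative, not Verve's actual configuration), a Google-managed global external Gateway and a route to a bidding backend can be declared and applied like so:

# Declare a Google-managed (Envoy-based) global external Gateway and an
# HTTPRoute to a backend Service; names are illustrative.
cat <<'EOF' | kubectl apply -f -
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: marketplace-gateway
spec:
  gatewayClassName: gke-l7-global-external-managed
  listeners:
  - name: http
    protocol: HTTP
    port: 80
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: bid-requests
spec:
  parentRefs:
  - name: marketplace-gateway
  rules:
  - backendRefs:
    - name: bidder-service
      port: 8080
EOF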
Saving Costs While Achieving Better Performance with Custom Compute Classes
Anticipating their high season, Verve worked with their GCP Technical Account Manager to get onboarded in the Early Access Program of Custom Compute Classes, a new feature which Verve had been eagerly anticipating for years.
Custom Compute Classes (CCC) is a Kubernetes-native, declarative API that can be used to define fallback priorities for autoscaled nodes in case a top priority is unavailable (e.g. a spot VM). It also has an optional automatic reconciliation feature which can move workloads to higher priority node shapes if and when they become available.
This lets GKE customers define a prioritized list of compute preferences by key metrics like price/performance, and GKE automatically handles scale-up and consolidation onto the best options available at any time. Verve is using CCCs to help establish C4 as their preferred machine, but they also use it to specify other machine families to maximize their obtainability preferences.
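As a minimal sketch of such a priority list (the class name and machine families below are illustrative, not Verve's exact setup), a ComputeClass is declared as a Kubernetes resource that workloads can then select by name, for example through a cloud.google.com/compute-class nodeSelector:

# Illustrative ComputeClass: prefer C4 Spot capacity, fall back to N2D Spot,
# and migrate workloads back onto higher-priority shapes when they free up.
cat <<'EOF' | kubectl apply -f -
apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: cost-optimized
spec:
  priorities:
  - machineFamily: c4
    spot: true
  - machineFamily: n2d
    spot: true
  activeMigration:
    optimizeRulePriority: true
  whenUnsatisfiable: ScaleUpAnyway
EOF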
Pablo Loschi, Principal Systems Engineer at Verve, was impressed with the versatility his team was able to achieve:
“With Custom Compute Classes,” Loschi said, “we are closing the circle of cost-optimization. Based on our benchmarks, we established a priority list of spot machine types based on price/performance, and CCC enables us to maximize obtainability and efficiency by providing fall-back compute priorities as a list of preferred machines. We love how when out-of-stock machines become available again CCC reconciles to preferential infrastructure, finally eliminating the false dichotomy of choosing between saving costs and machine availability, even in the busy season.”
Verve’s Results and Business Impact
Verve benchmarked their marketplace running on GKE across several GCE machines. Today their marketplace runs on N2D machines, and by switching to C4 they saw a 37% improvement in performance.
They also switched from a self-managed Contour-Envoy proxy to GKE Gateway, which saw a dramatic improvement in latency of 20% to 25%, which translated into 7.5% more revenue since more bids are auctioned. GKE Gateway also allowed them to save a lot of compute costs because the load balancer doesn’t charge per compute but only per network. Additionally, they benefited from reduced manual burden of managing, updating, and scaling this solution.
“We were able to directly attribute the reduced latency to revenue growth — more bids are being accepted because they are coming faster,” Ken Snider, Verve VP of Cloud Infrastructure, said.
The combination of switching to C4 and GKE Gateway is driving their business’ revenue growth. “We started on a path a year ago talking with the product team from Google to help solve this problem, and now we are seeing it come together,” Snider said.
The next phase for Verve’s optimization journey is to improve their compute utilization, ensuring maximal usage of all deployed GKE nodes. GKE features such as Node Autoprovisioning and Custom Compute Classes will continue to play an important role in his team’s efforts in driving top-line growth for the business while being good stewards of their cloud costs.
C4 Brings Unparalleled Performance
C4 VMs are built on the latest Intel Xeon Scalable processors (5th Gen Emerald Rapids), delivering a significant performance leap for mission-critical and performance-sensitive workloads such as databases, gaming, financial modeling, data analytics, and inference.
Leveraging Google’s custom-designed Titanium infrastructure, C4 VMs provide high bandwidth and low latency networking for optimal performance with up to 200 Gbps of bandwidth, as well as high-performance storage with Hyperdisk. With C4, storage is offloaded to the Titanium adapter, reserving the host resources for running your workloads. And by leveraging hitless upgrades and live migration, the vast majority of infrastructure maintenance updates are performed with near-zero impact to your workloads, minimizing disruptions and providing predictable performance. For real-time workloads, C4 offers up to 80% better CPU responsiveness compared to previous generations, resulting in faster trades and a smoother gaming and streaming experience.
But C4 offers more than just powerful hardware; it’s a complete solution for performance-critical workloads. C4 VMs integrate seamlessly with Google Kubernetes Engine (GKE), enabling you to easily deploy and manage containerized applications at scale.
A range of machine types with varying vCPU and memory configurations are available to match your specific needs. And with its superior price-performance, C4 VMs deliver exceptional value, helping you optimize your cloud spend without compromising on performance.
If you run Google Kubernetes Engine (GKE), you know it’s important to secure access to the cluster control plane that handles Kubernetes API requests, so you can prevent unauthorized access while still being able to control the cluster.
Previously, GKE provided two primary methods to secure the control plane: authorized networks and disabling public endpoints. But when using these methods, it can be difficult to access the cluster. You need creative solutions such as bastion hosts to gain access via the cluster’s private network, and the list of authorized networks must be kept up to date across all clusters.
Today, we are excited to announce a new DNS-based endpoint for GKE clusters, allowing enhanced flexibility in access methods and security controls. The DNS-based endpoint is available today on every cluster, regardless of version or cluster configuration.
The new DNS-based endpoint addresses several of the current challenges associated with Kubernetes control plane access, including:
Complex IP-based firewall / allowlist configurations: IP address-based authorized network configuration / ACL is prone to human configuration error.
Static configurations based on IP addresses: As network configuration and IP ranges change, you need to change the authorized network IP Firewall configuration accordingly.
Proxy / bastion hosts: When accessing the GKE control plane from a remote network, different cloud location, or from a VPC that is different from the VPC where the cluster resides, you need to set up a proxy or bastion host.
These challenges have resulted in a complex configuration and a confusing user experience for GKE customers.
Introducing a new DNS-based endpoint
The new DNS-based endpoint for GKE provides a unique DNS name, or fully qualified domain name (FQDN), for each cluster control plane. The DNS name resolves to a frontend that is accessible from any network that can reach Google Cloud APIs, including VPC networks, on-premises networks, or other cloud networks. This frontend applies security policies to reject unauthorized traffic, and then forwards traffic to your cluster.
This approach provides a number of benefits:
1. Simple flexible access from anywhere
Using the DNS-based endpoint eliminates the need for a bastion host or proxy nodes. Authorized users can access your control plane from different clouds, on-prem deployments or from home without jumping through proxies. With DNS-based endpoints, there are no restrictions for transiting multiple VPCs, as the only requirement is access to Google APIs. If desired, you can still limit access to specific networks using VPC Service Controls.
2. Dynamic security
Access to your control plane over the DNS-based endpoint is protected via the same IAM policies used to protect all GCP API access. Using IAM policies, you can ensure that only authorized users can access the control plane, irrespective of which IP or network they use. If needed, you can simply revoke a particular identity’s access without worrying about network IP address configuration and boundaries. IAM roles can be customized to suit your organization’s needs.
For more details on the exact permissions required to configure IAM roles policies and authentication tokens, see Customize your network isolation.
3. Two layers of security
In addition to IAM policies, you can also configure network-based controls with VPC Service Controls, providing a multi-layer security model for your cluster control plane. VPC Service Controls adds context-aware access controls based on attributes such as network origin. You can achieve the same level of security as a private cluster that can only be accessed from a VPC network. VPC Service Controls are used by all Google Cloud APIs, aligning the security configuration of your clusters with your services and data hosted in all other Google Cloud APIs. You can make strong guarantees about preventing unauthorized access to services and data for all Google Cloud resources in a project. VPC Service Controls integrate with Cloud Audit Logs to monitor access to the control plane.
In our customer’s voice:
“When using private IP based control plane access we had to configure and manage a complex networking solution to be able to access the GKE control plane from our various VPC and data center sites. DNS-based GKE control plane access, in partnership with VPC Service Controls, will greatly simplify our access to the GKE control plane and help us implement a dynamic security based access policy based on Identity and Authentication.” – Keith Ferguson – ANZx Cloud & Engineering Platforms Technology Lead, ANZ Bank
How to configure DNS-based access
Configuring DNS-based access for the GKE cluster control plane is a straightforward process. Check out the following steps.
1. Enable the DNS-based endpoint
Enable DNS-based access for a new cluster with the following command:
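The sketch below uses placeholder names, and the --enable-dns-access flag reflects current gcloud releases, so verify it against gcloud container clusters create --help for your version.

# Create a cluster with the DNS-based control plane endpoint enabled
# (cluster name and location are placeholders; verify the flag in your gcloud version).
gcloud container clusters create my-cluster \
    --location=us-central1 \
    --enable-dns-access

# Or turn it on for an existing cluster
gcloud container clusters update my-cluster \
    --location=us-central1 \
    --enable-dns-access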
2. Grant an IAM role with the container.clusters.connect permission
Access to the control plane requires requests to be authenticated with a role that includes the new IAM permission container.clusters.connect. Assign one of the following IAM roles to your user:
roles/container.developer
roles/container.viewer
Here is an example of configuring a user to access the cluster with one of the predefined roles:
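The user and project IDs below are placeholders.

# Grant a predefined role that includes container.clusters.connect
gcloud projects add-iam-policy-binding my-project \
    --member=user:alice@example.com \
    --role=roles/container.developer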
Alternatively, you can configure a custom IAM role with the container.clusters.connect permission.
Here’s an example of configuring a role to use container.clusters.connect:
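A sketch follows; the role ID and project are placeholders, and the extra container.clusters.get permission is our assumption so that get-credentials can read cluster metadata.

# Create a custom role carrying the connect permission
# (role ID and project are placeholders; container.clusters.get is included
#  as an assumption so that get-credentials can read cluster metadata).
gcloud iam roles create gkeDnsConnect \
    --project=my-project \
    --title="GKE DNS endpoint access" \
    --permissions=container.clusters.connect,container.clusters.get

# Bind the custom role to a user
gcloud projects add-iam-policy-binding my-project \
    --member=user:alice@example.com \
    --role=projects/my-project/roles/gkeDnsConnect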
3. Ensure your client can access Google APIs
If your client is connecting from a Google VPC, you will need to ensure that it has connectivity to Google APIs. One way to do this is to activate Private Google Access, which allows clients to connect to Google APIs without going over the public internet. Private Google Access is configured on individual subnets.
Below, you can see an example of turning on Private Google Access in a subnet:
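The subnet name and region below are placeholders.

# Enable Private Google Access on the subnet your clients run in
gcloud compute networks subnets update my-subnet \
    --region=us-central1 \
    --enable-private-ip-google-access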
4. [Optional] Configuring access via Private Service Connect for Google APIs
The cluster's DNS endpoint can also be reached through the same Private Service Connect for Google APIs endpoint used for accessing the rest of the Google APIs. The "Access Google APIs through endpoints" page has all of the steps required to configure a Private Service Connect for Google APIs endpoint. Accessing the cluster's DNS name via a custom endpoint is not supported; therefore, as explained in the "Use an endpoint" section, you need to create an A record that maps *.gke.goog to the private IP address assigned to Private Service Connect for Google APIs, as well as a CNAME record pointing to *.gke.goog, to make it work.
Try out DNS-based access
Now you’re ready to try out DNS-based access. Use the following command to generate a kubeconfig file using the cluster’s DNS address:
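The cluster name and location below are placeholders, and the --dns-endpoint flag reflects current gcloud releases, so verify it against gcloud container clusters get-credentials --help for your version.

# Write a kubeconfig entry that targets the cluster's DNS-based endpoint
gcloud container clusters get-credentials my-cluster \
    --location=us-central1 \
    --dns-endpoint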
Then, use kubectl to access your cluster. You can use this directly from Cloud Shell to access clusters without a public IP endpoint, something that required setting up a proxy before.
Additional security with VPC Service Controls
Optionally, you can also use VPC Service Controls as an additional layer of security for your control plane access.
Here’s an example of an ingress policy that only allows requests from VMs in a specific GCP project:
What about the IP-based endpoint?
Of course, you can still use the existing IP-based endpoint to access the control plane, allowing you to try out DNS-based access without affecting your existing clients. Then, once you’re happy with DNS-based access, you can disable IP-based access, which provides additional security and simplified cluster management:
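A sketch of what that can look like; the flag name here is an assumption on our part, so check gcloud container clusters update --help for your release.

# Disable the IP-based control plane endpoint once DNS-based access is in place
# (flag name is an assumption; verify against your gcloud version).
gcloud container clusters update my-cluster \
    --location=us-central1 \
    --no-enable-ip-access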
With DNS-based endpoints, you gain increased flexibility in managing the security of your cluster control planes, while also reducing the complexity of accessing clusters from private networks.
To learn more about how to use DNS-based endpoints, check out these references:
The eleventh Flare-On challenge is now over! This year proved to be a tough challenge for the over 5,300 players, with only 275 completing all 10 stages. We had a blast making this contest and are happy to see it continue to be a world-wide phenomenon. Those who finished all stages this year may be eligible to receive this elite desk trophy, to the envy of their coworkers and family.
We would like to thank the challenge authors individually for their great puzzles and solutions:
frog – Nick Harbour (@nickharbour)
checksum – Chuong Dong (@cPeterr)
aray – Jakub Jozwiak
FLARE Meme Maker 3000 – Moritz Raabe (@m_r_tz)
sshd – Christopher Gardner (@t00manybananas)
bloke2 – Dave Riley (@6502_ftw)
fullspeed – Sam Kim
Clearly Fake – Blas Kojusner (@bkojusner)
serpentine – Mustafa Nasser (@d35ha)
CATBERT Ransomware – Mark Lechtik (@_marklech_)
This year’s challenge hosted 5,324 registered users, with 3,066 of them solving at least one stage. The difficulty curve ended up smoother than last year’s, with a nice progression of people falling off at stages 5, 7, and 9. Coincidentally, based on finisher feedback those were also the consensus favorite challenges this year. Does that mean we should up the difficulty next time?
Last year, Germany was far out in front on the leaderboard with 19 finishers to 2nd-place Singapore's 15. This year, Vietnam takes the lead with 21 finishers and the USA comes in second with 20.
All the binaries from this year’s challenge are now posted on the Flare-On website. Here are the solutions written by each challenge author:
Aible is a leader in generating business impact from AI in less than 30 days, helping teams use AI to extract enterprise value from raw enterprise data with solutions for customer acquisition, churn prevention, demand prediction, preventative maintenance, and more. After previously leveraging BigQuery’s serverless architecture to reduce analytics costs, Aible is now collaborating with Google Cloud to enable customers to build, train, and deploy generative AI models on their own data, securely and with confidence.
As awareness of the potential of generative AI expands in the market, the following key considerations have emerged:
Enabling enterprise-grade control: Organizations want to enable gen AI experiences on their enterprise data, but they also want to ensure they have control over their data, so it’s not inadvertently used to train AI models without their knowledge.
Minimizing and mitigating hallucinations: Another specific gen AI risk is the potential for models to hallucinate — generate non-factual or nonsensical content.
Empowering business users: While gen AI enables numerous enterprise use cases, some of the most valuable use cases focus on enabling and empowering business users to leverage gen AI models with as little friction as possible.
Scaling gen AI use cases: Enterprises need a way to harvest and operationalize their most promising use cases at scale and set up consistent best practices and controls.
Most organizations have a low-risk tolerance when it comes to data privacy, policy, and regulatory compliance. At the same time, they don’t consider delaying gen AI adoption as a viable option due to market and competitive pressures, especially given its promise for driving transformation. As a result, Aible wanted an AI approach that a wide variety of enterprise users could adopt and adjust quickly to a rapidly evolving landscape — all while keeping customer data secure.
Aible decided to leverage Vertex AI, Google Cloud’s AI platform, to ensure customers have confidence and full control over how their data is used and accessed when developing, training, or fine-tuning AI models.
Enabling enterprise-grade controls
Google Cloud’s design approach means customer data is secure by default on day 1 without requiring customers to take any additional actions. Google AI products and services provide security and privacy directly in your Google Cloud tenant projects. For instance, Vertex AI Agent Builder, Enterprise Search, and Conversation AI can all access and use secure customer data in Cloud Storage, which you can further secure using customer-managed encryption keys (CMEK).
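For instance, a Cloud Storage bucket for such data can be created with a customer-managed default encryption key; the bucket, location, and key resource names below are placeholders.

# Create a bucket whose objects are encrypted with a customer-managed key by default
# (bucket, location, and key resource name are placeholders).
gcloud storage buckets create gs://my-genai-corpus \
    --location=us-central1 \
    --default-encryption-key=projects/my-project/locations/us-central1/keyRings/my-ring/cryptoKeys/my-key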
Aible's Infrastructure as Code approach allows you to leverage all of the benefits of Google Cloud directly in your own projects in a matter of minutes. The end-to-end experience is completely secured in the Vertex AI Model Garden, whether you choose Google gen AI models like Gemini, third-party models from Anthropic and Cohere, or open models such as Llama or Gemma.
Aible also worked with its customer advisory board, which includes Fortune 100 companies, to design a solution that can invoke third-party gen AI models without exposing proprietary data outside of Google Cloud. Instead of sending raw data to an external model, Aible only sends high-level statistics on clusters, and this information can be masked as required. For example, it might send counts and averages based on geography or product instead of sending raw sales data.
This leverages a privacy technique called k-anonymity, which ensures data privacy by never providing information on groups of individuals smaller than k. You can change the default setting of k; the higher the k, the more private the information transfer. If masking is applied, Aible changes the name of a variable like “Country” to “Variable A” and a value like “Italy” to “Value X,” making the data transfer even more secure.
Mitigating hallucination risk
With gen AI, it’s critical to mitigate and reduce the risk of hallucinations using grounding, retrieval augmented generation (RAG), and other techniques. As a Built with Google Cloud AI partner, Aible provides automated analysis to augment human-in-the-loop review processes, empowering human experts with appropriate tools that can scale beyond manual efforts.
One of the primary ways Aible helps reduce hallucinations is by using its auto-generated Information Model (IM), an explainable AI that double-checks gen AI responses and confirms facts, at scale, against the context represented in your structured enterprise data to prevent wrong decisions.
Aible’s Information Model addresses hallucinations in two ways:
The IM grounds gen AI models in a relevant subset of information, which has been shown to help reduce hallucination.
Aible parses through gen AI outputs and compares them to millions of answers already known to the Information Model to double-check each fact.
This is similar to Google Cloud’s grounding capabilities on Vertex AI, which enables you to connect models to verifiable sources of information like your enterprise documents or the Internet to anchor responses in specific data sources. If a fact is automatically confirmed, it’s highlighted in blue — “If it’s blue, it’s true.” You can also check a specific pattern or variable and view a corresponding chart generated solely by the Information Model.
The diagram below shows how Aible and Google Cloud work together to deliver an AI-first end-to-end serverless environment. As Aible uses BigQuery to analyze and run serverless queries across millions of variable combinations with high efficiency, it is able to analyze any size dataset. For example, one of Aible and Google Cloud’s joint Fortune 500 customers was able to automatically analyze more than 75 datasets, with 100 million rows of data across 150 million questions and answers. The total cost of that evaluation was just $80.
By leveraging Vertex AI, Aible also has access to Model Garden, which includes Gemini and other leading open-source and third-party models. This means Aible is able to access non-Google gen AI models while benefiting from additional security layers, such as k-anonymity and masking.
Aible never gets access to your data, which remains securely in your Google Cloud project, along with all feedback, reinforcement learning, and Low-Rank Adaptation (LoRA) data.
Empowering business users
Today, the standard gen AI journey starts with identifying a shortlist of use cases, identifying the relevant datasets, collecting reinforcement learning examples from business users, and developing a gen AI application or experience for each specific use case. Gen App Builder massively simplifies this process, automating many of the data science and development tasks. However, one key issue remains with this entire process — business users are not currently empowered to do these tasks themselves even though they are the best suited for determining viable use cases, providing feedback on output, and evaluating gen AI models.
To unlock the innovation potential of your business users, it’s imperative that they understand gen AI techniques, how to use gen AI tools, and best practices for prompting. While investing in training is a valid solution, the analogous rise of the citizen data scientist has taken a decade and counting. Instead of training, Aible approached this problem with technology designed specifically for business users.
Aible’s ChatAible system replaces months of manual work, providing a new gen AI experience that allows business users to “just start chatting” with their enterprise data.
Aible worked with Google to iterate and discover the right Gemini and PaLM prompts for common prototype use cases, such as document summarization and question answering, and built them into Chat Templates. These templates automatically augment the actual user prompt to add in best practices. For example, you might choose an analytics template and simply ask, “How can I improve revenue?”
ChatAible detects existing enterprise data that can potentially help answer the question and, with your permission, analyzes the data or sends it to Vertex AI for summarization. It will then augment your original prompt, instructing the gen AI model to pay attention to the relevant data when it responds. ChatAible also applies model settings like temperature, according to best practices based on the model and use case. As different gen AI models evolve, Aible can simply update the Chat Templates, allowing all Aible customers to benefit instantly.
In addition, ChatAible also allows you to customize your personal preferences. For example, some users may prefer responses in bulleted list form while others prefer long-form descriptions. You can even choose to chat in a different language. The result is that ChatAible is personalized for each user, providing a better Day 1 experience that enables business users to get the most out of gen AI without having to go through extensive, targeted training.
1. Provide Feedback: Users are used to giving thumbs-up / thumbs-down feedback in ChatGPT or Gemini. They can do exactly the same thing in ChatAible, and the feedback data is stored in the customer’s Google Cloud project so it can later be used to improve models. But some kinds of model improvements require examples of good and bad chat responses, and such feedback typically has to be collected from business users by data scientists, usually manually using spreadsheets. In ChatAible, business users can simply edit a chat response into what they would have preferred it to be, and the system stores that feedback (along with any relevant retrieval augmented generation context) in the customer’s Google Cloud project so it can later be used to improve models. This is a far simpler process for gathering business-user feedback, and the simpler you make the process, the more relevant examples you can gather for model improvement.
2. Automatically Improve: There are many ways to improve models, such as reinforcement learning, fine-tuning, data encoding, and Low-Rank Adaptation (LoRA), and each technique requires a different amount of data. Improvements from some of these techniques, such as LoRA, can be stacked: if I developed a LoRA adapter that understands comedy better and you developed one that understands Spanish better, the two combined should understand Spanish comedy better. New approaches and best practices for improving models are being developed every day, and for business users this is too much complexity.
With ChatAible, users just specify whether they want the improvements to be stackable and ChatAible informs them how much more data they need to collect for different improvement techniques. Once they have sufficient data, they click a button and the model improvement is conducted using all of the capabilities of Vertex AI and the relevant Google tools for improving models.
Users can also edit the default prompt augmentations or model settings of the Chat Templates to see if they can improve the gen AI output. This is simple trial and error with immediate feedback, under the control of the business users. Because Aible is end-to-end serverless, users can easily try different combinations of VectorDB settings, model settings, prompt augmentation, and more in parallel as A-Z testing. Combinations with lower ratings, higher costs, or higher latencies can then be pruned easily, and user requests can be directed to better-performing combinations.
3. Secure: Organizations are concerned about users accidentally gaining access to data they are not authorized for through the fine-tuning process. Because of Aible’s LoRA-stacking and serverless design, fine-tuning can be based on feedback provided by users with specific roles, and users can be restricted to fine-tuned models that were never trained on data those users would not have had access to. Roles with higher privilege levels can then use models that stack LoRA adapters trained on feedback from users at lower permission levels. This way, organizations can ensure that fine-tuned models don’t accidentally breach their data security expectations.
4. Share: Today, business users can share individual chats and collaborate on them. They can also share model improvements or updated Chat Templates with other users in their organization. Aible’s services partners can share Chat Templates and updated models with their customers (with appropriate shared-customer permissions). Eventually, users will also be able to share Chat Templates and model improvements across organizations.
Scaling business use cases
Gen AI will be transformative for your business. The faster you get large numbers of people across your organization safely experimenting with it, the faster you can gain competitive advantage from it. As more teams identify use cases, it’s likely you will need an easy way to manage and scale gen AI across your organization.
Leveraging Google Cloud technologies, which are built for scale, Aible makes it easy to adapt and scale gen AI use cases and gen AI applications for wider adoption. For example, ChatAible’s Chat Templates make it easy to fine-tune and improve gen AI models, allowing you to easily change chat context according to new use cases simply by changing your templates. This allows you to provide a personalized, consistent experience for many different types of users across multiple gen AI use cases and enterprise applications.
The proof, ultimately, is in how quickly organizations can prototype new gen AI solutions. For example, multiple joint customers presented at the Gartner Data and Analytics Summit, sharing how they built gen AI solutions in a matter of days with Aible. Given recent estimates that as much as 30% of gen AI proofs of concept will be abandoned by 2025, there is a greater need than ever for rapid implementation and iteration. A failure encountered in a matter of days is never a catastrophe, merely a lesson learned on the way to gen AI success.
Gen AI success from day 1
Aible sets organizations up for success, empowering all users, including business users, to quickly get started on their generative AI use cases. Aible’s capabilities (backed and secured by Google Cloud technologies), such as templated best practices, prompt augmentation at scale, hallucination double-checking, and more, ensure business users have a productive and safe experience from the moment they start using gen AI. This kind of bottom-up, business-user-led innovation is key to realizing rapid gen AI value at scale.
Etsy, a leading ecommerce marketplace for handmade, vintage, and unique items, has a passion for delivering innovative and seamless experiences for customers. Like many fast-growing companies, Etsy needed to scale their teams, technologies, and tools to keep pace with their business growth. Indeed, between 2012 and 2021, their gross merchandise sales increased over 1,400% to $13.5 billion.
As part of Etsy’s efforts to keep pace with this growth, the company migrated all their infrastructure from traditional data centers to Google Cloud. This shift not only marked a significant technological milestone, but also prompted Etsy to rethink its service development approach. The journey led to the creation of “ESP” (“Etsy’s Service Platform”), a customized platform built on Google Cloud Run that streamlines the development, deployment, and management of microservices.
This blog post delves into Etsy’s experience building the service platform, how Cloud Run helped them accomplish their vision, the lessons they learned, and how their platform continues to evolve.
The need for change and architectural vision
As Etsy grew, so did the demand for our engineering organization to support richer functionality and higher traffic volume in our marketplace. Our migration to GCP in 2018 enabled Etsy engineers to explore and leverage Google Cloud-based service platforms; however, this explosion of technical creativity also gave rise to some new challenges, including duplicated scaffolding and code, and unsupported infrastructure with uncertain ownership.
To address these challenges, Etsy assembled a squad of architects to craft a vision detailing what future service development at Etsy would look like. The goal was clear: create a platform that decouples service writing from infrastructure, liberating developers from the burden of backend complexities and allowing them to quickly and safely deploy new services.
Transforming vision into reality
The resulting architectural vision became the blueprint for ESP, Etsy’s Service Platform, and a newly formed squad took on the exciting challenge of transforming the vision into reality. The first step was assembling a dynamic team capable of bridging the gap between infrastructure and application development. Comprising seasoned engineers with diverse expertise, the team brought a rich blend of skills to the table.
Recognizing the importance of aligning with our future platform customers, the team collaborated closely with Etsy architecture and engineering. The Ads Platform Team, already engaged in service development, played a pivotal role by agreeing to embed one of their senior engineers in the service platform team. Together, they delivered a Minimum Viable Platform (MVP) to support the deployment of a new Ads Platform service as the ESP pilot.
Choosing Cloud Run for accelerated development
A successful service platform, according to our architectural vision, would streamline the developer experience by decoupling infrastructure and automating its provisioning. The team recognized that our potential customers from the larger engineering organization also needed a platform that integrated into their workflow with as little friction as possible. To achieve this, the service platform team chose to focus on Etsy-specific aspects: developer experience and language support, CI/CD, integration with existing services, observability, service catalog, security, and compliance.
The decision to leverage Google Cloud services, especially Cloud Run, was strategic. While alternatives like GKE were enticing, the team wanted to deliver value quickly. Cloud Run’s robust and intuitive design allowed the team to focus on core platform functionality, letting Cloud Run handle the more complex and time-consuming aspects of running containerized services.
The Toolbox: A Closer Look
To provide a consistent and efficient developer and operational experience, ESP relies on a carefully selected toolbox:
Developer Interface: A custom CLI tool for streamlined developer interactions.
Protocols: gRPC and protobuf for standardized communication.
Language Support: Go, Python, Node, PHP, Java, Scala.
CI/CD: GitHub Actions for a smooth integration and deployment pipeline.
Observability: Leveraging OTEL on Google Cloud services and Google Monitoring and Logging, along with Prometheus and AlertManager.
Client Library: ESP-generated clients are registered in Artifactory.
Service Catalog: Utilizing Backstage for centralized service visibility.
Runtime: Cloud Run, chosen for its simplicity and compatibility.
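To give a concrete sense of how little ceremony Cloud Run asks for with a service like the ones ESP manages, here is a minimal, hypothetical sketch of deploying a containerized gRPC service and granting a caller invoke permission through Google IAM. The service name, image path, region, and service account are placeholders, and ESP’s custom CLI abstracts details like these away from developers.

# Hypothetical example: deploy a containerized gRPC service to Cloud Run.
# Service name, image path, and region are placeholders, not Etsy's actual configuration.
gcloud run deploy ads-search-service \
  --image=us-docker.pkg.dev/my-project/esp/ads-search-service:latest \
  --region=us-central1 \
  --use-http2 \
  --no-allow-unauthenticated

# Grant a calling service permission to invoke it (layer 7 auth via Google IAM).
gcloud run services add-iam-policy-binding ads-search-service \
  --region=us-central1 \
  --member=serviceAccount:caller@my-project.iam.gserviceaccount.com \
  --role=roles/run.invoker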
Navigating Challenges
The path to creating the service platform encountered obstacles. The VPC connector experienced overloading, and some services required fine-tuning to optimize resource allocation. However, these challenges led to platform-level improvements that benefit future adopters.
ESP’s design prioritized flexibility to accommodate our diverse technology landscape. While the team possessed expertise in various technologies, creating a one-size-fits-all platform supporting multiple service and client languages across diverse use cases was challenging. We decided to initially focus on a core feature set and add incremental capabilities and workarounds based on user feedback.
As ESP matured, valuable lessons shaped both day-to-day operations and its future evolution.
Sandbox Feature: A “sandbox” environment accelerated iteration, enabling developers to launch development versions of new services on Cloud Run in under five minutes, complete with CI/CD and observability.
Familiar Observability Tools: ESP integrated with our existing tools like PromQL and Grafana, streamlining workflows for engineers.
Security Considerations: While ESP favored TLS and layer 7 authentication using Google IAM, collaboration with the Google Serverless Networking team ensured secure connectivity with our legacy applications.
Supporting AI/ML Innovation: During a company-wide hackathon, ESP’s adaptability shone as a service interfacing with Google’s Vertex AI was rapidly deployed.
Real-World Success: The Ads Platform service expanded to three additional systems as client support in more languages rolled out. Cloud Run’s auto-scaling easily handled the increased load.
Conclusion and Future Outlook
ESP enables our engineers to be bold, fast, and safe, and is experiencing steady and continued adoption throughout the organization. Customer requests for workloads beyond the serverless model have spurred collaboration with Google and our internal GKE team. The goal is to extend ESP’s tooling to support an expanding class of services while maintaining a consistently high level of operational and developer experience.
The journey to pilot, challenges overcome, and future outlook highlight the dynamic and iterative nature of our service platform journey. ESP stands as a testament to our ability to adapt, innovate, and empower Etsy’s engineering community to meet the ever-growing needs of our marketplace and business.
As generative AI experiences explosive growth fueled by advancements in LLMs (Large Language Models), access to open models is more critical than ever for developers. Open models are publicly available pre-trained foundational LLMs. Platforms like Google Cloud’s Vertex AI, Kaggle and Hugging Face already provide easy access to open models to data scientists, ML engineers, and application developers.
Some of these models require powerful infrastructure and deployment capabilities, which is why today we’re excited to announce the capability to deploy and serve open models such as the Llama 3.1 405B FP16 LLM over GKE (Google Kubernetes Engine). Published by Meta, Llama 3.1 with 405 billion parameters demonstrates significant improvements in general knowledge, reasoning abilities, and coding proficiency. When run at FP (floating point) 16 precision to store and process 405 billion parameters, the model requires more than 750 GB of GPU memory for inference. The GKE solution described in this article makes the challenge of deploying and serving such large models easier to achieve.
Customer experience
As a Google Cloud customer, you can find the Llama 3.1 LLM by going to Vertex AI Model Garden and selecting the Llama 3.1 model tile.
After clicking the deploy button, you can select GKE and pick the Llama 3.1 405B FP16 model.
On this page, you can find the auto-generated Kubernetes YAML and detailed instructions for deploying and serving Llama 3.1 405B FP16.
Multi-host deployment and serving
The Llama 3.1 405B FP16 LLM requires more than 750 GB of GPU memory and presents considerable challenges for deployment and serving. In addition to the memory consumed by model weights, factors such as KV (Key-Value) cache storage and support for longer sequence lengths also contribute to the overall memory requirements. Currently the most powerful GPU offering on Google Cloud, the A3 virtual machine is equipped with eight NVIDIA H100 GPUs, each featuring 80 GB of HBM (High-Bandwidth Memory). For serving LLMs like the FP16 Llama 3.1 405B model, multi-host deployment and serving is the only viable solution. We use LeaderWorkerSet with Ray and vLLM to deploy over GKE.
LeaderWorkerSet
The LeaderWorkerSet (LWS) is a deployment API specifically developed to address the workload requirements of multi-host inference, facilitating the sharding and execution of the model across multiple devices on multiple nodes. Constructed as a Kubernetes deployment API, LWS is both cloud-agnostic and accelerator-agnostic, and can run on both GPUs and TPUs. LWS leverages the upstream StatefulSet API as its fundamental building block, as illustrated below.
Within the LWS architecture, a group of pods is managed as a singular entity. Each pod within this group is assigned a unique index ranging from 0 to n-1, with the pod bearing the index 0 designated as the leader of the group. The creation of each pod within the group is executed concurrently, and they share an identical lifecycle. LWS facilitates rollout and rolling updates at the group level. Each group is regarded as a single unit for rolling updates, scaling, and mapping to an exclusive topology for placement. The upgrade process for each group is executed as a single atomic unit, ensuring that all pods within the group are updated simultaneously. Co-location of all pods within the same group in the same topology is permissible, with optional support for topology-aware placement. The group is treated as a single entity in the context of failure handling as well, with optional all-or-nothing restart support. When enabled, all pods within the group will be recreated if a single pod in the group fails or if a single container within any of the pods is restarted.
Within the LWS framework, the concept of a replica encompasses a group consisting of a single leader and a set of workers. LWS supports dual templates, one designated for the leader and the other for the workers. LWS provides a scale endpoint for HPA, enabling dynamic scaling of the number of replicas.
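For orientation, here is a pared-down sketch of what a LeaderWorkerSet object can look like. The field names (replicas, leaderWorkerTemplate, size, leaderTemplate, workerTemplate) follow the LWS API, while the object name, container images, and resource values are illustrative placeholders rather than the manifest Model Garden generates.

# Sketch of an LWS object: one replica = 1 leader + (size - 1) workers.
# Images below are placeholders; replace them before applying.
cat <<'EOF' | kubectl apply -f -
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm-llama3-405b
spec:
  replicas: 1                    # number of leader+worker groups
  leaderWorkerTemplate:
    size: 2                      # pods per group: 1 leader + 1 worker (two A3 nodes)
    leaderTemplate:
      spec:
        containers:
        - name: vllm-leader
          image: <vllm-image>    # placeholder
          resources:
            limits:
              nvidia.com/gpu: "8"
    workerTemplate:
      spec:
        containers:
        - name: vllm-worker
          image: <vllm-image>    # placeholder
          resources:
            limits:
              nvidia.com/gpu: "8"
EOF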
Multi-host deployment with vLLM and LWS
vLLM is a popular open source model server and supports multi-node multi-GPU inference by employing tensor parallelism and pipeline parallelism. vLLM supports distributed tensor parallelism with Megatron-LM’s tensor parallel algorithm. For pipeline parallelism, vLLM manages the distributed runtime with Ray for multi-node inferencing.
Tensor parallelism involves horizontally partitioning the model across multiple GPUs, resulting in the tensor parallel size being equivalent to the number of GPUs within each node. It is important to note that this approach necessitates fast network communication among the GPUs.
On the other hand, pipeline parallelism vertically partitions the model by layer and does not demand constant communication between GPUs. Typically, this corresponds to the number of nodes employed for multi-host serving.
The combination of these parallelism strategies is essential to accommodate the entirety of the Llama 3.1 405B FP16 model. Two A3 nodes, each equipped with 8 H100 GPUs, will provide an aggregate memory capacity of 1280 GB, sufficient to accommodate the model’s 750 GB memory requirement. This configuration will also provide the necessary buffer memory for the key-value (KV) cache and support long context lengths. For this LWS deployment, the tensor parallel size is set to 8, while the pipeline parallel size is set to 2.
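To make that topology concrete, the vLLM server on the leader pod would be started with a tensor-parallel size of 8 and a pipeline-parallel size of 2, roughly as follows. The model ID and port are illustrative, and the Model Garden-generated manifest also handles the Ray cluster bootstrap across the two nodes for you.

# On the leader pod, once the Ray cluster spanning both A3 nodes is up:
# 8-way tensor parallelism within each node, 2-way pipeline parallelism across nodes.
python3 -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-405B-Instruct \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 2 \
  --distributed-executor-backend ray \
  --port 8080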
Summary
In this blog, we showed you how LWS gives you the essential capabilities required for multi-host serving. This technique can also serve smaller models, such as Llama 3.1 405B FP8, on more cost-effective machines, which optimizes price-to-performance ratios. To learn more, visit this blog post that shows how to pick a machine type that fits your model. LWS is open source and has strong community engagement – take a look at our GitHub to learn more and contribute directly.
As Google Cloud helps customers adopt gen AI workloads, you can come to Vertex AI Model Garden to deploy and serve open models on the managed Vertex AI backend or on GKE DIY (Do It Yourself) clusters. Our goal is to create a seamless customer experience, and multi-host deployment and serving is one example. We look forward to hearing your feedback.
With the technology advances of our cloud-first databases, Google Cloud has become the go-to platform for companies looking to run complex, real-time, business-critical workloads. Don’t just take our word for it. Today, we’re pleased to announce that Google was named a Leader in The Forrester Wave™: Translytical Data Platforms, Q4 2024 report. We believe this recognition solidifies AlloyDB as a top choice for translytical workloads, where it handles transactional, analytical, and AI workloads all on a single database.
The AlloyDB difference
Fueling this recognition is AlloyDB’s differentiated architecture, which combines the performance of a traditional relational database with the scalability and flexibility of cloud-first technology. AlloyDB also eliminates the need for complex data pipelines and separate databases for transactional, analytical, and gen AI workloads. Businesses can gain real-time, integrated insights and build AI-powered customer experiences without compromising operational efficiency, which ultimately accelerates innovation and drives better business outcomes.
But AlloyDB doesn’t stop there; it goes further. We built AlloyDB based on the following principles:
Cloud-first – built from the ground up for the cloud, leveraging the same infrastructure that powers Google’s billion-user products.
Embrace open standards – built on PostgreSQL, one of the most popular open-source relational databases. This lets you avoid vendor lock-in and leverage a vast ecosystem of tools and resources.
Superior scalability – designed to take advantage of Google Cloud’s disaggregated compute and storage architecture. This allows AlloyDB to scale both vertically and horizontally to support your most demanding workloads (scaling vertically to 128 vCPUs and horizontally to over 2,000 vCPUs with read pools).
No data movement – the fully integrated columnar engine automatically transforms operational row data into a columnar in-memory format to accelerate analytical queries. This means you don’t need separate systems for transactional and analytical processing, eliminating the need for ETL. AlloyDB’s columnar engine gives you up to 100x better performance for analytical queries compared to standard PostgreSQL (see the sketch after this list).
AI-powered insights – incorporates Google’s cutting-edge ScaNN vector search technology. AlloyDB AI delivers up to 4x faster vector queries and typically uses 3-4x less memory than the HNSW index in standard PostgreSQL. This helps large workloads run on smaller shapes and improves performance for hybrid workloads.
Multi-cloud and hybrid cloud – designed to excel in any environment. Deploy AlloyDB on-premises, at the edge, or even on your laptop with AlloyDB Omni. For a fully managed multi-cloud experience, Aiven for AlloyDB Omni offers a simplified and secure way to deploy, manage, and scale your database across Google Cloud, AWS, and Azure.
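As a rough illustration of the points above, adding horizontal read capacity and enabling the columnar engine are each a single gcloud operation. The cluster, instance, and region names below are placeholders; treat this as a sketch rather than a full setup guide.

# Add a read pool instance to an existing AlloyDB cluster (placeholder names and region).
gcloud alloydb instances create my-read-pool \
  --cluster=my-cluster \
  --region=us-central1 \
  --instance-type=READ_POOL \
  --read-pool-node-count=4 \
  --cpu-count=16

# Enable the columnar engine on the primary instance via a database flag.
gcloud alloydb instances update my-primary \
  --cluster=my-cluster \
  --region=us-central1 \
  --database-flags=google_columnar_engine.enabled=on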
Latest industry recognition
In addition to being named a Leader in The Forrester Wave™: Translytical Data Platforms, Q4 2024, Google received the highest scores possible in 11 of the evaluation criteria including Vision, Innovation, Gen AI/LLM, Real-time Analytics, Data Security, and several others.
Customers such as Chicago Mercantile Exchange (CME) Group, Bayer, and Tricent leverage AlloyDB, standardizing on an open-source-compatible, enterprise-grade database that is highly scalable and reliable, which gives them the ability to launch new products faster and expand into new markets.
“Google offers a strong vision and solid price performance at scale. … Reference customers value Google for its comprehensive analysis tools, advanced LLM capabilities, exceptional technical support, and documentation. Google’s AlloyDB is an excellent choice for firms looking for a reliable, scalable, and cost-effective database tailored for translytical workloads with a vision of evolving towards a more global and distributed architecture.” – The Forrester Wave™: Translytical Data Platforms, Q4 2024
Getting started is a breeze
If you’re looking for a database that can handle your most demanding translytical workloads with exceptional performance, availability and scale, AlloyDB is the clear choice, evolving with your business and providing the tools and capabilities you need to stay ahead of the curve.
If you’re new to Google Cloud, you can sign up for an AlloyDB free trial. If you’re already a Google Cloud customer, head over to the AlloyDB console. Once you’re there, click “Create a trial cluster” and we’ll guide you through migrating your data to a new AlloyDB database instance.
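If you prefer the CLI to the console, a minimal sketch of creating a cluster and a primary instance looks roughly like this. The names, region, and password are placeholders, and prerequisites such as enabling the AlloyDB API and configuring private services access are omitted.

# Create an AlloyDB cluster and a primary instance (placeholder values throughout).
gcloud alloydb clusters create my-cluster \
  --region=us-central1 \
  --password=CHANGE_ME \
  --network=default

gcloud alloydb instances create my-primary \
  --cluster=my-cluster \
  --region=us-central1 \
  --instance-type=PRIMARY \
  --cpu-count=4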
The future of PostgreSQL is here, and it’s built for you. Read more about AlloyDB in our brand-new e-book. Then start your free trial and see what AlloyDB can do for your applications.
With the general availability of Cloud NGFW Enterprise, Google Cloud customers are encouraged to transition from legacy VPC firewall rules to Cloud NGFW’s powerful and flexible firewall policies. In addition to the benefits we outlined in the earlier blog, Why you should migrate to network firewall policies from VPC firewall rules, you can now benefit from the enhanced network security controls included in Cloud NGFW Standard and Enterprise tiers by migrating to the new policy model.
To help make this process easy and smooth, we have developed a migration tool that automates most of the process. Let’s delve into the migration process and explore how to upgrade your network security infrastructure to harness the full potential of Google Cloud NGFW.
Transitioning to firewall policies: a simple case
Imagine a straightforward migration scenario with VPC firewall rules that don’t involve network tags or service accounts. We’ll use this simplified use case to demonstrate the power of the automated migration tool in streamlining the transition to firewall policies.
The migration process starts by scanning the configured VPC firewall rules and generating an equivalent firewall policy with corresponding rules. In cases where duplicate priorities exist within the original VPC firewall rules, the tool automatically adjusts the rule priorities to help provide a seamless and conflict-free transition.
Once the firewall policy is created and carefully reviewed, it has to be attached to the VPC. Make sure logging is enabled to monitor rule hit counts. To ensure that the new policy takes precedence, the enforcement order is switched, prioritizing the firewall policy over legacy VPC firewall rules. Continued monitoring of hit counts reveals the gradual shift towards the new rules, with the legacy rules eventually receiving zero hits. At this point you should be able to disable the old rules, validate possible negative impacts, then delete the old legacy VPC firewall rules.
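Assuming the tool produced a policy named migrated-policy for a VPC called my-vpc (both placeholders), the attach, switch, and clean-up sequence described above maps to gcloud commands along these lines:

# Attach the new firewall policy to the VPC (placeholder names).
gcloud compute network-firewall-policies associations create \
  --firewall-policy=migrated-policy \
  --network=my-vpc \
  --name=migrated-policy-assoc \
  --global-firewall-policy

# Evaluate the firewall policy before the legacy VPC firewall rules.
gcloud compute networks update my-vpc \
  --network-firewall-policy-enforcement-order=BEFORE_CLASSIC_FIREWALL

# Once legacy rules stop receiving hits, disable them, validate, then delete.
gcloud compute firewall-rules update my-legacy-rule --disabled
gcloud compute firewall-rules delete my-legacy-rule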
Once you’re satisfied with the new policy, you can use the migration tool to help you generate a corresponding Terraform script for this policy.
This marks a crucial point where you can add the advanced NGFW features into this base policy object, including IDPS, TLS inspection, geo-restrictions, FQDN-based filtering, address groups, and Google-managed Threat Intelligence IP lists. You can first test the features by directly modifying the policy objects associated with your network. Once the tests complete successfully, you can then manually add these new parameters into the Terraform script you had for the base policy before you incorporate it into your production CI/CD pipeline.
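As one hedged example of such an addition, the rule below denies egress to a specific domain using FQDN-based filtering; the priority, policy name, and domain are placeholders, and the same rules API covers the Threat Intelligence lists and geo-restrictions mentioned above.

# Placeholder example: deny egress to a specific FQDN in the migrated policy.
gcloud compute network-firewall-policies rules create 900 \
  --firewall-policy=migrated-policy \
  --global-firewall-policy \
  --direction=EGRESS \
  --action=deny \
  --layer4-configs=all \
  --dest-fqdns=example.com \
  --enable-logging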
Transitioning to firewall policies: a complex case
In the previous section, we discussed a relatively straightforward migration scenario with no dependencies. But most existing environments contain dependencies: VPC firewall rules that reference network tags and/or service accounts. Firewall policies do not support network tags, but they do support secure tags, which provide IAM controls. Service accounts are supported in a firewall policy as targets for applying a rule, but they cannot be used to evaluate rules themselves. In a traditional VPC firewall rule, a service account can be used as a source filter for ingress traffic; a firewall policy requires a tag to be used as a source filter instead.
Because of this change, some pre-work is required for a successful migration. To start, you can use the migration tool to identify all of the network tags and/or service accounts that are referenced in the VPC firewall rules, and output a JSON file for tag mapping.
Create the required secure tags for the list of network tags and/or service accounts listed in the mapping file. Edit the JSON file to map the network tags and the service accounts to the corresponding secure tags.
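Creating the secure tags themselves uses the resource-manager tags commands; the tag key needs the GCE_FIREWALL purpose and must be scoped to your network. The organization ID, project, network, and tag names below are placeholders.

# Create a secure tag key scoped to the VPC (placeholder IDs and names).
gcloud resource-manager tags keys create web-servers \
  --parent=organizations/123456789 \
  --purpose=GCE_FIREWALL \
  --purpose-data=network=my-project/my-vpc

# Create a tag value to stand in for the old "web" network tag.
gcloud resource-manager tags values create prod \
  --parent=123456789/web-servers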
Here is an example of updating the mapping file with the tag values of newly created tags, which replace network tags and service accounts.
Once the mapping file has been manually updated with the tag values for all secure tags, you are now ready to use the migration tool to bind these tags to the relevant VMs. This process binds secure tags to the instances with the network tags and/or service accounts that match the secure tags from the tag mapping file. Note that if there are managed instance groups (MIGs) that contain network tags, you need to manually update them to use the updated secure tags.
Now you can reference the tag-mapping file to the migration tool to generate the migrated firewall policy with all the secure tags mapped. As described in the section about the simple use case, you have two options for how to proceed with the migration: automatic firewall policy creation
Once the firewall policy is created by your choice of deployment, carefully review it. Ensure that the network tags and/or service accounts from VPC firewall rules are appropriately replaced by the corresponding secure tag.
Then, when you’re ready, associate the policy with the VPC and ensure logging is enabled to monitor rule hit counts. To ensure the new policy takes precedence, switch the enforcement order, prioritizing the firewall policy over legacy VPC firewall rules. Continued monitoring of hit counts will reveal the gradual shift towards the new rules, with the legacy rules eventually receiving zero hits. At this point you can disable the VPC firewall rules, validate possible negative impacts, and eventually delete the VPC firewall rules.
Similarly to the simple use case, this marks a crucial point where you can leverage firewall policies’ enhanced capabilities and integrate advanced features like IDPS, TLS inspection, geo-restrictions, FQDN-based filtering, Network Threat Intelligence, and more.
Advanced migrations: GKE VPC firewall rules
The migration tool attempts to migrate all existing VPC firewall rules to new firewall policy rules, with one exception: it skips VPC firewall rules created by Google Kubernetes Engine. GKE is unique in that it automatically creates VPC firewall rules when deploying a cluster, service, ingress, gateway, or HTTPRoutes.
That means that if your node pools use network tags, you’ll need to manually update the node pool configuration to use the corresponding secure tag.
From there, continue to use the steps for the simple use case for a dependency-free migration, or for the complex case to migrate and create your firewall policies with network tags/service accounts.
After validating the firewall policy, associate the VPC to the policy.
Then, because your firewall policy does not contain the GKE auto-generated rules, follow these three steps:
Keep the existing enforcement order and disable the user defined VPC firewall rules.
Manually create a firewall policy rule to allow/inspect ingress traffic destined to the GKE service IP.
Disable the GKE auto-generated allow ingress VPC firewall rule for GKE service (source: 0.0.0.0/0 and destination: load balancer IP).
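Steps 2 and 3 might look like the following, with placeholder priority, policy name, ports, load-balancer IP, and rule name; look up the actual auto-generated rule name with gcloud compute firewall-rules list.

# Step 2: allow ingress to the GKE service's load-balancer IP in the firewall policy
# (placeholder priority, policy, ports, and IP).
gcloud compute network-firewall-policies rules create 1000 \
  --firewall-policy=migrated-policy \
  --global-firewall-policy \
  --direction=INGRESS \
  --action=allow \
  --layer4-configs=tcp:443 \
  --src-ip-ranges=0.0.0.0/0 \
  --dest-ip-ranges=203.0.113.10/32 \
  --enable-logging

# Step 3: disable the corresponding GKE auto-generated VPC firewall rule (placeholder name).
gcloud compute firewall-rules update k8s-fw-EXAMPLE --disabled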
The auto-generated GKE VPC firewall rules that the tool excludes match the following regular expressions (regex):
With the exception of the GKE service firewall rule, all GKE workloads will continue to function with VPC firewall rules. Your non-GKE workloads and GKE service traffic will be processed by the firewall policy rules.
Hopefully you now have a good idea of how to proceed with your migration project: upgrading your VPC firewall rules to the new network firewall policies and leveraging the advanced NGFW feature set.