AI agents are here. Is your infrastructure ready?
Editor’s note: Today we hear from Dave McCarthy of IDC about a total cost of ownership crisis for AI infrastructure — and what you can do about it. Read on for his insights.
The AI landscape is undergoing a seismic shift. For the past few years, the industry has been focused on the massive, resource-intensive process of training generative AI models. But the focus is now rapidly pivoting to a new, even larger challenge: inference.
Inference — the process of using a trained model to make real-time predictions — is no longer just one part of the AI lifecycle; it is quickly becoming the dominant workload. In a recent IDC global survey of over 1,300 AI decision-makers, inference was already cited as the largest AI workload segment, accounting for 47% of all AI operations.
This dominance is driven by the sheer volume of real-world applications. While a model is trained periodically, it is used for inference non-stop, with every user query, API call, and recommendation. It is also critical to recognize that this inference surge will be distributed across hybrid environments. According to IDC survey respondents, 63% of workloads will reside in the cloud, which remains the standard for scalable applications like content creation and chatbots. In contrast, 37% will run on on-premises infrastructure, typically for use cases such as robotics and other systems that interact directly with the physical world.
Now, a new factor is set to multiply this demand: the rise of autonomous and semi-autonomous AI agents.
These “agentic workflows” represent the next logical step in AI, where models don’t just respond to a single prompt but execute complex, multi-step tasks. An AI agent might be asked to “plan a trip to Paris,” requiring it to perform dozens of interconnected operations: browsing for flights, checking hotel availability, comparing reviews, and mapping locations. Each of these steps is an inference operation, creating a cascade of requests that must be orchestrated across different systems.
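To make that cascade concrete, here is a minimal sketch of how such a workflow might be orchestrated. Everything in it is illustrative: the step list, the `call_model` placeholder, and the task decomposition are assumptions, not any specific agent framework. The point is simply that one user request fans out into many inference calls.

```python
# Hypothetical sketch: one "plan a trip" request fans out into many inference calls.
# call_model() stands in for any served-model endpoint; all names are illustrative.

def call_model(prompt: str) -> str:
    """Placeholder for a request to a deployed model (one inference operation)."""
    return f"<model output for: {prompt!r}>"

def plan_trip(destination: str) -> list[str]:
    # First inference call: ask the model to decompose the task.
    decomposition = call_model(f"List the steps needed to plan a trip to {destination}.")

    # For illustration, assume the model returned these sub-tasks.
    steps = ["search flights", "check hotel availability",
             "compare traveler reviews", "map key locations"]

    results = [decomposition]
    for step in steps:
        # Each sub-task is at least one more inference call, often wrapped
        # around a tool call (flight search, booking API, maps service).
        results.append(call_model(f"{step} for a trip to {destination}"))
    return results

if __name__ == "__main__":
    for output in plan_trip("Paris"):
        print(output)
```

A single prompt here turns into five model calls before any tool traffic is counted; real agents routinely run dozens.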
This surge in demand is exposing a critical vulnerability for many organizations: the AI efficiency gap.
The TCO crisis in an age of agents
The AI efficiency gap is the difference between the theoretical performance of an AI stack and the actual, real-world performance achieved. This gap is the source of a Total Cost of Ownership (TCO) crisis, and it’s driven by system-wide inefficiencies.
Our research shows that more than half (54.3%) of organizations use multiple AI frameworks and hardware platforms. While this flexibility seems beneficial, it has a staggering downside: 92% of these organizations report a negative effect on efficiency.
This fragmented “patchwork” approach, stitched together from disparate and non-optimized services, creates a ripple effect of problems:
- 41.6% reported increased compute costs: Redundant processes and poor utilization drive up spending.
- 40.4% reported increased engineering complexity: Teams spend more time managing the fragmented stack than delivering value.
- 40.0% reported increased latency: Bottlenecks in one part of the system (like storage or networking) degrade the overall performance of an application.
The core problem is that organizations are paying for expensive, high-performance accelerators, but are failing to keep them busy. Our data shows that 29% of all AI budget waste is tied to inference. This waste is a direct result of idle GPU time (cited by 29.4% of respondents) and inefficient use of resources (22.3%).
When an expensive accelerator is idle, it’s often waiting for data from a slow storage system or for the application server to prepare the next request. This is a system-level failure, not a component failure.
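A back-of-the-envelope calculation shows why that idle time hits TCO so hard. The hourly rate, throughput, and utilization figures below are made-up assumptions, purely for illustration: the effective cost of each inference scales inversely with how busy the accelerator actually is.

```python
# Illustrative only: how accelerator utilization drives effective cost per inference.
# The hourly rate, throughput, and utilization figures are assumed, not measured.

HOURLY_COST = 30.0          # assumed cost of one accelerator instance per hour ($)
PEAK_THROUGHPUT = 50_000    # assumed inferences per hour at full utilization

def cost_per_1k_inferences(utilization: float) -> float:
    """Effective cost per 1,000 inferences when the accelerator is busy only part of the time."""
    served = PEAK_THROUGHPUT * utilization
    return HOURLY_COST / served * 1_000

for u in (1.0, 0.7, 0.4):
    print(f"utilization {u:.0%}: ${cost_per_1k_inferences(u):.2f} per 1,000 inferences")

# At 40% utilization the same hardware costs 2.5x more per inference than at 100%,
# even though nothing about the accelerator itself changed.
```

The waste comes from the system around the chip, not the chip.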
This failure is often compounded by significant hurdles in data management, which serves as the fuel for these AI engines. Survey respondents highlighted three primary challenges contributing to this gap: 47.7% struggle with ensuring data quality and governance, 45.6% grapple with data storage management and related costs, and 44.1% cite the complexity and time required for data cleaning and preparation. When data pipelines cannot keep pace with high-speed accelerators, the entire infrastructure becomes inefficient.
Closing the gap: From fragmented stacks to integrated systems
To scale cost-effectively in the age of AI agents, we must stop thinking about individual components and start focusing on system-level design.
An agentic workflow, for example, requires tight coordination between two distinct types of compute:
- General-purpose compute: This is the operational backbone. It runs the application servers, orchestrates the workflow, pre-processes data, and handles all the logic around the model.
- Specialized accelerators: This is the high-performance engine that runs the AI model itself.
In a fragmented environment, these two sides are inefficiently connected, and latency skyrockets. The path forward is an optimized architecture where the software, networking, storage, and compute — both general-purpose and specialized — are designed to work as a single, cohesive system.
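One common pattern for tightening that coordination is to let the general-purpose tier handle pre-processing and batching so the accelerator tier receives steady, well-sized work instead of a trickle of single requests. The sketch below is a simplified illustration under assumed names (`preprocess`, `model_server_predict`, the batching policy); it is not any specific product's API.

```python
# Simplified illustration of splitting work between general-purpose compute
# (pre-processing, batching, orchestration) and an accelerator-backed model
# server. All names and the batching policy are assumptions for illustration.

import queue
import threading
import time

BATCH_SIZE = 8
MAX_WAIT_S = 0.02   # flush a partial batch after 20 ms to bound latency

request_q: "queue.Queue[str]" = queue.Queue()

def preprocess(raw: str) -> str:
    # CPU-side work: validation, tokenization, feature lookup, etc.
    return raw.strip().lower()

def model_server_predict(batch: list[str]) -> list[str]:
    # Stand-in for a call to the accelerator tier (e.g., an RPC to a model server).
    return [f"<prediction for {x!r}>" for x in batch]

def batching_loop() -> None:
    """Runs on the general-purpose tier: builds batches so the accelerator stays busy."""
    while True:
        batch, deadline = [], time.monotonic() + MAX_WAIT_S
        while len(batch) < BATCH_SIZE and time.monotonic() < deadline:
            try:
                batch.append(preprocess(request_q.get(timeout=MAX_WAIT_S)))
            except queue.Empty:
                break
        if batch:
            for prediction in model_server_predict(batch):
                print(prediction)

if __name__ == "__main__":
    threading.Thread(target=batching_loop, daemon=True).start()
    for i in range(20):
        request_q.put(f"Request {i}")
    time.sleep(0.5)   # give the batching loop time to drain the queue
```

The design choice being illustrated is the same one the integrated-system argument makes: the CPU tier's job is to keep the accelerator saturated, and that only works when the two tiers, the network between them, and the data path are designed together.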
This holistic approach is the only sustainable way to manage the TCO of AI. It redefines the goal away from simply buying faster accelerators and toward improving the overall “price-performance” and “unit economics” of the entire end-to-end workflow. By eliminating bottlenecks and maximizing the utilization of every resource, organizations can finally close the efficiency gap. Organizations are actively shifting strategies to capture this value. Our survey indicates that 28.9% of respondents are prioritizing model optimization techniques, while 26.3% are partnering with AI service providers to navigate this complexity. Additionally, 25% are investing in training to upskill their teams, ensuring they can increase the value of their AI investments.
The age of inference is here, and the age of agents is right behind it. This next wave of innovation will be won not by the organizations with the most powerful accelerators, but by those who build the most efficient, integrated, and cost-effective systems to power them.
A message from Google Cloud
We sponsored this IDC research to help IT leaders navigate the critical shift to the “Age of Inference.” We recognize that the “efficiency gap” identified here — driven by fragmented stacks and idle resources — is the primary barrier to sustainable ROI. That is why we created AI Hypercomputer: an integrated supercomputer system designed to deliver exceptional performance and efficiency for demanding AI workloads.
IDC surveyed 1,300 global IT leaders to uncover how they are designing their stack for maximum efficiency and ROI. Get your free copy of the whitepaper to learn more: The AI Efficiency Gap: From TCO Crisis to Optimized Cost and Performance.
