Balance of power: A full-stack approach to power and thermal fluctuations in ML infrastructure
The recent explosion of machine learning (ML) applications has created unprecedented demand for power delivery in the data center infrastructure that underpins those applications. Unlike server clusters in the traditional data center, where tens of thousands of workloads coexist with uncorrelated power profiles, large-scale batch-synchronized ML training workloads exhibit substantially different power usage patterns. Under these new usage conditions, it is increasingly challenging to ensure the reliability and availability of the ML infrastructure, as well as to improve data-center goodput and energy efficiency.
Google has been at the forefront of data center infrastructure design for several decades, with a long list of innovations to our name. In this blog post, we highlight one of the key innovations that allowed us to manage unprecedented power and thermal fluctuations in our ML infrastructure. This innovation underscores the power of full codesign across the stack — from ASIC chip to data center, across both hardware and software. We also discuss the implications of this approach and propose a call to action for the broader industry.
New ML workloads lead to new ML power challenges
Today’s ML workloads require synchronized computation across tens of thousands of accelerator chips, together with their hosts, storage, and networking systems; these workloads often occupy one entire data-center cluster, or even several. The peak power utilization of these workloads can approach the rated power of all the underlying IT equipment, making power oversubscription much more difficult. Furthermore, power consumption rises and falls between idle and peak utilization levels much more steeply, because the entire cluster’s power usage is now dominated by no more than a few large ML workloads. You can observe these power fluctuations when a workload launches or finishes, or when it is halted and then resumed or rescheduled. You may also observe a similar pattern while the workload is running normally, mostly attributable to alternating compute- and networking-intensive phases within a training step. Depending on the workload’s characteristics, these inter- and intra-job power fluctuations can occur very frequently, with multiple unintended consequences for the functionality, performance, and reliability of the data center infrastructure.
Fig. 1. Large power fluctuations observed at the cluster level with large-scale synchronized ML workloads
In fact, in our latest batch-synchronous ML workloads running on dedicated ML clusters, we observed power fluctuations in the tens of megawatts (MW), as shown in Fig. 1. Compared to a traditional load-variation profile, the ramps can be almost instantaneous, can repeat as frequently as every few seconds, and can persist for weeks… or even months!
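To make the contrast with a traditional cluster concrete, here is a deliberately simplified toy model in Python (all numbers, including the chip count and the idle and peak power levels, are invented for illustration and have no relation to the measurements in Fig. 1). It sums per-chip power for two cases: chips running uncorrelated workloads, whose swings largely cancel at the cluster level, and chips stepping through compute- and sync-bound phases in lockstep, whose swings add up.

```python
import numpy as np

rng = np.random.default_rng(0)

N_CHIPS = 20_000            # accelerators in the cluster (made-up number)
P_IDLE, P_PEAK = 0.2, 1.0   # normalized per-chip power at idle and at peak
STEPS = 200                 # observation intervals

# Traditional cluster: tens of thousands of independent workloads, so each
# chip's activity is uncorrelated and the fluctuations largely cancel out
# when summed over the cluster.
per_chip = rng.uniform(P_IDLE, P_PEAK, size=(STEPS, N_CHIPS))
uncorrelated = per_chip.sum(axis=1)

# Batch-synchronous training: every chip alternates between a compute-bound
# phase near peak power and a communication/sync phase near idle power at
# the same moment, so the per-chip swings add up instead of cancelling.
phase = np.tile([P_PEAK, P_IDLE], STEPS // 2)
synchronized = phase * N_CHIPS

for name, trace in (("uncorrelated", uncorrelated), ("synchronized", synchronized)):
    swing = (trace.max() - trace.min()) / N_CHIPS
    print(f"{name:>12}: cluster-level swing ~ {swing:.3f} x per-chip peak power")
```

In the uncorrelated case the cluster-level swing is a tiny fraction of the fleet’s rated power, while in the synchronized case it is essentially the full idle-to-peak range multiplied by the number of chips, which is why dedicated ML clusters see swings measured in megawatts.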
Fluctuations of this kind pose the following risks:
- Functionality and long-term reliability issues with rack and data center equipment (including but not limited to rectifiers, transformers, generators, cables, and busways), resulting in hardware-induced outages, reduced energy efficiency, and increased operational and maintenance costs
- Damage, outages, or throttling at the upstream utility, including violations of contractual commitments on power usage profiles and the corresponding financial costs
- Unintended and frequent triggering of the uninterruptible power supply (UPS) system by large power fluctuations, shortening the lifetime of the UPS system
Large power fluctuations may also impact hardware reliability at a much smaller, per-chip or per-system scale. Even when the maximum temperature is kept well under control, power fluctuations still translate into large and frequent temperature fluctuations, which can accelerate failure mechanisms such as package warpage, degradation of thermal interface materials, and electromigration.
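As general background (this is a textbook fatigue model, not something measured or cited in this post), the effect of repeated temperature swings on packages and solder joints is often approximated with a Coffin-Manson-style relation, in which the number of thermal cycles to failure falls off as a power of the swing amplitude:

```latex
N_f \propto (\Delta T)^{-n}, \qquad n \approx 2 \text{ or higher for many package-level failure modes}
```

Under such a model, halving the amplitude of a recurring temperature swing can extend cycling life several-fold, although the exact exponent depends heavily on the materials and the dominant failure mechanism.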
A full-stack approach to proactive power shaping
Due to the high complexity and large scale of our data-center infrastructure, we posited that proactively shaping a workload’s power profile could be more efficient than simply adapting to it. Google’s full codesign across the stack — from chip to data center, from hardware to software, and from instruction set to realistic workload — provides us with all the knobs we need to implement highly efficient end-to-end power management features to regulate our workloads’ power profiles and mitigate detrimental fluctuations.
Specifically, we instrumented the TPU compiler to detect workload signatures that are associated with power fluctuations, such as sync flags. We then dynamically balance the activities of the TPU’s major compute blocks around these flags to smooth out their utilization over time, mitigating power and thermal fluctuations with negligible performance overhead. In the future, we may apply a similar approach to a workload’s starting and completion phases, producing a gradual, rather than abrupt, change in power levels.
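The production compiler pass is not public, but the basic idea of limiting how quickly aggregate utilization is allowed to drop around sync points can be sketched in a few lines of Python. Everything below, including the shape_power helper, the max_drop parameter, and the toy utilization trace, is hypothetical and invented for illustration; it is not the TPU implementation, only a minimal sketch of the smoothing idea and its average-power cost.

```python
import numpy as np

def shape_power(utilization, max_drop=0.35):
    """Toy power-shaping pass: limit how fast utilization may fall.

    Sharp dips in the raw trace stand in for compute blocks idling while
    chips wait on a sync flag. Instead of letting utilization collapse,
    the shaped trace descends gradually, mimicking a compiler that spreads
    or adds compute activity around the sync point. Rises are left
    untouched, so compute-bound phases are never throttled.
    """
    shaped = np.array(utilization, dtype=float)
    for t in range(1, len(shaped)):
        floor = shaped[t - 1] - max_drop   # slowest allowed descent per interval
        shaped[t] = max(shaped[t], floor)  # fill the dip if it falls faster
    return shaped

# Made-up per-interval utilization: long compute-bound phases punctuated by
# brief sync/communication dips, loosely resembling a training step.
raw = np.array([0.9, 0.9, 0.9, 0.9, 0.2, 0.9, 0.9, 0.9, 0.9, 0.2, 0.9, 0.9])
shaped = shape_power(raw)

print("peak-to-trough swing:", round(raw.max() - raw.min(), 2),
      "->", round(shaped.max() - shaped.min(), 2))
print("average utilization :", round(raw.mean(), 3), "->", round(shaped.mean(), 3))
```

Tightening max_drop flattens the trace further but fills more of each dip, which mirrors the real tuning trade-off between fluctuation magnitude, average power, and training-step time described below.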
We’ve now implemented this compiler-based approach to shaping the power profile and applied it to realistic workloads. We measured the system’s total power consumption and a single chip’s hotspot temperature with and without the mitigation, as plotted in Fig. 2 and Fig. 3, respectively. In the test case, the magnitude of power fluctuations dropped by nearly 50% from the baseline to the mitigated case. The magnitude of temperature fluctuations also dropped from ~20 °C in the baseline case to ~10 °C in the mitigation case. We measured the cost of the mitigation as the increase in average power consumption and in the length of the training step. With proper tuning of the mitigation parameters, we achieve these benefits with a small increase in average power and less than 1% performance impact.
Fig. 2. Power fluctuation with and without the compiler-based mitigation
Fig. 3. Chip temperature fluctuation with and without the compiler-based mitigation
A call to action
ML infrastructure is growing rapidly and is expected to surpass traditional server infrastructure in total power demand in the coming years. At the same time, ML infrastructure’s power and temperature fluctuations are unique and tightly coupled to the characteristics of ML workloads. Mitigating these fluctuations is just one example of the many innovations we need to ensure reliable, high-performance infrastructure. In addition to the method described above, we’ve been investing in an array of innovative techniques to take on ever-increasing power and thermal challenges, including data center water cooling, vertical power delivery, power-aware workload allocation, and many more.
But these challenges aren’t unique to Google. Power and temperature fluctuations in ML infrastructure are becoming a common issue for hyperscalers, cloud providers, and infrastructure providers alike. We need partners at all levels of the system to help:
- Utility providers to set forth a standardized definition of acceptable power quality metrics, especially in scenarios where multiple data centers with large power fluctuations coexist on the same grid and interact with one another
- Power and cooling equipment suppliers to offer quality and reliability enhancements for electronic components, particularly for use conditions with large and frequent power and thermal fluctuations
- Hardware suppliers and data center designers to create a standardized suite of solutions, such as rack-level capacitor banks (RLCB) or on-chip features, to help establish an efficient supplier base and ecosystem
- ML model developers to consider the energy-consumption characteristics of their models and to add low-level software mitigations that help address energy fluctuations
Google has been leading and advocating for industry-wide collaboration on these issues through forums such as the Open Compute Project (OCP), to the benefit of the data center infrastructure industry as a whole. We look forward to continuing to share our learnings and collaborating on innovative new solutions together.
A special thanks to Denis Vnukov, Victor Cai, Jianqiao Liu, Ibrahim Ahmed, Venkata Chivukula, Jianing Fan, Gaurav Gandhi, Vivek Sharma, Keith Kleiner, Mudasir Ahmad, Binz Roy, Krishnanjan Gubba Ravikumar, Ashish Upreti and Chee Chung from Google Cloud for their contributions.