GCP – No work items left unturned: How Dataflow mitigates stragglers
Dataflow, Google Cloud’s fully managed streaming analytics service, has a proud tradition of getting the job done (pun intended) with completeness and with precision. Practically speaking, that means leaving no shard behind and doing things exactly once. Today, we are excited to announce three new features: straggler detection, hot key detection, and slow worker auto-mitigation.
What are stragglers and how does Dataflow help?
Stragglers are tasks that take significantly longer to complete than the average task of the same type. This can happen for a variety of reasons, such as hardware differences, network conditions, or uneven data distribution. Stragglers can cause a number of problems for data processing pipelines, including increased latency, reduced throughput, and increased costs.
Dataflow helps users deal with stragglers in three ways:
In-context tooling for detailed diagnosis: Dataflow provides a variety of observability tools to help users identify and diagnose stragglers.
Root-cause analysis: Dataflow analyzes stragglers to identify the root cause of the problem. This information can be used to prevent future stragglers from occurring.
Auto-mitigation: Wherever possible, Dataflow automatically prevents and mitigate stragglers with proactive load balancing and fixing unhealthy nodes.
This approach helps users minimize the impact of stragglers on their pipelines.
Dealing with stragglers: In-context observability
The first strategy for dealing with stragglers in Dataflow is in-context observability. We help users identify if there are stragglers and which node and worker they are occurring on. This allows users to quickly drill deeper into the pipeline to diagnose and mitigate the straggler.
Before we go further, it is helpful to go over some key terms:
Step: Reads, writes, and transformations defined by user code
Fusion: The process of Dataflow fusing multiple steps or transforms to optimize a pipeline
Stage: The unit of fused steps in Dataflow pipelines
New: Straggler detection
Dataflow’s new straggler detection feature helps customers identify when pipelines contain stragglers.
Batch detection
A work item is considered a straggler if ALL THREE conditions are satisfied:
It takes an order of magnitude longer to complete than other work items in the same stage.
It reduces parallelism within the stage.
It blocks new work from starting.
Streaming detection
A work item is considered a straggler if ALL THREE conditions are satisfied:
The stage the work item is running in has to have a watermark lag of > 10 min.
The work item has been processing for > 5 min.
The work item has been processing for 1.5 times longer than the average work item in that stage.
Once detected, stragglers are surfaced in the UI under the Execution Details tab (batch/streaming). You’ll be able to filter your logs to the user worker on which the straggler was detected during the incident’s time-frame (batch/streaming).
Dealing with stragglers: Root-cause analysis
Once a straggler is identified, Dataflow attempts to determine the cause where possible. There are a number of factors that can create stragglers, so pinpointing the underlying root cause saves debugging time and lets you focus on mitigating the issue.
New: Hot key detection
A hot key is a key that represents significantly more elements than other keys in the same PCollection (i.e., input data that is skewed). Hot keys can stifle Dataflow’s ability to perform work in parallel as they introduce long chains of sequential work into the parallel job.
Dataflow automatically detects when a hot key occurs. Additionally, when jobs are run with the `hotKeyLoggingEnabled` pipeline option enabled, Dataflow logs the detected key to help with debugging.
Hot keys typically cannot be fully remediated with performance levers such as horizontally or vertically scaling your workers. Instead, solving hot keys often requires evaluating your pipeline’s steps and narrowing in on where keys may become disproportionately distributed.
An example of a remediation strategy is the use of a shuffle. Given a dataset that has uneven distribution around a key, a user can add an extra sharding key. This further partitions the data into smaller chunks, thereby increasing parallelism.
Dealing with stragglers: Auto-mitigation
Your best defense against stragglers in Dataflow is auto-mitigation. This proactively avoids straggler impact to pipelines, with no intervention on your part required. As a fully managed service, Dataflow aims to provide “zero-knobs” solutions wherever possible so you can focus on your business logic without having to worry about managing the underlying infrastructure.
Dynamic work rebalancing
Dataflow’s dynamic work rebalancing feature can help to prevent stragglers by identifying them early and redistributing their work to other workers. It works by monitoring the progress of each worker in the pipeline and redistributing work to workers that are finishing faster. This helps to ensure that all of the workers are working at approximately the same rate.
New: Slow worker auto-mitigation
Dataflow auto-mitigates stragglers caused by slow workers. Many factors can cause worker slowness, such as CPU starvation, thrashing, problematic machine architecture, and stuck worker processes. On a slow worker, work items will be processed much more slowly than they should be and eventually result in a straggler.
When a slow worker is detected, the auto-mitigation feature simulates a host maintenance event to trigger the worker’s host maintenance policy for a live migration, restart, or stop. If a live migration occurs, the processed work on the slow worker will be transitioned seamlessly to a new worker without losing progress or any in-flight data.
If a slow worker is detected and successfully auto-mitigated, the following message displays in the Dataflow job-message logs: Slow worker … detected and automatically remediated …. Because slow workers are not stragglers but a cause of stragglers, once it’s remediated, you don’t need to take further action.
Getting started
To learn more about troubleshooting stragglers in Dataflow, review these guides:
https://cloud.google.com/dataflow/docs/guides/troubleshoot-batch-stragglers
https://cloud.google.com/dataflow/docs/guides/troubleshoot-streaming-stragglers
We thank the Google Cloud team members who co-authored the blog: Claire McCarthy, Software Engineer, Matthew Li, Software Engineer, Ning Kang, Software Engineer, Sam Rohde, Software Engineer.
Read More for the details.