GCP – Meet Kubernetes History Inspector, a log visualization tool for Kubernetes clusters
Kubernetes, the container orchestration platform, is inherently a complex, distributed system. While it provides resilience and scalability, it can also introduce operational complexities, particularly when troubleshooting. Even with Kubernetes’ self-healing capabilities, identifying the root cause of an issue often requires deep dives into the logs of various independent components.
At Google Cloud, our engineers have been directly confronting this Kubernetes troubleshooting challenge for years as we support large-scale, complex deployments. In fact, the Google Cloud Support team has developed deep expertise in diagnosing issues within Kubernetes environments through routinely analyzing a vast number of customer support tickets, diving into user environments, and leveraging our collective knowledge to pinpoint the root causes of problems. To address this pervasive challenge, the team developed an internal tool: the Kubernetes History Inspector (KHI), and today, we’ve released it as open source for the community.
The Kubernetes troubleshooting challenge
In Kubernetes, each pod, deployment, service, node, and control-plane component generates its own stream of logs. Effective troubleshooting requires collecting, correlating, and analyzing these disparate log streams. But manually configuring logging for each of these components can be a significant burden, requiring careful attention to detail and a thorough understanding of the Kubernetes ecosystem. Fortunately, managed Kubernetes services such as Google Kubernetes Engine (GKE) simplify log collection. For example, GKE offers built-in integration with Cloud Logging, aggregating logs from all parts of the Kubernetes environment. This centralized repository is a crucial first step.
However, simply collecting the logs solves only half the problem. The real challenge lies in analyzing them effectively. Many issues you’ll encounter in a Kubernetes deployment are not revealed by a single, obvious error message. Instead, they manifest as a chain of events, requiring a deep understanding of the causal relationships between numerous log entries across multiple components.
Consider the scale: a moderately sized Kubernetes cluster can easily generate gigabytes of log data, comprising tens of thousands of individual entries, within a short timeframe. Manually sifting through this volume of data to identify the root cause of a performance degradation, intermittent failure, or configuration error is, at best, incredibly time-consuming, and at worst, practically impossible for human operators. The signal-to-noise ratio is incredibly challenging.
- aside_block
- <ListValue: [StructValue([(‘title’, ‘$300 in free credit to try Google Cloud containers and Kubernetes’), (‘body’, <wagtail.rich_text.RichText object at 0x3e47a7f95670>), (‘btn_text’, ‘Start building for free’), (‘href’, ‘http://console.cloud.google.com/freetrial?redirectpath=/marketplace/product/google/container.googleapis.com’), (‘image’, None)])]>
Introducing the Kubernetes History Inspector
KHI is a powerful tool that analyzes logs collected by Cloud Logging, extracts state information for each component, and visualizes it in a chronological timeline. Furthermore, KHI links this timeline back to the raw log data, allowing you to track how each element evolved over time.
The Google Cloud Support team often assists users in critical, time-sensitive situations. A tool that requires lengthy setup or agent installation would be impractical. That’s why we packaged KHI as a container image — it requires no prior setup, and is ready to be launched with a single command.
It’s easier to show than to tell. Imagine a scenario where end users are reporting “Connection Timed Out” errors on a service running on your GKE cluster. Launching KHI, you might see something like this:
First, notice the colorful, horizontal rectangles on the left. These represent the state changes of individual components over time, extracted from the logs – the timeline. This timeline provides a macroscopic view of your Kubernetes environment. In contrast, the right side of the interface displays microscopic details: raw logs, manifests, and their historical changes related to the component selected in the timeline. By providing both macroscopic and microscopic perspectives, KHI makes it easy to explore your logs.
Now, let’s go back to our hypothetical problem. Notice the alternating green and orange sections in the “Ready” row of the timeline:
This indicates that the readiness probe is fluctuating between failure (orange) and success (green). That’s a smoking gun! You now know exactly where to focus your troubleshooting efforts.
KHI also excels at visualizing the relationships between components at any given point in the past. The complex interdependencies within a Kubernetes cluster are presented in a clear, understandable way.
What’s next for KHI and Kubernetes troubleshooting
We’ve only scratched the surface of what KHI can do. There’s a lot more under the hood: how the timeline colors actually work, what those little diamond markers mean, and many other features that can speed up your troubleshooting. To make this available to everyone, we open-sourced KHI.
For detailed specifications, a full explanation of the visual elements, and instructions on how to deploy KHI on your own managed Kubernetes cluster, visit the KHI GitHub page. Currently KHI only works with GKE and Kubernetes on Google Cloud combined with Cloud Logging, but we plan to extend its capabilities to the vanilla open-source Kubernetes setup soon.
While KHI represents a significant leap forward in Kubernetes log analysis, it’s designed to amplify your existing expertise, not replace it. Effective troubleshooting still requires a solid understanding of Kubernetes concepts and your application’s architecture. KHI helps you, the engineer, navigate the complexity by providing a powerful map to view your logs to diagnose issues more quickly and efficiently.
KHI is just the first step in our ongoing commitment to simplifying Kubernetes operations. We’re excited to see how the community uses and extends KHI to build a more observable and manageable future for containerized applications. The journey to simplify Kubernetes troubleshooting is ongoing, and we invite you to join us.
Read More for the details.