GCP – Common GKE networking problems, and how to troubleshoot them
Google Kubernetes Engine (GKE) offers a powerful and scalable way to orchestrate containerized applications. Yet, as with any distributed system, networking complexities can present challenges, leading to connectivity issues. This blog post delves into common GKE networking problems, providing step-by-step troubleshooting techniques to address them.
Here are some common GKE connectivity issues that we see:
GKE Cluster control plane connectivity issues
Pods/Nodes in a GKE cluster cannot reach the control plane endpoint, possibly due to network issues.
Internal GKE communications
Pods can’t reach other pods or services inside the same VPC: Each pod in a GKE cluster gets a unique IP address. Connectivity between pods within the cluster may be disrupted, affecting application functionality.
Nodes can’t reach pods (or vice versa): A single node can host multiple pods, and a GKE cluster can have many nodes to distribute an application’s workload for scalability and reliability. Network issues can prevent nodes from communicating with the pods they host.
External communication issues
Pods can’t reach services on the internet: Internet connectivity problems can prevent pods from accessing external APIs, databases, or other resources.
External services can’t reach pods: Services exposed via GKE Load Balancers might be inaccessible from outside the cluster.
Communication beyond Cluster VPCs
Pods can’t reach resources in other VPCs: Connectivity issues may arise when pods need to interact with services in another VPC (within the same project or via VPC peering).
Pods can’t reach on-premises resources: Problems can occur when GKE clusters need to communicate with systems in your company’s data center (for example, when connecting over VPN or Hybrid Connectivity).
Troubleshooting steps
When a connectivity issue arises in your Google Kubernetes Engine (GKE) environment, there are specific steps you can take to resolve it. Refer to the troubleshooting tree below for an overview of the recommended process.
Step 1: Run connectivity tests
Connectivity Tests is a diagnostics tool that lets you check connectivity between network endpoints. It analyzes your configuration and, in some cases, performs live dataplane analysis between the endpoints. It helps verify whether the network path is correct and whether a firewall rule or route is breaking connectivity. See the accompanying video for more details on running Connectivity Tests.
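As a rough illustration of creating such a test from a script, here is a minimal sketch that only assembles a gcloud command line. The test name and IP addresses are hypothetical, and the flag names reflect our understanding of the gcloud CLI; depending on the endpoint type you may need additional flags (for example the source network or project), so check `gcloud network-management connectivity-tests create --help` before relying on it.

```python
import shlex


def connectivity_test_command(name, source_ip, dest_ip, dest_port, protocol="TCP"):
    """Build (but do not run) a gcloud command that creates a connectivity test."""
    return [
        "gcloud", "network-management", "connectivity-tests", "create", name,
        f"--source-ip-address={source_ip}",
        f"--destination-ip-address={dest_ip}",
        f"--destination-port={dest_port}",
        f"--protocol={protocol}",
    ]


# Hypothetical values: a pod IP as the source and an external endpoint as the destination.
cmd = connectivity_test_command("pod-to-api", "10.8.0.15", "203.0.113.10", 443)
print(shlex.join(cmd))  # copy the printed command into a shell to run it
```

Building the command as a list keeps each argument separate, which avoids quoting surprises if you later pass it to `subprocess.run`.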
Step 2: Isolate the issue
Create a GCE VM in the same subnet as your GKE cluster. Test connectivity to the external endpoint from this VM.
If connectivity works from the VM, the issue likely lies within your GKE configuration. If not, focus on VPC networking.
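One simple way to run the same check from both the VM and a pod is a small TCP probe. The sketch below uses only the Python standard library; the function name is our own, and the host and port you pass in are whatever endpoint you are investigating.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


# Example call (hypothetical endpoint):
#   can_connect("api.example.com", 443)
```

A True result from the VM combined with a False result from a pod points toward the GKE configuration rather than the VPC. Note that this only tests TCP reachability, not application-level health.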
Step 3: Troubleshoot your GKE configuration
Test connectivity from a GKE node. If it works from the node but not from a pod, investigate these areas:
IP Masquerading: Check whether it is enabled and running, and whether the ip-masq-agent ConfigMap aligns with your network setup. Note that traffic to the destinations listed under “nonMasqueradeCIDRs” in the ConfigMap is sent with the pod IP address as its source instead of the node IP address, so the destination endpoint must also allow traffic from the pod IP range. If there is no ConfigMap for ip-masq-agent and only an ip-masq-agent DaemonSet is running, traffic to all default non-masquerade destinations is sent from the pod IP address. For Autopilot clusters, this is configured using Egress NAT policies.
Network Policies: Review ingress/egress rules for potential blockages. Enable logging if Dataplane V2 is used.
iptables: Compare rules between working and non-working nodes. You can capture the rules by running “sudo iptables-save” on each node.
Service mesh: If you are running Cloud Service Mesh or Istio in your environment, try disabling istio-proxy sidecar injection for a test pod in the namespace and check whether the issue persists. If connectivity works after disabling sidecar injection, the issue likely lies in the service mesh configuration.
Note: Some of these steps, such as testing connectivity from a GKE node or checking iptables on a GKE node, do not work on GKE Autopilot clusters and apply only to Standard clusters.
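To make the IP masquerading behavior above concrete, the following sketch predicts which source address a destination will see for a given nonMasqueradeCIDRs list. It is a simplified model: the CIDR values are hypothetical, and it ignores other ip-masq-agent settings and Egress NAT policies.

```python
import ipaddress


def source_seen_by_destination(dest_ip, non_masquerade_cidrs):
    """Return "pod" if dest_ip matches a nonMasqueradeCIDRs entry, else "node"."""
    dest = ipaddress.ip_address(dest_ip)
    for cidr in non_masquerade_cidrs:
        if dest in ipaddress.ip_network(cidr):
            return "pod"  # not masqueraded: the pod IP stays as the source
    return "node"  # masqueraded: the source is rewritten to the node IP


# Hypothetical nonMasqueradeCIDRs copied from an ip-masq-agent ConfigMap.
cidrs = ["10.0.0.0/8", "172.16.0.0/12"]
print(source_seen_by_destination("10.20.30.40", cidrs))  # pod
print(source_seen_by_destination("192.168.5.9", cidrs))  # node
```

If the result is "pod", the firewall or allowlist at the destination needs to admit the pod IP range, not just the node range.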
Step 4: Pinpoint node-specific issues
If connectivity fails from a specific node:
Compare configurations: Ensure the node’s configuration matches that of working nodes.
Check resource usage: Look for CPU/memory or conntrack table issues.
Collect a sosreport from the affected node. This can help with root cause analysis (RCA).
If the issue is narrowed down to GKE nodes, you can use the logging filter below. Look for common errors by filtering down to a specific time window. Log entries such as connection timeouts, OOM kills (oom_watcher), “Kubelet is unhealthy”, and NetworkPluginNotReady can help steer troubleshooting in the right direction. For more queries like this, see GKE Node-level queries.
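For the conntrack check mentioned above, a minimal sketch is to read the kernel's conntrack counters on the node (Standard clusters only). It assumes the usual Linux proc paths, which exist only when the conntrack module is loaded; the function name and the directory parameter are our own.

```python
from pathlib import Path


def conntrack_usage(proc_dir="/proc/sys/net/netfilter"):
    """Return (count, max, fraction_used) read from the conntrack proc files."""
    base = Path(proc_dir)
    count = int((base / "nf_conntrack_count").read_text().strip())
    limit = int((base / "nf_conntrack_max").read_text().strip())
    return count, limit, count / limit


# Example call on a node:
#   count, limit, used = conntrack_usage()
#   print(f"conntrack: {count}/{limit} ({used:.0%} used)")
```

A fraction close to 1.0 suggests the table is nearly full, in which case new connections may be dropped.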
resource.type="k8s_node"
resource.labels.cluster_name="GKE_CLUSTER_NAME"
resource.labels.node_name="NODE_NAME"
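To narrow this filter to a time window and a set of error strings, you can generate it with a small helper. This is a sketch only: the function name, timestamps, and keywords are illustrative, and the output is meant to be pasted into Logs Explorer.

```python
def node_log_filter(cluster, node, start, end, keywords=()):
    """Build a Cloud Logging filter for one node, a time window, and optional keywords."""
    lines = [
        'resource.type="k8s_node"',
        f'resource.labels.cluster_name="{cluster}"',
        f'resource.labels.node_name="{node}"',
        f'timestamp>="{start}"',
        f'timestamp<="{end}"',
    ]
    if keywords:
        # Quoted terms joined with OR match entries containing any of them.
        lines.append("(" + " OR ".join(f'"{k}"' for k in keywords) + ")")
    return "\n".join(lines)  # separate lines are combined with an implicit AND


# Hypothetical cluster, node, and time window.
print(node_log_filter(
    "GKE_CLUSTER_NAME", "NODE_NAME",
    "2024-05-01T10:00:00Z", "2024-05-01T11:00:00Z",
    keywords=["NetworkPluginNotReady", "Kubelet is unhealthy"],
))
```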
Step 5: Address external communication
For external connectivity problems with a private GKE cluster, ensure Cloud NAT is enabled for both pod and node CIDRs.
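A common pitfall is a Cloud NAT gateway that covers only primary subnet ranges (nodes) and not the secondary range used by pods. The sketch below inspects a NAT configuration loaded as a dictionary, for example from `gcloud compute routers nats describe ... --format=json`. The field and enum names reflect our understanding of the Compute Engine router NAT resource and should be verified against your actual output; the subnet and range names are hypothetical.

```python
def nat_covers_nodes_and_pods(nat, subnet_name, pod_range_name):
    """Return (nodes_covered, pods_covered) for one subnet in a Cloud NAT config dict."""
    mode = nat.get("sourceSubnetworkIpRangesToNat")
    if mode == "ALL_SUBNETWORKS_ALL_IP_RANGES":
        return True, True
    if mode == "ALL_SUBNETWORKS_ALL_PRIMARY_IP_RANGES":
        return True, False  # primary (node) ranges only; the pod range is not NATed
    nodes = pods = False
    for sub in nat.get("subnetworks", []):
        # The subnetwork is typically reported as a URL ending in the subnet name.
        if sub.get("name", "").rsplit("/", 1)[-1] != subnet_name:
            continue
        ranges = sub.get("sourceIpRangesToNat", [])
        if "ALL_IP_RANGES" in ranges:
            return True, True
        nodes = nodes or "PRIMARY_IP_RANGE" in ranges
        pods = pods or (
            "LIST_OF_SECONDARY_IP_RANGES" in ranges
            and pod_range_name in sub.get("secondaryIpRangeNames", [])
        )
    return nodes, pods


# Hypothetical config: only primary ranges are NATed, so pods have no egress.
print(nat_covers_nodes_and_pods(
    {"sourceSubnetworkIpRangesToNat": "ALL_SUBNETWORKS_ALL_PRIMARY_IP_RANGES"},
    "gke-subnet", "pods",
))  # (True, False)
```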
Step 6: Address control-plane connectivity issues
Connectivity from nodes to the GKE cluster control plane (GKE master endpoint) depends on the type of GKE cluster (private, public, or PSC-based).
Most of the steps for checking control plane connectivity are similar to the troubleshooting steps described above for general connectivity issues, such as running connectivity tests to the cluster’s private or public control plane endpoint.
In addition, ensure that the source is allowed in the control plane authorized networks, and that control plane global access is enabled if the source is in a different region than the GKE cluster.
If traffic from outside GKE must reach the control plane on its private endpoint only, ensure that the cluster is created with --enable-private-endpoint. This flag indicates that the cluster is managed using the private IP address of the control plane API endpoint. Note that pods and nodes within the same cluster always try to connect to the GKE master on its private endpoint, irrespective of the public endpoint setting.
If you access the control plane of a GKE cluster A that has its public endpoint enabled from another private GKE cluster B (e.g., Cloud Composer), pods in cluster B always try to connect to the public endpoint of cluster A. In that case, make sure that private cluster B has Cloud NAT enabled for outside access and that the Cloud NAT IP ranges are allowed in the control plane authorized networks on cluster A.
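A quick way to sanity-check the authorized networks requirement is to test each source address (for example, the Cloud NAT IPs of cluster B) against the CIDR list configured on cluster A. The sketch below uses only the standard library; the addresses and ranges are hypothetical documentation values.

```python
import ipaddress


def is_source_authorized(source_ip, authorized_cidrs):
    """Return True if source_ip falls inside any authorized network CIDR."""
    src = ipaddress.ip_address(source_ip)
    return any(src in ipaddress.ip_network(cidr) for cidr in authorized_cidrs)


# Hypothetical authorized networks on cluster A and Cloud NAT IPs of cluster B.
authorized = ["203.0.113.0/28", "198.51.100.7/32"]
nat_ips = ["203.0.113.5", "203.0.113.40"]
blocked = [ip for ip in nat_ips if not is_source_authorized(ip, authorized)]
print(blocked)  # ['203.0.113.40']
```

Any address left in the blocked list would need to be added to the authorized networks before connections from it can succeed.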
Conclusion
The steps above address common connectivity issues and provide an initial troubleshooting framework. If the problem is complex or intermittent, a deeper analysis is required. This involves collecting packet captures on the affected node (Standard clusters only) or pod (both Autopilot and Standard clusters) at the time of the issue for a comprehensive root cause analysis. Reach out to Cloud Support for further assistance with these issues.