
A Practitioner’s Guide to Troubleshooting Kubernetes with Mermin eBPF
February 10, 2026
The Kernel of the Connectivity Gap
In the era of ephemeral microservices, an IP address is a ghost and a standard trace is a half-truth. I’ve found that troubleshooting Kubernetes without eBPF is like trying to fix a car by looking only at the dashboard; you can see the engine light is on, but you can’t see the spark plugs. By bridging the "Network Gap" with Mermin Flow Traces, we move from guessing if a timeout is an application bug or a network policy drop to seeing the kernel-level reality. This guide explains how to use enriched metadata to turn opaque packet data into actionable service-to-service context, reducing MTTR (Mean Time to Resolution) by treating connectivity as a first-class observability pillar.
Why is Kubernetes network observability traditionally difficult?
Kubernetes network observability is difficult because traditional tools create a visibility gap between high-level application metrics and low-level packet data. Mermin bridges this gap by using eBPF to capture connection-level Flow Traces enriched with real-time Kubernetes metadata, enabling direct correlation between application latency and network events like TCP retransmissions or DNS timeouts.
While Kubernetes is the global standard for container orchestration, interpreting its complex and ephemeral traffic remains a significant obstacle for most engineering teams. Traditional APM traces provide service-level latency but lack the network-layer context required to diagnose if performance issues stem from infrastructure constraints or application code. By leveraging eBPF, Mermin provides transparent, low-overhead visibility into every connection, mapping network flows directly to pods, services, and namespaces. This granular data allows practitioners to move beyond simple metrics and see exactly how traffic moves across the cluster.
How does the "Network Gap" affect the MELT stack?
The network observability gap occurs when the standard MELT stack (Metrics, Events, Logs, and Traces) lacks the connection-level data needed to correlate application performance with network reality. While these pillars effectively monitor application logic, they frequently treat the network as an opaque black box. Consequently, engineers often struggle to determine if a slow trace span results from inefficient code or transient infrastructure issues like packet loss. Mermin resolves this by providing Flow Traces enriched with Kubernetes metadata to bridge the gap between infrastructure and application layers.
What information do Mermin Flow Traces provide for debugging?
Mermin Flow Traces provide bidirectional network visibility by capturing the Network 5-tuple, TCP state flags, and timing data at the kernel level. These traces are enriched with Kubernetes Metadata, such as pod names and namespaces, and exported as OpenTelemetry Spans. This data allows engineers to correlate network performance directly with specific microservices without modifying application code.
What are the core attributes of a Flow Trace?
A Flow Trace is a structured record that represents a bidirectional network conversation between two endpoints. Unlike simple metrics that only provide counters, Flow Traces include the state and context of the connection. Mermin captures these attributes using eBPF to ensure high-fidelity data collection with minimal system overhead.
Attribute Category | Key Data Points | Troubleshooting Value |
Network 5-Tuple | Source/Destination IP, Ports, Protocol | Identifies the specific communication path and transport protocol. |
Traffic Metrics | Byte/Packet Delta (Forward & Reverse) | Detects throughput imbalances, packet loss, or asymmetric routing. |
TCP State Tracking | SYN, FIN, RST, Connection State | Pinpoints exactly where a connection failed during the handshake or transfer. |
Flow Identity | Community ID, Flow Direction | Enables cross-tool correlation and identifies which side acted as the client. |
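As a small illustration of the cross-tool correlation value, the Community ID can be used to pivot from a packet capture or another flow tool straight to the matching Flow Trace. A minimal KQL sketch, assuming the spans are indexed in Elasticsearch; the Community ID value shown is the one from the example trace later in this guide:

flow.community_id: "1:LQU9qZlK+B+2dM2I2n1kI/M5a/g=" AND network.transport: "tcp"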
How does Kubernetes metadata enrich network data?
In a dynamic Kubernetes environment, raw IP addresses are insufficient for debugging because pods are ephemeral and frequently change their network identity. Mermin solves this by using informers to watch the Kubernetes API and map flows to specific workload identities. This enrichment transforms a "meaningless" IP-to-IP connection into a "meaningful" service-to-service interaction.
Pod Identity: Traces include the Pod name, namespace, and unique UID for precise targeting.
Workload Context: Flows are tagged with the owning Deployment, ReplicaSet, StatefulSet, or DaemonSet.
Service Mapping: Mermin identifies the specific Kubernetes Service that load-balanced the traffic to an endpoint.
Logical Grouping: Custom labels and annotations are preserved, allowing for filtering by environment (e.g., env=prod) or team.
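Once this enrichment is in place, queries can target workloads by identity rather than by IP. A hedged KQL sketch, assuming the spans are indexed in Elasticsearch under the pod and namespace attribute names used in the example trace below:

source.k8s.namespace.name: "production" AND destination.k8s.pod.name: "backend-*"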
What performance metrics are included for quality analysis?
Beyond basic connectivity, Mermin Flow Traces provide insights into the quality of the network path and the health of the application. By analyzing the timing and flags of packets at the kernel level, engineers can identify performance bottlenecks that are invisible to application-level monitoring.
Connection Integrity: TCP state flags (SYN, FIN, RST) reveal the health of the connection, helping to identify failed handshakes or mid-stream resets caused by infrastructure instability or security policies.
Connection Lifecycle: Flow start and end timestamps allow for the calculation of the exact duration of a network interaction.
Tunneling Details: Mermin provides visibility into encapsulated traffic, including VXLAN, Geneve, and WireGuard metadata used by various CNIs.
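As an example of the connection-integrity signal, flows that were reset mid-stream can be isolated with a short filter. A KQL sketch, assuming the flag values are indexed under the flow.tcp.flags.tags attribute referenced later in this guide:

network.transport: "tcp" AND flow.tcp.flags.tags: "rst"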
Example Flow Trace
Mermin exports data using the OpenTelemetry Protocol (OTLP), which is the industry standard for observability. The following JSON snippet represents a typical Flow Trace span. It demonstrates how network-level bytes and packets are integrated with Kubernetes pod names for a frontend-to-backend connection.
{
  "name": "flow_ipv4_tcp",
  "kind": "Client",
  "startTimeUnixNano": "1727149620000000000",
  "endTimeUnixNano": "1727149680000000000",
  "attributes": [
    { "key": "flow.community_id", "value": { "stringValue": "1:LQU9qZlK+B+2dM2I2n1kI/M5a/g=" } },
    { "key": "flow.direction", "value": { "stringValue": "forward" } },
    { "key": "flow.bytes.delta", "value": { "intValue": "1024" } },
    { "key": "flow.reverse.bytes.delta", "value": { "intValue": "32768" } },
    { "key": "flow.packets.delta", "value": { "intValue": "10" } },
    { "key": "flow.reverse.packets.delta", "value": { "intValue": "85" } },
    { "key": "source.address", "value": { "stringValue": "10.1.1.5" } },
    { "key": "source.port", "value": { "intValue": "54211" } },
    { "key": "source.k8s.pod.name", "value": { "stringValue": "frontend-abcde" } },
    { "key": "source.k8s.namespace.name", "value": { "stringValue": "production" } },
    { "key": "destination.address", "value": { "stringValue": "10.1.2.10" } },
    { "key": "destination.port", "value": { "intValue": "80" } },
    { "key": "destination.k8s.pod.name", "value": { "stringValue": "backend-xyz" } },
    { "key": "network.transport", "value": { "stringValue": "tcp" } },
    { "key": "network.type", "value": { "stringValue": "ipv4" } }
  ]
}
This record shows a 60-second observation window in which the frontend pod downloaded significantly more data (32 KB) than it sent (1 KB). The "Client" span kind confirms that the frontend initiated this specific connection. With this data, a practitioner can quickly verify whether a "slow backend" report is due to actual backend latency or a throughput bottleneck at the network layer. I recommend consulting the documentation on Mermin network semantic conventions for a complete reference of these attributes and their definitions.
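To pull this specific conversation up in Kibana, a query on the enriched pod names is enough; no IP addresses are required. A minimal KQL sketch using the pod names from the example span above:

source.k8s.pod.name: "frontend-abcde" AND destination.k8s.pod.name: "backend-xyz"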
How to identify TCP connection failures in Kubernetes?
Identifying TCP connection failures requires analyzing TCP state flags and connection lifecycles within Mermin Flow Traces. By isolating flows with a SYN flag but no corresponding ACK, or identifying abrupt RST (Reset) flags, engineers can distinguish between service timeouts, network policy blocks, and application crashes. Mermin exports these signals as OpenTelemetry Spans, making them searchable in backends like Grafana Tempo or Elasticsearch.
What are the common TCP failure signatures?
When troubleshooting a distributed system, network failures typically fall into three categories: handshakes that never complete, connections that are forcibly terminated, and packets that simply disappear. By mapping these to specific Flow Trace attributes, practitioners can rapidly determine the root cause of a connectivity issue.
Failure Mode | Attribute Pattern | Probable Cause |
Connection Timeout | flow.tcp.flags.tags: ["syn"] (and not "ack") | Traffic blocked by a Network Policy, Security Group, or Firewall. |
Connection Refused | flow.tcp.flags.tags: "rst" (with short flow duration) | The target process is not listening on the port, the pod is crashing, or the service is down. |
Idle Timeout | flow.end_reason: "idle timeout" | A Load Balancer or Service Mesh proxy closed the connection because no data was sent within the expected window. |
Slow Handshake | High flow.tcp.handshake.latency | Significant network congestion or resource contention on the source or destination node. |
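These signatures translate directly into saved searches. The KQL sketches below are hedged examples: they assume the flow.end_reason and flow.tcp.handshake.latency attributes are indexed exactly as named in the table above, and the handshake threshold is illustrative rather than a recommended value.

To find handshakes that never completed:

flow.tcp.flags.tags: "syn" AND NOT flow.tcp.flags.tags: "ack"

To find connections closed by an idle timeout:

flow.end_reason: "idle timeout"

To find slow handshakes (assuming the latency value is stored in milliseconds):

flow.tcp.handshake.latency > 500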
Analyzing packet count disparity
One of the most effective ways to identify network-layer issues in your Elasticsearch data is to compare the bidirectional packet counters. In a healthy TCP handshake, you expect to see packets in both the flow.packets.delta and flow.reverse.packets.delta fields. If the reverse count is zero while the forward count is greater than one, the client is stuck in the "SYN-SENT" state, retrying a connection that the destination is ignoring.
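A hedged KQL sketch of this check, using the counters described above (the greater-than-one threshold filters out single-SYN flows that may simply still be in progress):

network.transport: "tcp" AND flow.packets.delta > 1 AND flow.reverse.packets.delta: 0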
How to interpret a failed Flow Trace in a dashboard
Visualizing Mermin data in a tool like Kibana or Grafana provides a chronological view of the connection. When a failure occurs, the Flow Trace span will often display specific characteristics that indicate where the communication broke down.
Duration Analysis: A flow with a very short duration and an immediate RST flag suggests a "Connection Refused" error, meaning the packet reached the node but no process was listening.
Packet Count Disparity: If the flow.packets.delta shows multiple packets sent but flow.reverse.packets.delta is zero, the client is retrying the handshake because the server is not responding (typical of a Network Policy drop).
Span Kind Context: Use the SPAN_KIND_CLIENT attribute to verify which pod initiated the request. If a "reverse" reset is seen, the server-side pod or its sidecar proxy is explicitly rejecting the traffic.
By combining these visual cues with structured queries, platform engineers can reduce MTTR for complex networking issues that would otherwise require deep-dive packet captures.
How to debug Kubernetes DNS resolution with eBPF?
Debugging Kubernetes DNS resolution requires analyzing UDP Flow Traces directed at port 53 to identify timeouts, latency, or configuration-driven query volume. By capturing these ephemeral interactions at the kernel level, Mermin provides visibility into the communication between application pods and CoreDNS, allowing engineers to use bidirectional packet counters to distinguish DNS server saturation from client-side misconfigurations such as ndots overhead.
Identifying DNS failure signatures in Flow Traces
In Kubernetes, DNS issues typically manifest as intermittent application timeouts. Because DNS primarily utilizes the UDP protocol, there are no TCP-style flags to inspect. Instead, practitioners must analyze the bidirectional packet counters and the duration of the OTLP span to diagnose resolution bottlenecks.
DNS Issue | Flow Trace Attribute Signature | Root Cause |
DNS Timeout | flow.packets.delta > 0 AND flow.reverse.packets.delta: 0 | CoreDNS is not responding, or a Network Policy is blocking UDP/53. |
DNS Latency | High span duration for destination.port: 53 | CoreDNS pods are CPU-throttled or experiencing high request volume. |
ndots Overhead | High frequency of unique flow.community_id from one source | The ndots:5 default is causing too many redundant queries for external names. |
Resolution Failure | destination.address is external (No K8s Metadata) | The pod is bypassing internal DNS or failing to resolve a service name. |
Configuring Mermin for DNS Visibility
To effectively debug DNS, Mermin must be configured to monitor the interfaces where DNS traffic traverses. This includes veth pairs for pod-to-pod traffic and tunnel interfaces for inter-node communication. The following HCL configuration ensures Mermin captures traffic on the necessary interfaces and exports the resulting spans to Elasticsearch.
# Mermin configuration for DNS-ready observability
discovery "instrument" {
  interfaces = [
    "veth*",    # Standard Pod-to-Pod traffic
    "flannel*", # Flannel CNI overlays
    "cali*",    # Calico CNI interfaces
    "cilium_*"  # Cilium eBPF-based interfaces
  ]
}

export "traces" {
  otlp = {
    endpoint = "http://otel-collector.monitoring:4317"
    protocol = "grpc"
  }
}
Querying DNS flows with KQL in Elasticsearch
Once DNS traffic is indexed in Elasticsearch, you can use Kibana Query Language (KQL) to isolate the relationship between your application pods and the kube-dns service. Because Mermin enriches traces with Kubernetes metadata, you can identify the specific source pod causing a surge in DNS traffic without knowing its IP address.
To find all unresponsive DNS queries (Potential Timeouts):
network.transport: "udp" AND destination.port: 53 AND flow.packets.delta > 0 AND flow.reverse.packets.delta: 0
To isolate DNS traffic originating from a specific microservice:
source.k8s.pod.name: "frontend-*" AND destination.port: 53
To detect "ndots" overhead (Identifying search domain traversal):
Search for a high volume of unique flow.community_id values within a short time window originating from a single source.k8s.pod.name. While Mermin does not inspect the DNS packet payload, a surge in distinct flows to port 53 from one pod is a strong signature of a client traversing its DNS search path (e.g., trying google.com.svc.cluster.local before google.com).
Optimizing DNS Performance with Flow Data
By analyzing the volume and duration of DNS Flow Traces, platform engineers can make data-driven decisions to improve cluster stability:
Scale CoreDNS: If span durations for port 53 are trending upward cluster-wide, increase the number of CoreDNS replicas to handle the load.
Identify Heavy Talkers: Use Mermin metadata to find the specific Deployments generating the highest DNS query volume and implement a NodeLocal DNSCache to reduce cross-node traffic.
Audit External Traffic: If you see frequent port 53 flows to external destination addresses that lack Kubernetes metadata, your pods may be bypassing internal DNS, potentially leading to security or routing issues.
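The external-traffic audit in the last point can be expressed as a query. A minimal KQL sketch, assuming that flows to destinations outside the cluster carry no destination-side Kubernetes metadata:

network.transport: "udp" AND destination.port: 53 AND NOT destination.k8s.pod.name: *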
How to verify Kubernetes Network Policies with Flow Traces?
Verifying Kubernetes Network Policies requires comparing your declarative security rules against the real-time Flow Traces captured by Mermin. By analyzing kernel-level eBPF data, engineers can confirm if a policy is silently dropping unauthorized traffic (seen as a "syn" flag with no response) or if misconfigured labels are allowing unintended access. Mermin enriches these traces with NetworkPolicy metadata, providing definitive proof of enforcement that traditional metrics lack.
How does eBPF detect policy enforcement?
Traditional Kubernetes auditing often relies on log-based reporting that can be delayed or lack connection-level context. Mermin uses eBPF to capture every connection attempt at the network interface level. This allows platform engineers to see unauthorized connection attempts that are dropped by the CNI before they ever reach the application container.
By inspecting the flow.tcp.flags.tags array and bidirectional packet counters, you can distinguish between a successful connection and a policy-driven drop. A Network Policy configured to "Deny" traffic typically results in a client pod sending multiple SYN packets while the flow.reverse.packets.delta remains at zero. Mermin documents these attempts as Flow Traces, tagging them with the specific source pod and destination namespace involved in the violation.
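For example, to audit dropped connection attempts into a single sensitive namespace, a hedged KQL sketch (the "payments" namespace name is hypothetical; the attribute names follow the signatures discussed below):

destination.k8s.namespace.name: "payments" AND network.transport: "tcp" AND flow.tcp.flags.tags: "syn" AND flow.reverse.packets.delta: 0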
What are the common policy validation signatures?
Validating a NetworkPolicy involves looking for specific packet and flag patterns in your observability backend. The following table illustrates how different security outcomes appear within Mermin's data model.
Security Scenario | Flow Trace Attribute Signature | Troubleshooting Conclusion |
Correct Deny Rule | flow.tcp.flags.tags: ["syn"] AND flow.reverse.packets.delta: 0 | The policy is successfully dropping packets; the destination is invisible to the source. |
Misconfigured Allow | Bidirectional packets between pods in different namespaces | A label selector is too broad or a policy is missing, allowing unauthorized access. |
Policy Bypass | Flow exists with source.address but no Kubernetes metadata | The pod is likely using hostNetwork: true, bypassing standard CNI-level Network Policies. |
Active Rejection | flow.tcp.flags.tags: ["syn"] followed by flow.reverse.tcp.flags.tags: ["rst"] | The CNI is actively rejecting the connection (sending a Reset) rather than silently dropping it. |
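The "Active Rejection" pattern in particular is easy to surface as a saved search. A KQL sketch, assuming the reverse-direction flag values are indexed under the flow.reverse.tcp.flags.tags attribute named in the table above:

network.transport: "tcp" AND flow.tcp.flags.tags: "syn" AND flow.reverse.tcp.flags.tags: "rst"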
Configuring Mermin for Network Policy Visibility
To verify security rules, Mermin must be configured to watch the Kubernetes API for NetworkPolicy resources and associate them with specific flows. The following HCL configuration ensures the k8s_decorator has the necessary metadata to enrich every network interaction with policy context.
# Mermin configuration for Network Policy auditing
discovery "instrument" {
  interfaces = [
    "veth*",
    "cali*",
    "cilium_*",
    "flannel*",
    "lxc*"
  ]
}

# Full Kubernetes metadata enrichment
discovery "informer" "k8s" {
  selectors = [
    { kind = "Pod" },
    { kind = "Service" },
    { kind = "Namespace" },
    { kind = "NetworkPolicy" } # Essential for policy auditing
  ]
}

attributes "source" "k8s" {
  extract {
    metadata = [
      "[*].metadata.name",
      "[*].metadata.namespace",
      "pod.metadata.uid"
    ]
  }
  association {
    pod = {
      sources = [
        { from = "source.ip", to = ["status.podIP", "status.podIPs[*]", "status.hostIP", "status.hostIPs[*]"] },
        { from = "source.port", to = ["spec.containers[*].ports[*].containerPort", "spec.containers[*].ports[*].hostPort"] },
        { from = "network.transport", to = ["spec.containers[*].ports[*].protocol"] }
      ]
    }
    networkpolicy = {
      sources = [
        { from = "source.ip", to = ["spec.ingress[*].from[*].ipBlock.cidr", "spec.egress[*].to[*].ipBlock.cidr"] },
        { from = "source.port", to = ["spec.ingress[*].ports[*].port", "spec.egress[*].ports[*].port"] },
        { from = "network.transport", to = ["spec.ingress[*].ports[*].protocol", "spec.egress[*].ports[*].protocol"] }
      ]
    }
  }
}

attributes "destination" "k8s" {
  extract {
    metadata = [
      "[*].metadata.name",
      "[*].metadata.namespace",
      "pod.metadata.uid"
    ]
  }
  association {
    pod = {
      sources = [
        { from = "destination.ip", to = ["status.podIP", "status.podIPs[*]", "status.hostIP", "status.hostIPs[*]"] },
        { from = "destination.port", to = ["spec.containers[*].ports[*].containerPort", "spec.containers[*].ports[*].hostPort"] },
        { from = "network.transport", to = ["spec.containers[*].ports[*].protocol"] }
      ]
    }
    networkpolicy = {
      sources = [
        { from = "destination.ip", to = ["spec.ingress[*].from[*].ipBlock.cidr", "spec.egress[*].to[*].ipBlock.cidr"] },
        { from = "destination.port", to = ["spec.ingress[*].ports[*].port", "spec.egress[*].ports[*].port"] },
        { from = "network.transport", to = ["spec.ingress[*].ports[*].protocol", "spec.egress[*].ports[*].protocol"] }
      ]
    }
  }
}

export "traces" {
  otlp = {
    endpoint = "otel-collector.monitoring:4317"
    protocol = "grpc"
  }
}
How to correlate Flow Traces with kubectl data?
To confirm that a blocked flow is governed by a specific rule, correlate the source.k8s.pod.name or destination.k8s.namespace.name from the Flow Trace with your active policies. Use kubectl to list and describe the policies applied to the affected namespace to verify if the pod labels match the podSelector defined in your YAML.
List active Network Policies in a target namespace:
kubectl get networkpolicy -n elastiflow
Example Output:
NAME                   POD-SELECTOR   AGE
allow-frontend-to-db   app=mongodb    8m57s
Describe a specific policy to verify the ingress/egress logic:
kubectl describe networkpolicy allow-frontend-to-db -n elastiflow
Example Output:
Name:         allow-frontend-to-db
Namespace:    elastiflow
Created on:   2026-01-28 12:37:24 -0300 -03
Labels:       app=database
Annotations:  <none>
Spec:
  PodSelector:     app=mongodb
  Allowing ingress traffic:
    To Port: <any> (traffic allowed to all ports)
    From:
      PodSelector: app=frontend
  Not affecting egress traffic
  Policy Types: Ingress
Querying blocked traffic in Elasticsearch
Once Mermin is exporting enriched data, use Kibana to audit security events. Because Mermin provides the flow.tcp.flags.tags and flow.reverse.packets.delta attributes, you can write high-signal queries that isolate unauthorized access attempts.
To find "SYN" attempts blocked by policy (No response):
network.transport: "tcp" AND flow.tcp.flags.tags: "syn" AND flow.reverse.packets.delta: 0To detect unauthorized pods communicating with your database:
destination.k8s.pod.name: "mongodb-*" AND NOT source.k8s.pod.name: "frontend-*"To audit cross-namespace communication:
NOT source.k8s.namespace.name: "elastiflow" AND destination.k8s.namespace.name: "elastiflow"
By combining the kernel-level reality of eBPF with the intent defined in your Kubernetes Network Policies, you can ensure your cluster remains compliant without the guesswork associated with traditional log-based auditing.