
A Practitioner’s Guide to Troubleshooting Kubernetes with Mermin eBPF
February 10, 2026
The Kernel of the Connectivity Gap
In the era of ephemeral microservices, an IP address is a ghost and a standard trace is a half-truth. I’ve found that troubleshooting Kubernetes without eBPF is like trying to fix a car by looking only at the dashboard; you can see the engine light is on, but you can’t see the spark plugs. By bridging the "Network Gap" with Mermin Flow Traces, we move from guessing if a timeout is an application bug or a network policy drop to seeing the kernel-level reality. This guide explains how to use enriched metadata to turn opaque packet data into actionable service-to-service context, reducing MTTR (Mean Time to Resolution) by treating connectivity as a first-class observability pillar.
Why is Kubernetes network observability traditionally difficult?
Kubernetes network observability is difficult because traditional tools create a visibility gap between high-level application metrics and low-level packet data. Mermin bridges this gap by using eBPF to capture connection-level Flow Traces enriched with real-time Kubernetes metadata, enabling direct correlation between application latency and network events like TCP retransmissions or DNS timeouts.
While Kubernetes is the global standard for container orchestration, interpreting its complex and ephemeral traffic remains a significant obstacle for most engineering teams. Traditional APM traces provide service-level latency but lack the network-layer context required to diagnose if performance issues stem from infrastructure constraints or application code. By leveraging eBPF, Mermin provides transparent, low-overhead visibility into every connection, mapping network flows directly to pods, services, and namespaces. This granular data allows practitioners to move beyond simple metrics and see exactly how traffic moves across the cluster.
How does the "Network Gap" affect the MELT stack?
The network observability gap occurs when the standard MELT stack (Metrics, Events, Logs, and Traces) lacks the connection-level data needed to correlate application performance with network reality. While these pillars effectively monitor application logic, they frequently treat the network as an opaque black box. Consequently, engineers often struggle to determine if a slow trace span results from inefficient code or transient infrastructure issues like packet loss. Mermin resolves this by providing Flow Traces enriched with Kubernetes metadata to bridge the gap between infrastructure and application layers.
What information do Mermin Flow Traces provide for debugging?
Mermin Flow Traces provide bidirectional network visibility by capturing the Network 5-tuple, TCP state flags, and timing data at the kernel level. These traces are enriched with Kubernetes Metadata, such as pod names and namespaces, and exported as OpenTelemetry Spans. This data allows engineers to correlate network performance directly with specific microservices without modifying application code.
What are the core attributes of a Flow Trace?
A Flow Trace is a structured record that represents a bidirectional network conversation between two endpoints. Unlike simple metrics that only provide counters, Flow Traces include the state and context of the connection. Mermin captures these attributes using eBPF to ensure high-fidelity data collection with minimal system overhead.
Attribute Category | Key Data Points | Troubleshooting Value |
Network 5-Tuple | Source/Destination IP, Ports, Protocol | Identifies the specific communication path and transport protocol. |
Traffic Metrics | Byte/Packet Delta (Forward & Reverse) | Detects throughput imbalances, packet loss, or asymmetric routing. |
TCP State Tracking | SYN, FIN, RST, Connection State | Pinpoints exactly where a connection failed during the handshake or transfer. |
Flow Identity | Community ID, Flow Direction | Enables cross-tool correlation and identifies which side acted as the client. |
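As a small illustration of the cross-tool correlation value, the Community ID can be used to pivot from a packet capture or another flow tool straight to the matching Flow Trace. A minimal KQL sketch, assuming the spans are indexed in Elasticsearch; the Community ID value shown is the one from the example trace later in this guide:

flow.community_id: "1:LQU9qZlK+B+2dM2I2n1kI/M5a/g=" AND network.transport: "tcp"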
How does Kubernetes metadata enrich network data?
In a dynamic Kubernetes environment, raw IP addresses are insufficient for debugging because pods are ephemeral and frequently change their network identity. Mermin solves this by using informers to watch the Kubernetes API and map flows to specific workload identities. This enrichment transforms a "meaningless" IP-to-IP connection into a "meaningful" service-to-service interaction.
Pod Identity: Traces include the Pod name, namespace, and unique UID for precise targeting.
Workload Context: Flows are tagged with the owning Deployment, ReplicaSet, StatefulSet, or DaemonSet.
Service Mapping: Mermin identifies the specific Kubernetes Service that load-balanced the traffic to an endpoint.
Logical Grouping: Custom labels and annotations are preserved, allowing for filtering by environment (e.g., env=prod) or team.
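Once this enrichment is in place, queries can target workloads by identity rather than by IP. A hedged KQL sketch, assuming the spans are indexed in Elasticsearch under the pod and namespace attribute names used in the example trace below:

source.k8s.namespace.name: "production" AND destination.k8s.pod.name: "backend-*"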
What performance metrics are included for quality analysis?
Beyond basic connectivity, Mermin Flow Traces provide insights into the quality of the network path and the health of the application. By analyzing the timing and flags of packets at the kernel level, engineers can identify performance bottlenecks that are invisible to application-level monitoring.
Connection Integrity: TCP state flags (SYN, FIN, RST) reveal the health of the connection, helping to identify failed handshakes or mid-stream resets caused by infrastructure instability or security policies.
Connection Lifecycle: Flow start and end timestamps allow for the calculation of the exact duration of a network interaction.
Tunneling Details: Mermin provides visibility into encapsulated traffic, including VXLAN, Geneve, and WireGuard metadata used by various CNIs.
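As an example of the connection-integrity signal, flows that were reset mid-stream can be isolated with a short filter. A KQL sketch, assuming the flag values are indexed under the flow.tcp.flags.tags attribute referenced later in this guide:

network.transport: "tcp" AND flow.tcp.flags.tags: "rst"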
Example Flow Trace
Mermin exports data using the OpenTelemetry Protocol (OTLP), which is the industry standard for observability. The following JSON snippet represents a typical Flow Trace span. It demonstrates how network-level bytes and packets are integrated with Kubernetes pod names for a frontend-to-backend connection.
{
  "name": "flow_ipv4_tcp",
  "kind": "Client",
  "startTimeUnixNano": "1727149620000000000",
  "endTimeUnixNano": "1727149680000000000",
  "attributes": [
    { "key": "flow.community_id", "value": { "stringValue": "1:LQU9qZlK+B+2dM2I2n1kI/M5a/g=" } },
    { "key": "flow.direction", "value": { "stringValue": "forward" } },
    { "key": "flow.bytes.delta", "value": { "intValue": "1024" } },
    { "key": "flow.reverse.bytes.delta", "value": { "intValue": "32768" } },
    { "key": "flow.packets.delta", "value": { "intValue": "10" } },
    { "key": "flow.reverse.packets.delta", "value": { "intValue": "85" } },
    { "key": "source.address", "value": { "stringValue": "10.1.1.5" } },
    { "key": "source.port", "value": { "intValue": "54211" } },
    { "key": "source.k8s.pod.name", "value": { "stringValue": "frontend-abcde" } },
    { "key": "source.k8s.namespace.name", "value": { "stringValue": "production" } },
    { "key": "destination.address", "value": { "stringValue": "10.1.2.10" } },
    { "key": "destination.port", "value": { "intValue": "80" } },
    { "key": "destination.k8s.pod.name", "value": { "stringValue": "backend-xyz" } },
    { "key": "network.transport", "value": { "stringValue": "tcp" } },
    { "key": "network.type", "value": { "stringValue": "ipv4" } }
  ]
}
This record shows a 60-second observation window in which the frontend pod downloaded significantly more data (32 KB) than it sent (1 KB). The "Client" span kind confirms that the frontend initiated this specific connection. With this data, a practitioner can quickly verify whether a "slow backend" report is due to actual backend latency or a throughput bottleneck at the network layer. I recommend consulting the documentation on Mermin network semantic conventions for a complete reference of these attributes and their definitions.
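To pull this specific conversation up in Kibana, a query on the enriched pod names is enough; no IP addresses are required. A minimal KQL sketch using the pod names from the example span above:

source.k8s.pod.name: "frontend-abcde" AND destination.k8s.pod.name: "backend-xyz"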
How to identify TCP connection failures in Kubernetes?
Identifying TCP connection failures requires analyzing TCP state flags and connection lifecycles within Mermin Flow Traces. By isolating flows with a SYN flag but no corresponding ACK, or identifying abrupt RST (Reset) flags, engineers can distinguish between service timeouts, network policy blocks, and application crashes. Mermin exports these signals as OpenTelemetry Spans, making them searchable in backends like Grafana Tempo or Elasticsearch.
What are the common TCP failure signatures?
When troubleshooting a distributed system, network failures typically fall into three categories: handshakes that never complete, connections that are forcibly terminated, and packets that simply disappear. By mapping these to specific Flow Trace attributes, practitioners can rapidly determine the root cause of a connectivity issue.
Failure Mode | Attribute Pattern | Probable Cause |
Connection Timeout | flow.tcp.flags.tags: ["syn"] (and not "ack") | Traffic blocked by a Network Policy, Security Group, or Firewall. |
Connection Refused | flow.tcp.flags.tags: "rst" (with short flow duration) | The target process is not listening on the port, the pod is crashing, or the service is down. |
Idle Timeout | flow.end_reason: "idle timeout" | A Load Balancer or Service Mesh proxy closed the connection because no data was sent within the expected window. |
Slow Handshake | High flow.tcp.handshake.latency | Significant network congestion or resource contention on the source or destination node. |
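These signatures translate directly into saved searches. The KQL sketches below are hedged examples: they assume the flow.end_reason and flow.tcp.handshake.latency attributes are indexed exactly as named in the table above, and the handshake threshold is illustrative rather than a recommended value.

To find handshakes that never completed:

flow.tcp.flags.tags: "syn" AND NOT flow.tcp.flags.tags: "ack"

To find connections closed by an idle timeout:

flow.end_reason: "idle timeout"

To find slow handshakes (assuming the latency value is stored in milliseconds):

flow.tcp.handshake.latency > 500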
Analyzing packet count disparity
One of the most effective ways to identify network-layer issues in your Elasticsearch data is to compare the bidirectional packet counters. In a healthy TCP handshake, you expect to see packets in both the flow.packets.delta and flow.reverse.packets.delta fields. If the reverse count is zero while the forward count is greater than one, the client is stuck in the "SYN-SENT" state, retrying a connection that the destination is ignoring.
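A hedged KQL sketch of this check, using the counters described above (the greater-than-one threshold filters out single-SYN flows that may simply still be in progress):

network.transport: "tcp" AND flow.packets.delta > 1 AND flow.reverse.packets.delta: 0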
How to interpret a failed Flow Trace in a dashboard
Visualizing Mermin data in a tool like Kibana or Grafana provides a chronological view of the connection. When a failure occurs, the Flow Trace span will often display specific characteristics that indicate where the communication broke down.
Duration Analysis: A flow with a very short duration and an immediate RST flag suggests a "Connection Refused" error, meaning the packet reached the node but no process was listening.
Packet Count Disparity: If the flow.packets.delta shows multiple packets sent but flow.reverse.packets.delta is zero, the client is retrying the handshake because the server is not responding (typical of a Network Policy drop).
Span Kind Context: Use the SPAN_KIND_CLIENT attribute to verify which pod initiated the request. If a "reverse" reset is seen, the server-side pod or its sidecar proxy is explicitly rejecting the traffic.
By combining these visual cues with structured queries, platform engineers can reduce MTTR for complex networking issues that would otherwise require deep-dive packet captures.
How to debug Kubernetes DNS resolution with eBPF?
Debugging Kubernetes DNS resolution requires analyzing UDP Flow Traces directed at port 53 to identify timeouts, latency, or configuration-driven query volume. By capturing these ephemeral interactions at the kernel level, Mermin provides visibility into the communication between application pods and CoreDNS, allowing engineers to use bidirectional packet counters to distinguish DNS server saturation from client-side misconfigurations such as ndots overhead.
Identifying DNS failure signatures in Flow Traces
In Kubernetes, DNS issues typically manifest as intermittent application timeouts. Because DNS primarily utilizes the UDP protocol, there are no TCP-style flags to inspect. Instead, practitioners must analyze the bidirectional packet counters and the duration of the OTLP span to diagnose resolution bottlenecks.
DNS Issue | Flow Trace Attribute Signature | Root Cause |
DNS Timeout | flow.packets.delta > 0 AND flow.reverse.packets.delta: 0 | CoreDNS is not responding, or a Network Policy is blocking UDP/53. |
DNS Latency | High span duration for destination.port: 53 | CoreDNS pods are CPU-throttled or experiencing high request volume. |
ndots Overhead | High frequency of unique flow.community_id from one source | The ndots:5 default is causing too many redundant queries for external names. |
Resolution Failure | destination.address is external (No K8s Metadata) | The pod is bypassing internal DNS or failing to resolve a service name. |
Configuring Mermin for DNS Visibility
To effectively debug DNS, Mermin must be configured to monitor the interfaces where DNS traffic traverses. This includes veth pairs for pod-to-pod traffic and tunnel interfaces for inter-node communication. The following HCL configuration ensures Mermin captures traffic on the necessary interfaces and exports the resulting spans to Elasticsearch.
# Mermin configuration for DNS-ready observability
discovery "instrument" {
  interfaces = [
    "veth*",    # Standard Pod-to-Pod traffic
    "flannel*", # Flannel CNI overlays
    "cali*",    # Calico CNI interfaces
    "cilium_*"  # Cilium eBPF-based interfaces
  ]
}

export "traces" {
  otlp = {
    endpoint = "http://otel-collector.monitoring:4317"
    protocol = "grpc"
  }
}
Querying DNS flows with KQL in Elasticsearch
Once DNS traffic is indexed in Elasticsearch, you can use Kibana Query Language (KQL) to isolate the relationship between your application pods and the kube-dns service. Because Mermin enriches traces with Kubernetes metadata, you can identify the specific source pod causing a surge in DNS traffic without knowing its IP address.
To find all unresponsive DNS queries (Potential Timeouts):
network.transport: "udp" AND destination.port: 53 AND flow.packets.delta > 0 AND flow.reverse.packets.delta: 0
To isolate DNS traffic originating from a specific microservice:
source.k8s.pod.name: "frontend-*" AND destination.port: 53
To detect "ndots" overhead (Identifying search domain traversal):
Search for a high volume of unique flow.community_id values within a short time window originating from a single source.k8s.pod.name. While Mermin does not inspect the DNS packet payload, a surge in distinct flows to port 53 from one pod is a strong signature of a client traversing its DNS search path (e.g., trying google.com.svc.cluster.local before google.com).
Optimizing DNS Performance with Flow Data
By analyzing the volume and duration of DNS Flow Traces, platform engineers can make data-driven decisions to improve cluster stability:
Scale CoreDNS: If span durations for port 53 are trending upward cluster-wide, increase the number of CoreDNS replicas to handle the load.
Identify Heavy Talkers: Use Mermin metadata to find the specific Deployments generating the highest DNS query volume and implement a NodeLocal DNSCache to reduce cross-node traffic.
Audit External Traffic: If you see frequent port 53 flows to external destination addresses that lack Kubernetes metadata, your pods may be bypassing internal DNS, potentially leading to security or routing issues.
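The external-traffic audit in the last point can be expressed as a query. A minimal KQL sketch, assuming that flows to destinations outside the cluster carry no destination-side Kubernetes metadata:

network.transport: "udp" AND destination.port: 53 AND NOT destination.k8s.pod.name: *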
How to verify Kubernetes Network Policies with Flow Traces?
Verifying Kubernetes Network Policies requires comparing your declarative security rules against the real-time Flow Traces captured by Mermin. By analyzing kernel-level eBPF data, engineers can confirm if a policy is silently dropping unauthorized traffic (seen as a "syn" flag with no response) or if misconfigured labels are allowing unintended access. Mermin enriches these traces with NetworkPolicy metadata, providing definitive proof of enforcement that traditional metrics lack.
How does eBPF detect policy enforcement?
Traditional Kubernetes auditing often relies on log-based reporting that can be delayed or lack connection-level context. Mermin uses eBPF to capture every connection attempt at the network interface level. This allows platform engineers to see unauthorized connection attempts that are dropped by the CNI before they ever reach the application container.
By inspecting the flow.tcp.flags.tags array and bidirectional packet counters, you can distinguish between a successful connection and a policy-driven drop. A Network Policy configured to "Deny" traffic typically results in a client pod sending multiple SYN packets while the flow.reverse.packets.delta remains at zero. Mermin documents these attempts as Flow Traces, tagging them with the specific source pod and destination namespace involved in the violation.
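For example, to audit dropped connection attempts into a single sensitive namespace, a hedged KQL sketch (the "payments" namespace name is hypothetical; the attribute names follow the signatures discussed below):

destination.k8s.namespace.name: "payments" AND network.transport: "tcp" AND flow.tcp.flags.tags: "syn" AND flow.reverse.packets.delta: 0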
What are the common policy validation signatures?
Validating a NetworkPolicy involves looking for specific packet and flag patterns in your observability backend. The following table illustrates how different security outcomes appear within Mermin's data model.
Security Scenario | Flow Trace Attribute Signature | Troubleshooting Conclusion |
Correct Deny Rule | flow.tcp.flags.tags: ["syn"] AND flow.reverse.packets.delta: 0 | The policy is successfully dropping packets; the destination is invisible to the source. |
Misconfigured Allow | Bidirectional packets between pods in different namespaces | A label selector is too broad or a policy is missing, allowing unauthorized access. |
Policy Bypass | Flow exists with source.address but no Kubernetes metadata | The pod is likely using hostNetwork: true, bypassing standard CNI-level Network Policies. |
Active Rejection | flow.tcp.flags.tags: ["syn"] followed by flow.reverse.tcp.flags.tags: ["rst"] | The CNI is actively rejecting the connection (sending a Reset) rather than silently dropping it. |
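The "Active Rejection" pattern in particular is easy to surface as a saved search. A KQL sketch, assuming the reverse-direction flag values are indexed under the flow.reverse.tcp.flags.tags attribute named in the table above:

network.transport: "tcp" AND flow.tcp.flags.tags: "syn" AND flow.reverse.tcp.flags.tags: "rst"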
Configuring Mermin for Network Policy Visibility
To verify security rules, Mermin must be configured to watch the Kubernetes API for NetworkPolicy resources and associate them with specific flows. The following HCL configuration ensures the k8s_decorator has the necessary metadata to enrich every network interaction with policy context.
# Mermin configuration for Network Policy auditing
discovery "instrument" {
  interfaces = [
    "veth*",
    "cali*",
    "cilium_*",
    "flannel*",
    "lxc*"
  ]
}

# Full Kubernetes metadata enrichment
discovery "informer" "k8s" {
  selectors = [
    { kind = "Pod" },
    { kind = "Service" },
    { kind = "Namespace" },
    { kind = "NetworkPolicy" } # Essential for policy auditing
  ]
}

attributes "source" "k8s" {
  extract {
    metadata = [
      "[*].metadata.name",
      "[*].metadata.namespace",
      "pod.metadata.uid"
    ]
  }
  association {
    pod = {
      sources = [
        { from = "source.ip", to = ["status.podIP", "status.podIPs[*]", "status.hostIP", "status.hostIPs[*]"] },
        { from = "source.port", to = ["spec.containers[*].ports[*].containerPort", "spec.containers[*].ports[*].hostPort"] },
        { from = "network.transport", to = ["spec.containers[*].ports[*].protocol"] }
      ]
    }
    networkpolicy = {
      sources = [
        { from = "source.ip", to = ["spec.ingress[*].from[*].ipBlock.cidr", "spec.egress[*].to[*].ipBlock.cidr"] },
        { from = "source.port", to = ["spec.ingress[*].ports[*].port", "spec.egress[*].ports[*].port"] },
        { from = "network.transport", to = ["spec.ingress[*].ports[*].protocol", "spec.egress[*].ports[*].protocol"] }
      ]
    }
  }
}

attributes "destination" "k8s" {
  extract {
    metadata = [
      "[*].metadata.name",
      "[*].metadata.namespace",
      "pod.metadata.uid"
    ]
  }
  association {
    pod = {
      sources = [
        { from = "destination.ip", to = ["status.podIP", "status.podIPs[*]", "status.hostIP", "status.hostIPs[*]"] },
        { from = "destination.port", to = ["spec.containers[*].ports[*].containerPort", "spec.containers[*].ports[*].hostPort"] },
        { from = "network.transport", to = ["spec.containers[*].ports[*].protocol"] }
      ]
    }
    networkpolicy = {
      sources = [
        { from = "destination.ip", to = ["spec.ingress[*].from[*].ipBlock.cidr", "spec.egress[*].to[*].ipBlock.cidr"] },
        { from = "destination.port", to = ["spec.ingress[*].ports[*].port", "spec.egress[*].ports[*].port"] },
        { from = "network.transport", to = ["spec.ingress[*].ports[*].protocol", "spec.egress[*].ports[*].protocol"] }
      ]
    }
  }
}

export "traces" {
  otlp = {
    endpoint = "otel-collector.monitoring:4317"
    protocol = "grpc"
  }
}
How to correlate Flow Traces with kubectl data?
To confirm that a blocked flow is governed by a specific rule, correlate the source.k8s.pod.name or destination.k8s.namespace.name from the Flow Trace with your active policies. Use kubectl to list and describe the policies applied to the affected namespace to verify if the pod labels match the podSelector defined in your YAML.
List active Network Policies in a target namespace:
kubectl get networkpolicy -n elastiflow
Example Output:
NAME                   POD-SELECTOR   AGE
allow-frontend-to-db   app=mongodb    8m57s
Describe a specific policy to verify the ingress/egress logic:
kubectl describe networkpolicy allow-frontend-to-db -n elastiflow
Example Output:
Name:         allow-frontend-to-db
Namespace:    elastiflow
Created on:   2026-01-28 12:37:24 -0300 -03
Labels:       app=database
Annotations:  <none>
Spec:
  PodSelector:     app=mongodb
  Allowing ingress traffic:
    To Port: <any> (traffic allowed to all ports)
    From:
      PodSelector: app=frontend
  Not affecting egress traffic
  Policy Types: Ingress
Querying blocked traffic in Elasticsearch
Once Mermin is exporting enriched data, use Kibana to audit security events. Because Mermin provides the flow.tcp.flags.tags and flow.reverse.packets.delta attributes, you can write high-signal queries that isolate unauthorized access attempts.
To find "SYN" attempts blocked by policy (No response):
network.transport: "tcp" AND flow.tcp.flags.tags: "syn" AND flow.reverse.packets.delta: 0To detect unauthorized pods communicating with your database:
destination.k8s.pod.name: "mongodb-*" AND NOT source.k8s.pod.name: "frontend-*"To audit cross-namespace communication:
NOT source.k8s.namespace.name: "elastiflow" AND destination.k8s.namespace.name: "elastiflow"
By combining the kernel-level reality of eBPF with the intent defined in your Kubernetes Network Policies, you can ensure your cluster remains compliant without the guesswork associated with traditional log-based auditing.