Infrastructure December 5, 2025 3 min read

Achieving P99 Latency Targets at the Edge

Optimizing for tail latency requires bypassing the traditional OS networking stack and processing packets directly at the NIC driver level via eBPF.

Averages are the enemy of reliable distributed systems; optimizing for the mean hides the tail-latency penalties that a small but consistently affected share of requests pays. When an edge gateway must handle millions of concurrent connections, the traditional Linux networking stack introduces unpredictable delays from context switching, lock contention, and interrupt handling. Hitting single-digit-millisecond P99 targets means taking the hot path off the kernel’s standard socket API and handling packets as early as possible, at the network interface controller (NIC) driver level.

The Fallacy of Averages

A system might boast an average response time of 20 milliseconds, but if the P99 latency spikes to 800 milliseconds due to garbage collection pauses or kernel scheduling delays, one request in every hundred is functionally broken, and a user who issues many requests per session will almost certainly hit one of them. In microservice architectures the problem grows with fan-out: a request that touches many services is far more likely to encounter at least one slow response than any single call is, and one slow dependency can drag the entire request chain into a timeout. Reducing these spikes requires execution paths that avoid the scheduling and allocation variability of the general-purpose OS kernel.
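To make the gap between mean and tail concrete, here is a minimal sketch in C. The function names (`mean_ms`, `p99_ms`, `p_any_slow`) are illustrative and not from any library, the percentile uses the simple nearest-rank definition, and the fan-out estimate assumes that dependencies are slow independently of one another, which real systems only approximate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* qsort comparator for doubles, ascending. */
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Arithmetic mean of n latency samples, in milliseconds. */
double mean_ms(const double *samples, size_t n) {
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += samples[i];
    return sum / (double)n;
}

/* Nearest-rank P99: the smallest sample that at least 99% of the
 * samples do not exceed. Sorts the array in place. */
double p99_ms(double *samples, size_t n) {
    assert(n > 0);
    qsort(samples, n, sizeof samples[0], cmp_double);
    size_t rank = (99 * n + 99) / 100; /* ceil(0.99 * n), 1-based */
    return samples[rank - 1];
}

/* Probability that a request fanning out to `fanout` dependencies sees
 * at least one response slower than that dependency's own P99.
 * Assumption: slow responses are independent across dependencies. */
double p_any_slow(unsigned fanout) {
    double all_fast = 1.0;
    for (unsigned i = 0; i < fanout; i++) all_fast *= 0.99;
    return 1.0 - all_fast;
}
```

With 98 samples at 4 ms and 2 samples at 800 ms, `mean_ms` reports roughly 19.9 ms while `p99_ms` reports 800 ms. Under the independence assumption, `p_any_slow(100)` is about 0.63, so a request that fans out to 100 services hits a tail-latency response more often than not.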

Kernel Bypass and XDP

eXpress Data Path (XDP) is a programmable, high-performance packet processing framework inside the Linux kernel. Strictly speaking it is a fast path within the kernel rather than a full bypass: an eBPF (extended Berkeley Packet Filter) program attached at the NIC driver’s receive hook lets the gateway inspect, filter, and redirect packets before they reach the kernel’s networking stack and before an sk_buff structure is allocated. This early-drop mechanism allows the edge node to silently discard malicious SYN floods or malformed protocol headers for a small fraction of the per-packet CPU cost of the full stack, preserving resources for legitimate traffic.

Zero-Copy and User-Space Networking

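For packets that pass the XDP filter, achieving ultra-low latency requires moving the payload to user space without copying it. Modern edge proxies use zero-copy techniques in which packet buffers are shared between the NIC’s DMA path and the application’s memory; on Linux, AF_XDP sockets provide this for frames redirected by an XDP program, while DPDK bypasses the kernel entirely with user-space poll-mode drivers. io_uring is a different kind of tool: it is an asynchronous I/O interface to the kernel’s own stack that cuts system-call overhead rather than replacing the stack. Combined with strict CPU pinning and busy polling, these approaches sustain high throughput while removing most of the context-switching overhead that traditionally destroys P99 latency guarantees.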
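The core idea behind zero-copy handoff is that producer and consumer exchange small descriptors that point into a shared packet buffer, instead of copying payloads. The sketch below is a single-producer, single-consumer descriptor ring in portable C11. It is loosely modeled on the kind of rings AF_XDP and DPDK use but is not the API of either; every name in it (`spsc_ring`, `ring_push`, `ring_pop`, `frame_ptr`, `RING_SIZE`, `FRAME_SIZE`) is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE  8u     /* number of descriptor slots (illustrative) */
#define FRAME_SIZE 2048u  /* bytes per frame in the shared buffer (illustrative) */

static_assert((RING_SIZE & (RING_SIZE - 1)) == 0,
              "RING_SIZE must be a power of two");

/* A descriptor names a frame inside the shared buffer; no payload is copied. */
struct desc {
    uint32_t offset; /* byte offset of the frame in the shared buffer */
    uint32_t len;    /* number of valid bytes in the frame */
};

struct spsc_ring {
    _Atomic uint32_t head; /* advanced only by the producer */
    _Atomic uint32_t tail; /* advanced only by the consumer */
    struct desc slots[RING_SIZE];
};

/* Producer side: publish a descriptor. Returns false if the ring is full. */
static inline bool ring_push(struct spsc_ring *r, struct desc d) {
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SIZE) return false;
    r->slots[head & (RING_SIZE - 1)] = d;
    /* Release: the slot write becomes visible before the new head does. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side: take the oldest descriptor. Returns false if empty. */
static inline bool ring_pop(struct spsc_ring *r, struct desc *out) {
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail) return false;
    *out = r->slots[tail & (RING_SIZE - 1)];
    /* Release: the slot is handed back only after it has been read. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* The consumer reads the payload in place from the shared buffer. */
static inline uint8_t *frame_ptr(uint8_t *shared_buf, const struct desc *d) {
    return shared_buf + d->offset;
}
```

Because only 8-byte descriptors cross the ring, the cost of handing a packet to the application does not depend on the packet's size, and a consumer pinned to its own core can poll `ring_pop` in a loop without making a system call.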

// eBPF XDP program for early-drop packet filtering at the edge
// Runs in the NIC driver's receive path (native XDP), before an sk_buff is allocated

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>           // IPPROTO_TCP
#include <linux/ip.h>
#include <linux/tcp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>     // bpf_htons

SEC("xdp")
int xdp_drop_malformed_syn(struct xdp_md *ctx) {
    void *data_end = (void *)(long)ctx->data_end;
    void *data = (void *)(long)ctx->data;

    // Every header access must be bounds-checked against data_end,
    // otherwise the eBPF verifier rejects the program.
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end) return XDP_PASS;

    // Only untagged IPv4 is inspected; everything else goes to the stack
    if (eth->h_proto != bpf_htons(ETH_P_IP)) return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end) return XDP_PASS;

    if (ip->protocol != IPPROTO_TCP) return XDP_PASS;

    // IHL counts 32-bit words; a value below 5 is not a valid IPv4 header.
    // Leave such packets to the kernel stack's own validation.
    if (ip->ihl < 5) return XDP_PASS;

    struct tcphdr *tcp = (void *)ip + (ip->ihl * 4);
    if ((void *)(tcp + 1) > data_end) return XDP_PASS;

    // Drop TCP packets with invalid flag combinations (e.g., SYN+FIN)
    if (tcp->syn && tcp->fin) {
        return XDP_DROP; // Discarded in the driver, before the kernel stack sees it
    }

    return XDP_PASS; // Forward to kernel stack for legitimate traffic
}

char _license[] SEC("license") = "GPL";
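Filter logic like this is easier to trust when it can be exercised against hand-built frames without loading anything into the kernel. The following is a sketch of that idea: a plain user-space C function that applies the same sequence of checks to a byte buffer. `classify_frame`, `enum verdict`, and the constant names are hypothetical, header fields are read by fixed byte offset rather than through kernel structs, and the function includes a check that the IPv4 header length is at least 20 bytes, which is worth having in any variant of the filter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum verdict { VERDICT_PASS, VERDICT_DROP };

enum {
    ETH_HDR_LEN      = 14,     /* untagged Ethernet header */
    IPV4_MIN_HDR_LEN = 20,     /* IPv4 header without options */
    TCP_MIN_HDR_LEN  = 20,     /* TCP header without options */
    ETHERTYPE_IPV4   = 0x0800,
    IP_PROTO_TCP     = 6,
    TCP_FLAG_FIN     = 0x01,
    TCP_FLAG_SYN     = 0x02
};

/* Apply the early-drop policy to a raw Ethernet frame of `len` bytes.
 * Anything that is not a parseable IPv4/TCP packet is passed through. */
enum verdict classify_frame(const uint8_t *data, size_t len) {
    assert(data != NULL);

    if (len < ETH_HDR_LEN) return VERDICT_PASS;

    /* EtherType is big-endian at byte offset 12. */
    unsigned ethertype = ((unsigned)data[12] << 8) | data[13];
    if (ethertype != ETHERTYPE_IPV4) return VERDICT_PASS;

    if (len < ETH_HDR_LEN + IPV4_MIN_HDR_LEN) return VERDICT_PASS;
    const uint8_t *ip = data + ETH_HDR_LEN;

    /* Protocol number is at byte offset 9 of the IPv4 header. */
    if (ip[9] != IP_PROTO_TCP) return VERDICT_PASS;

    /* IHL is the low nibble of the first byte, in 32-bit words. */
    size_t ip_hdr_len = (size_t)(ip[0] & 0x0F) * 4;
    if (ip_hdr_len < IPV4_MIN_HDR_LEN) return VERDICT_PASS;

    if (len < ETH_HDR_LEN + ip_hdr_len + TCP_MIN_HDR_LEN) return VERDICT_PASS;
    const uint8_t *tcp = ip + ip_hdr_len;

    /* TCP flags live in byte 13 of the TCP header. */
    uint8_t flags = tcp[13];
    if ((flags & TCP_FLAG_SYN) && (flags & TCP_FLAG_FIN)) return VERDICT_DROP;

    return VERDICT_PASS;
}
```

A 54-byte frame with EtherType 0x0800, a first IPv4 byte of 0x45, protocol 6, and TCP flags 0x03 yields `VERDICT_DROP`; the same frame with flags 0x02 (a plain SYN) or truncated below 54 bytes yields `VERDICT_PASS`. A harness like this complements, and does not replace, testing the compiled eBPF object itself.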

Summary

Achieving strict P99 latency targets requires moving beyond proxies built on the standard socket path and embracing kernel fast-path and kernel-bypass technologies. By using eBPF and XDP to process packets at the NIC driver level, edge gateways can remove much of the OS-level jitter and neutralize volumetric abuse before it consumes significant compute resources. SRRRS utilizes advanced eBPF pipelines across its global edge network, ensuring deterministic, ultra-low latency processing for the most demanding distributed workloads.