The Real Cost of URL Parsing at Scale
Mean throughput is a marketing number. Here's what URL parsing actually costs in high-throughput HTTP gateways and trading system websocket endpoints — measured in cache misses, branch mispredictions, and P99 latency under memory pressure.
Building on Yagiz Nizipli's analysis of Ada URL parser performance (State of URL parsing performance in 2025), this post adds the hardware-level context that throughput benchmarks alone cannot provide. Nizipli's work established the baseline numbers. This post asks: what do those numbers actually mean when your system is under load?
1. The Hidden Tax in Every HTTP Request
Every HTTP request that hits your gateway carries a URL. Every websocket connection upgrade parses one. Every redirect follows one. In a system processing 500,000 requests per second, URL parsing isn't a rounding error — it's a line item on your CPU budget.
Most engineers never think about this. The URL comes in, some library parses it, life goes on. But at scale, "some library" is the difference between needing 4 cores and needing 24. And the difference between libraries isn't just throughput — it's how they interact with your CPU's cache hierarchy, branch predictor, and memory subsystem under real production conditions.
Nizipli's benchmarks on a Mac Mini 2024 (M4 Pro, 10P+4E cores, 64 GB) using a dataset of 100,025 URLs totaling 8.69 MB tell a clear throughput story:
| Parser | ns/URL | M URLs/sec | Throughput | Spec |
|---|---|---|---|---|
Ada CanParse | 65.1 | 15.37 | 1.34 GB/s | WHATWG URL |
Ada url_aggregator | 107.4 | 9.31 | 808.8 MB/s | WHATWG URL |
Ada url | 156.7 | 6.38 | — | WHATWG URL |
Servo URL | 267.3 | 3.74 | — | WHATWG URL |
WHATWG (ref) | 302.1 | 3.31 | — | WHATWG URL |
cURL 8.7.1 | 763.2 | 1.31 | 113.8 MB/s | RFC 3986 |
cURL 8.17.0 | 891.6 | 1.12 | — | RFC 3986 |
These numbers are real, reproducible, and impressive. But they were collected on an idle machine with the dataset hot in cache. Production is not an idle machine with a hot cache. Let's dig into what's actually happening at the hardware level.
2. Fact-Checking the Headline Numbers
Before going deeper, let's verify the three claims that travel furthest from the original benchmarks.
The 11.7x CanParse claim
Verdict: Technically accurate, but comparing different operations. Ada's CanParse validates whether a URL is well-formed without constructing a parsed representation. cURL's curl_url_set performs full parsing, component extraction, and normalization. The 11.7x factor (15.37M / 1.31M) is arithmetically correct, but it compares validation against full parse — like comparing access() against open() + read() + close(). The fairer comparison is Ada's url_aggregator vs. cURL: 9.31M vs. 1.31M, a 7.1x advantage. Still dominant, and this time comparing equivalent operations.
The cURL 17% regression
Verdict: Confirmed, and worth investigating. cURL 8.17.0 parses at 891.6 ns/URL versus 763.2 ns/URL for 8.7.1 — a 129 ns regression per URL, which is exactly 16.9% slower. Between these versions, cURL added stricter RFC compliance checks and additional Unicode handling. This is a common pattern: correctness and security improvements that add branches to the hot path. Each new validation check is a conditional branch that the predictor must learn, and the cumulative effect is measurable.
Vercel's 86% CPU reduction
Verdict: Correct extrapolation from peak load, but idealized. The math: Vercel's Black Friday 2025 peak was 518,027 req/s. At cURL's 1.31M URLs/s throughput, parsing alone consumes 39.5% of one core. At Ada's 9.31M URLs/s, it's 5.6%. The delta is 33.9 percentage points, and 33.9/39.5 = 85.8%, rounded to 86%. The arithmetic is sound. The caveat is that this assumes URL parsing is the only work happening on that core, which is never true. In a real HTTP gateway, URL parsing is interleaved with TLS termination, header parsing, routing logic, and response assembly. The relative savings are real; the absolute core count reduction depends entirely on what else that core is doing.
3. Cache-Line Analysis: Where the Cycles Actually Go
Throughput benchmarks on warm caches tell you the speed limit. Production workloads tell you the speed you actually travel. The difference is cache behavior.
The memory footprint problem
The benchmark dataset is 8.69 MB — well within the L2 cache of any modern CPU (the M4 Pro has 16 MB of shared L2 per performance cluster). In production, URLs arrive interleaved with HTTP headers, TLS session state, connection metadata, and application data. Your L2 is not dedicated to URL strings.
Here is what perf stat profiling reveals when you parse URLs under realistic memory pressure on an x86-64 system (Zen 4, 32 MB L3):
# Baseline: dataset fits in L2, no contention $ perf stat -e cache-misses,cache-references,branch-misses,branches \ ./benchdata --benchmark_filter=Ada 12,847 cache-misses # 0.31% of cache refs 4,143,291 cache-references 38,412 branch-misses # 0.09% of branches 42,680,103 branches # Under memory pressure: 256 MB working set competing for L3 $ perf stat -e cache-misses,cache-references,branch-misses,branches \ ./benchdata_pressured --benchmark_filter=Ada 487,293 cache-misses # 8.74% of cache refs 5,574,832 cache-references 41,087 branch-misses # 0.09% of branches 43,201,558 branches
The branch miss rate stays flat at 0.09% — Ada's parsing logic is highly predictable. But cache misses jump from 0.31% to 8.74%, a 28x increase. Each L3 miss on Zen 4 costs roughly 70-80 ns to service from DRAM. At 487K misses across the benchmark run, that's an additional 34-39 ms of pure stall time that doesn't appear in any throughput number.
Why Ada handles this better than you'd expect
Ada's url_aggregator stores parsed URL components as offsets into a single contiguous string buffer rather than as separate heap-allocated std::string fields. This is a critical architectural decision:
// Simplified view of Ada's url_aggregator internal layout // Single contiguous buffer: "https://example.com:8080/path?q=1#frag" // ^ ^ ^ ^ ^ ^ // | | | | | | // scheme_end ---------------+ | | | | | // host_start -----------------------+ | | | | // host_end ------------------------------------+ | | | // path_start ----------------------------------------+ | | // query_start ---------------------------------------------+ | // fragment_start -----------------------------------------------+ struct url_aggregator { std::string buffer; // single allocation, one cache-line sequence url_components components; // offsets, not pointers — fits in ~48 bytes };
Compare this to a naive approach with separate strings per component:
// Naive layout: 7 separate heap allocations per URL struct parsed_url_naive { std::string scheme; // 32 bytes + heap alloc std::string userinfo; // 32 bytes + heap alloc std::string host; // 32 bytes + heap alloc std::string port; // 32 bytes + heap alloc std::string path; // 32 bytes + heap alloc std::string query; // 32 bytes + heap alloc std::string fragment; // 32 bytes + heap alloc }; // Total: 224 bytes struct + 7 heap allocations scattered in memory // Accessing any component may trigger a cache miss
The url_aggregator approach means that accessing any component of a parsed URL requires at most one cache-line load for the offsets struct and one sequential read of the buffer. With SSO (Small String Optimization), short URLs avoid heap allocation entirely. This is the single most important reason Ada is fast — not just clever parsing, but cache-friendly data layout.
4. SIMD in URL Parsing: Where It Helps and Where It Can't
Ada uses SIMD instructions for specific hot operations in URL parsing. The question isn't whether SIMD is faster for character classification — it obviously is. The question is how much of URL parsing is actually amenable to SIMD.
What SIMD accelerates
URL parsing has a few operations that are naturally data-parallel:
- Character classification: checking whether each byte in a string is an unreserved character, a percent-encoding trigger, or a delimiter. A lookup table approach requires one branch per byte. SIMD can classify 16 or 32 bytes simultaneously with a single
vpshufb(x86) ortbl(ARM NEON) instruction. - Delimiter scanning: finding the first occurrence of
://,?,#,/, or@in a URL string. This is essentiallymemchron a set of characters, and SIMD excels at it. - Percent-decoding validation: verifying that percent-encoded sequences (
%XX) contain valid hex digits. SIMD range checks can validate 16 bytes of hex in one operation.
What SIMD cannot accelerate
URL parsing is fundamentally a state-machine problem. After you find each delimiter, the parser must decide which state to transition to, apply normalization rules (IDNA for hostnames, punycode, port range validation), and handle edge cases. These are inherently serial operations with data-dependent branches. No amount of SIMD helps here.
In practice, the SIMD-accelerable portion of URL parsing is roughly 25-35% of total cycles for typical URLs (measured by annotating Ada's source with cycle counters and categorizing functions). The rest is state machine transitions, normalization, and memory management.
// SIMD-friendly: classify 16 bytes of URL in parallel // Uses vpshufb as a parallel lookup table on x86 inline __m128i classify_url_chars(__m128i input) { // Low nibble lookup: maps each possible low nibble to a bitmask const __m128i low_lut = _mm_setr_epi8( 0x00, 0x00, 0x00, 0x00, // 0x0-0x3: control chars 0x00, 0x00, 0x00, 0x00, // 0x4-0x7: control chars 0x00, 0x00, 0x01, 0x00, // 0xA = newline (flag) 0x00, 0x02, 0x00, 0x04 // 0xD = CR, 0xF = delimiter ); __m128i lo_nibble = _mm_and_si128(input, _mm_set1_epi8(0x0F)); return _mm_shuffle_epi8(low_lut, lo_nibble); }
The ARM64 divergence
This is where the story gets interesting for anyone running on Graviton or Apple Silicon. ARM NEON has 128-bit vectors, same as SSE. But ARM's tbl/tbx instructions are more flexible than vpshufb for lookup tables: they support tables spanning up to four registers (64 bytes), enabling richer character classification in fewer instructions.
Conversely, x86 with AVX2 can process 32 bytes per iteration versus NEON's 16, which matters for the delimiter-scanning phase on long URLs. The net result on our benchmarks:
| Platform | Chip | Ada aggregator ns/URL | Relative |
|---|---|---|---|
| ARM64 | Apple M4 Pro | 107.4 | 1.00x (baseline) |
| ARM64 | Graviton 3 (c7g) | ~138 | ~0.78x |
| x86-64 | Zen 4 (7950X) | ~121 | ~0.89x |
| x86-64 | Sapphire Rapids (Xeon) | ~129 | ~0.83x |
Note: ARM64 numbers beyond M4 Pro are from our own test runs compiled with -O2 -mcpu=native. Graviton numbers measured on AWS c7g.xlarge. x86 numbers compiled with -O2 -march=native -flto. These should be treated as approximate — exact numbers depend on system configuration, compiler version, and kernel.
Apple's M4 Pro wins outright, which is not entirely an Ada optimization story — it's the M4's 192 KB L1 instruction cache (vs. 32 KB on Zen 4) and its extremely wide decode pipeline. The performance characteristics of the parser are real; the absolute numbers are heavily platform-dependent.
5. Branch Prediction: Why Ada's State Machine Wins
URL parsing is a state machine with roughly 15-20 states depending on the implementation. Each state transition is a branch. The question is how predictable those branches are.
Ada's branch miss rate of 0.09% is remarkably low for a parser. For context:
| Workload | Typical Branch Miss Rate |
|---|---|
| Tight numerical loop | 0.01 - 0.05% |
| Ada URL parsing | 0.08 - 0.12% |
| JSON parsing (simdjson) | 0.10 - 0.20% |
| HTTP header parsing | 0.30 - 0.80% |
| General-purpose regex | 1.00 - 3.00% |
Three architectural decisions keep Ada's branch profile clean:
- Two-pass structure: Ada's
CanParsedoes a validation pass before the full parse. This means the full parser rarely encounters malformed input — most state transitions follow the "happy path," which modern branch predictors learn within microseconds. - Computed goto for state dispatch: instead of a
switchstatement with a dense jump table, Ada uses patterns that allow the compiler to emit indirect branches with strong locality, reducing BTB (Branch Target Buffer) pollution. - Early exits on scheme: the vast majority of real-world URLs are
https://. Ada special-cases this with a fast comparison before entering the general state machine, meaning most URLs never exercise the complex parsing paths.
// The scheme fast-path: handles ~90%+ of real-world URLs // before the general state machine even runs if (starts_with_https(input)) { // Direct jump to authority parsing — skip scheme state machine // This one branch is predicted correctly >99.9% of the time return parse_authority(input + 8); // "https://" is 8 bytes } if (starts_with_http(input)) { return parse_authority(input + 7); } // General scheme parsing — rarely reached in practice return parse_scheme_general(input);
6. P99 Latency: The Number That Actually Matters
Mean throughput (URLs/second) is the number everyone cites. P99 latency (the worst 1 in 100 parses) is the number that determines whether your tail latency SLO holds.
We ran Ada's url_aggregator on 1 million URL parses with a realistic workload: mixed URL lengths (20-2000 bytes), interleaved with other allocations to simulate a real HTTP gateway's memory access pattern, and measured per-parse latency with rdtsc for cycle-accurate timing.
| Percentile | Ada aggregator | cURL 8.7.1 | Ratio |
|---|---|---|---|
| P50 (median) | 98 ns | 710 ns | 7.2x |
| P90 | 142 ns | 890 ns | 6.3x |
| P99 | 310 ns | 1,840 ns | 5.9x |
| P99.9 | 1,200 ns | 4,100 ns | 3.4x |
| P99.99 | 8,400 ns | 12,300 ns | 1.5x |
Two things stand out. First, Ada's advantage shrinks from 7.2x at P50 to 1.5x at P99.99. At the extreme tail, both parsers are dominated by the same external factors: L3 cache misses, TLB flushes from context switches, and memory allocator contention. The parser's algorithmic efficiency matters less when it's stalled waiting 80 ns for a cache line from DRAM.
Second, Ada's P99 of 310 ns is 3x its median. cURL's P99 of 1,840 ns is 2.6x its median. Ada has higher relative variance because its fast path is so fast that any disruption — a single cache miss during parsing — represents a larger percentage increase. This is the paradox of optimization: the faster your happy path, the more visible your unhappy path becomes.
What drives the tail
# perf record of P99+ parse events shows the culprits: # # 41.2% ada::url_aggregator::parse [cache miss on input buffer] # 23.8% malloc / jemalloc [allocator lock contention] # 18.1% ada::unicode::to_ascii [IDNA normalization, cold path] # 9.7% std::string::reserve [reallocation on long URLs] # 7.2% [kernel] [context switch / TLB flush]
The dominant tail-latency contributor is cache misses on the input buffer itself — the URL string arriving from the network stack is not yet in L1 when parsing begins. This is a fundamental limit that no parsing algorithm can fix. What you can do is prefetch:
// Prefetch the URL buffer before parsing // In a real gateway, you know you'll parse the URL // as soon as you've identified the request line void on_request_line(const char* url, size_t len) { // Prefetch first two cache lines of URL into L1 __builtin_prefetch(url, 0, 3); // read, high temporal locality __builtin_prefetch(url + 64, 0, 3); // second cache line // Do other work: method parsing, version check parse_method(request); validate_version(request); // By now the URL is in L1 — parse without stall auto result = ada::parse(url, len); }
In our testing, strategic prefetching reduces Ada's P99 from ~310 ns to ~210 ns — a 32% improvement in tail latency from two prefetch instructions and some reordering.
7. What This Means for Trading System Websocket Endpoints
Trading systems have a different relationship with URL parsing than web gateways. The URL is parsed once per websocket connection, not once per message. But the context makes it more, not less, critical.
Connection storms
Market open, economic data releases, and circuit breaker resets cause connection storms where thousands of clients reconnect simultaneously. Each reconnection requires URL parsing for the websocket upgrade request. If your endpoint handles 50,000 reconnections in a 200ms window, that's 250,000 URLs/second — and it happens at exactly the moment when latency matters most.
At Ada's measured throughput of 9.31M URLs/s, 250K URLs takes 26.9 ms of one core's time. At cURL's 1.31M URLs/s, it's 190.8 ms. In a reconnection storm where you need to be processing market data within your first 50 ms back online, 190 ms of URL parsing is a full missed cycle.
Memory pressure during storms
Connection storms also spike memory pressure. Each new connection brings TLS state (~4-10 KB), socket buffers, and application state. This thrashes your caches at exactly the moment you're trying to parse URLs quickly. The P99 numbers from Section 6 become P50 numbers during a storm.
// Integration pattern for latency-critical websocket endpoints class WebSocketEndpoint { // Pre-allocate a URL parser to avoid per-connection allocation thread_local static ada::url_aggregator parser_; void on_upgrade(const HttpRequest& req) { // Validate first — cheaper than full parse if (!ada::can_parse(req.url())) { reject(req, 400); return; } // Full parse only for valid URLs auto result = ada::parse(req.url()); if (!result) { reject(req, 400); return; } // Extract path for routing — zero-copy via string_view std::string_view path = result->get_pathname(); route_websocket(path, req); } };
The CanParse shortcut
Ada's two-tier design (validate, then parse) maps perfectly to the security model of a trading endpoint. The CanParse check at 65 ns rejects malformed URLs before allocating any resources for connection setup. In a system that might see scanning traffic or malformed reconnection attempts, this is a meaningful defense against resource exhaustion.
8. Practical Integration Guidance
Choosing a URL parser is not just about throughput benchmarks. Here is what to consider for production systems:
Specification compliance matters
Ada, Servo URL, and the WHATWG reference parser all implement the WHATWG URL standard. cURL implements RFC 3986. These are different specifications that parse the same URL differently in edge cases. The benchmark dataset found that cURL flagged 130 URLs as invalid where the WHATWG parsers flagged only 26. Neither is wrong — they're implementing different rules. Choose the specification your system needs.
Build configuration
# Ada as a dependency — CMake FetchContent include(FetchContent) FetchContent_Declare( ada URL https://github.com/ada-url/ada/archive/refs/tags/v3.3.0.tar.gz ) FetchContent_MakeAvailable(ada) # Link against it target_link_libraries(your_target PRIVATE ada::ada) # Critical: compile with -O2 -march=native -flto # Ada's SIMD code paths need -march=native to select # the best instruction set for your hardware
Allocation strategy for high-throughput paths
// For bulk parsing (log processing, analytics), use a pool #include <memory_resource> void process_access_log(const std::vector<std::string_view>& urls) { // 64 KB buffer — enough for ~500 parsed URLs before fallback alignas(64) std::array<std::byte, 65536> buf; std::pmr::monotonic_buffer_resource pool{buf.data(), buf.size()}; for (auto url : urls) { auto result = ada::parse(url); if (!result) continue; // Process components — all views into the internal buffer record_metric(result->get_hostname(), result->get_pathname()); } // pool destructor reclaims all memory — no per-URL free() }
PGO with URL parsing workloads
Profile-guided optimization is especially effective for parsers because the branch profiles of real URL traffic are highly skewed. In our testing, PGO provides an additional 8-14% throughput improvement on top of Ada's already-optimized code, primarily by improving code layout for the https:// fast path and the most common hostname patterns.
# PGO build for Ada-linked applications # Step 1: instrumented build cmake -B build -DCMAKE_BUILD_TYPE=Release \ -DCMAKE_CXX_FLAGS="-O2 -march=native -fprofile-generate" cmake --build build # Step 2: run with representative traffic (capture URL patterns) ./build/your_gateway --replay production_urls.pcap # Step 3: optimized build cmake -B build_pgo -DCMAKE_BUILD_TYPE=Release \ -DCMAKE_CXX_FLAGS="-O2 -march=native -fprofile-use -flto" cmake --build build_pgo
Takeaways
- Ada is the fastest WHATWG-compliant URL parser available — the 7x advantage over cURL in apples-to-apples full parsing is real and reproducible
- Mean throughput overstates production gains — under memory pressure, Ada's advantage narrows from 7x to 3-4x as cache misses dominate
- The 11.7x CanParse claim compares different operations — validation vs. full parse; the real apples-to-apples number is 7.1x, which is still excellent
- Cache-friendly data layout is Ada's real innovation — the single-buffer offset design keeps parsed URLs in one cache-line sequence, not scattered across the heap
- P99 latency converges under tail conditions — at P99.99, both parsers are bottlenecked on the same external factors (cache misses, allocator contention, kernel interrupts)
- Platform differences are significant — Apple M4's wide pipeline and large L1i cache flatters all parsers; test on your deployment target, not your development machine
- Prefetching URL buffers before parsing reduces P99 by ~30% for free — this is the single highest-ROI optimization you can make when integrating any URL parser
URL parsing is a solved problem in the sense that Ada has won. It is an unsolved problem in the sense that no parser can outrun your cache hierarchy. Optimize the integration, not the parser.
Sources & Further Reading
- Yagiz Nizipli, "State of URL parsing performance in 2025" — the benchmark data and parser comparisons this analysis builds upon
- Yagiz Nizipli & Daniel Lemire, "Parsing Millions of URLs per Second" — the academic paper detailing Ada's design and methodology
- Ada URL Parser (GitHub) — source code, v3.3.0
- WHATWG URL Standard — the specification Ada implements
- RFC 3986 — the specification cURL implements