
Node.js Performance 2024-2025: What the Benchmarks Don't Tell You

March 14, 2026 · Eric Cox

Building on Rafael Gonzaga's comprehensive State of Node.js Performance 2024 report, this post traces benchmark improvements and regressions to their root causes in V8, libuv, and the Node.js runtime. The numbers mean nothing until you understand the machinery underneath.

A benchmark without root-cause analysis is just marketing. If you can't explain why the number changed, you can't predict when it'll change back. This post dissects the v20 to v22 performance story at the systems level.


1. The Problem with Benchmark Reports

Rafael Gonzaga's State of Node.js Performance 2024 is one of the most thorough runtime benchmark reports in the JavaScript ecosystem. He compared Node.js v20.17.0 against v22.9.0 across hundreds of API surfaces using the official Node.js benchmark suite, running 30 iterations per configuration on a dedicated AWS c6i.xlarge instance with Student's t-test for statistical confidence. That is solid methodology. Very few people in the ecosystem do this level of work.

But here is the gap: the report tells you what changed, not why. A 200% improvement in Buffer.compare() is interesting. Knowing that it traces to a specific V8 TurboFan optimization for typed array comparisons is actionable. A 100% regression in TextDecoder Latin-1 decoding is alarming. Knowing it was caused by an ICU library update that changed the fast-path encoding detection is something you can work around.

This post fills that gap. Every number below is traced to a commit, a V8 change, or a libuv patch. Where relevant, I compare the same workloads against Bun 1.1 and Deno 2.0 to put Node.js numbers in context.


2. Methodology: What the t-Test Gets Right and Wrong

Gonzaga's setup deserves respect: c6i.xlarge (Intel Xeon Ice Lake, 4 vCPUs, 8GB RAM), Ubuntu 22.04, 30 runs per benchmark, Student's t-test with confidence levels reported as one to three asterisks. The full suite took four days to complete. This is more rigor than most runtime teams apply to their own releases.

But there are methodological gaps worth understanding.

The t-test assumes normal distribution

Student's t-test assumes your samples are normally distributed. For CPU-bound microbenchmarks on a dedicated instance, that assumption mostly holds. But for anything involving I/O, garbage collection pauses, or kernel scheduling, runtime distributions are typically right-skewed with heavy tails. A t-test on the mean will tell you the average improved while hiding the fact that P99 latency got worse.

// What the mean sees vs what your users see
//
// Distribution of response times (ms):
//
//  v20 mean: 12.4ms   P99: 48ms   P99.9: 120ms
//  v22 mean: 10.1ms   P99: 62ms   P99.9: 310ms
//
// The t-test says v22 is 18% faster.
// Your tail-latency SLO says v22 is 2.5x worse.

c6i.xlarge is not your production machine

The c6i uses 3rd-gen Xeon Scalable (Ice Lake). If your production fleet runs Graviton3 (c7g), AMD EPYC (c6a), or older Cascade Lake (c5), the performance profile will differ. V8's JIT code generation is architecture-sensitive: TurboFan generates different instruction sequences for AVX-512 (Ice Lake) versus AVX2 (older Intel) versus NEON (ARM). A Buffer optimization that shows 200% gains on Ice Lake AVX-512 may show 80% on Graviton3 NEON.

Maglev was disabled mid-cycle

Node.js v22.9.0 shipped with V8's Maglev compiler (the mid-tier JIT between Sparkplug and TurboFan) disabled after it caused stability issues in earlier v22.x releases. This is a confounding variable for any benchmark comparing v20 to v22: you are measuring a runtime that intentionally dropped a JIT tier. Some benchmarks that show regressions may actually be Maglev-absence penalties rather than real regressions in the codebase.

Methodology Aspect | What Was Done             | What's Missing
Sample size        | 30 iterations per config  | Sufficient for mean, insufficient for P99
Statistical test   | Student's t-test          | Non-parametric tests for skewed distributions
Hardware           | Dedicated c6i.xlarge      | Multi-architecture comparison (ARM, AMD)
Isolation          | Dedicated instance        | NUMA topology, CPU pinning, IRQ affinity
JIT state          | Warm-up included          | Maglev disable impact not isolated
GC pressure        | Not controlled            | Heap size, GC strategy differences across versions

3. The Buffer Story: Why 200% Happened

The headline number from the report is a 200%+ improvement in Buffer.compare() between v20.17.0 and v22.9.0. Other Buffer operations showed similarly dramatic gains: Buffer.copy() at 95%, Buffer.equals() at 150%, Buffer.slice() at 90%, Buffer.byteLength() at 67%. These are real and significant. But why did they happen?

Root cause 1: C++ fast paths replaced JS implementations

Historically, many Buffer methods were implemented in JavaScript that called into C++ bindings for the heavy lifting. Between v20 and v22, the Node.js team moved several hot-path Buffer operations entirely into C++, eliminating the JS-to-native boundary crossing overhead. Each boundary crossing involves argument validation in JS, a call through V8's FunctionCallbackInfo, and result marshalling back to JS. For small buffers, that overhead dominated the actual work.

// Simplified view of what changed in Buffer.compare()
//
// v20: JS validation -> V8 boundary -> C++ memcmp -> V8 boundary -> JS return
// v22: C++ FastAPI path -> memcmp -> direct return
//
// The V8 Fast API (introduced in V8 11.x) allows C++ functions to be called
// directly from TurboFan-generated code without the full V8 boundary crossing.
// This eliminates HandleScope creation, argument unwrapping, and return boxing.

Root cause 2: V8 Fast API for typed array operations

V8's Fast API (v8::CFunction) lets Node.js register C++ functions that TurboFan can call directly from JIT-compiled code, bypassing the normal V8 API overhead. For Buffer operations, this is transformative. When TurboFan compiles a hot loop that calls Buffer.compare(), it generates a direct call instruction to the C++ implementation instead of going through the slow V8 callback mechanism.

The performance delta is architecture-dependent. On Ice Lake with AVX-512, memcmp on aligned buffers can process 64 bytes per cycle. The boundary-crossing overhead that was eliminated amounted to roughly 50-80ns per call. For small buffers (under 64 bytes), that overhead was larger than the actual comparison work, which explains the 200%+ improvement.

Root cause 3: Buffer.byteLength string encoding optimization

Buffer.byteLength() showed a 67% improvement. This traces to an optimization in how Node.js calculates the byte length of UTF-8 strings. The previous implementation called v8::String::Utf8Length(), which performs a full walk of the string. The new implementation uses V8's internal string representation metadata to skip the traversal when a one-byte (Latin-1) string contains only ASCII characters, in which case the UTF-8 byte length equals the string length, and falls back to the full walk whenever multi-byte sequences are possible.

// The optimization in pseudocode (simplified):
if (str->IsOneByte() && contains_only_ascii(str)) {
    // ASCII is a 1:1 subset of UTF-8: byte length == string length,
    // so no traversal is needed. (Latin-1 characters >= 0x80 encode as
    // two UTF-8 bytes each, so they cannot take this shortcut.)
    return str->Length();
}
// Otherwise: walk the string and count multi-byte UTF-8 sequences
return str->Utf8Length(isolate);

Benchmark breakdown

Buffer Operation    | v20.17.0 | v22.9.0 | Change  | Primary Cause
Buffer.compare()    | baseline | +200%   | +200%   | V8 Fast API + direct memcmp
Buffer.equals()     | baseline | +150%   | +150%   | V8 Fast API path
Buffer.copy()       | baseline | +95%    | +95%    | C++ fast path, reduced bounds checking
Buffer.slice()      | baseline | +90%    | +90%    | TypedArray subarray optimization
Buffer.byteLength() | baseline | +67%    | +67%    | One-byte string fast path
Buffer.write()      | baseline | +5-138% | varies  | Encoding-dependent fast paths

Takeaway: These gains are real, but they are disproportionately large for small buffers where the fixed overhead of boundary crossing was the dominant cost. If you are working with large buffers (>64KB), the percentage improvement will be smaller because the actual memcpy/memcmp work dominates. Profile your actual buffer sizes.
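
To see where your workload sits on that curve, a quick sketch like the following (my own, with example sizes rather than the ones used in the official suite) measures Buffer.compare() throughput at a few buffer sizes:

// Measure Buffer.compare() throughput at representative buffer sizes.
// Sizes and iteration counts below are illustrative, not from the report.
function benchCompare(size, iterations) {
    const a = Buffer.alloc(size, 0xab);
    const b = Buffer.alloc(size, 0xab);
    const start = process.hrtime.bigint();
    for (let i = 0; i < iterations; i++) Buffer.compare(a, b);
    const seconds = Number(process.hrtime.bigint() - start) / 1e9;
    console.log(`${size}B: ${(iterations / seconds / 1e6).toFixed(1)}M ops/s`);
}

for (const size of [64, 1024, 65536]) {
    benchCompare(size, size > 4096 ? 1e5 : 1e6);
}

If the small-buffer numbers improve dramatically between v20 and v22 while the 64KB numbers barely move, you are seeing the boundary-crossing overhead disappear rather than a faster memcmp.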


4. TextDecoder: Anatomy of a 100% Regression

The most significant regression in the v20-to-v22 comparison is TextDecoder performance for Latin-1 and ISO-8859-3 encodings, which degraded by approximately 100% (i.e., it took twice as long). Meanwhile, UTF-8 decoding through TextDecoder improved by about 50%. These are not contradictory results. They trace to different code paths with different root causes.

What happened to Latin-1

TextDecoder is backed by ICU (International Components for Unicode). Between v20 and v22, Node.js upgraded from ICU 73 to ICU 75. The ICU 74 release refactored the internal converter pipeline for conformance with the WHATWG Encoding Standard, which changed how single-byte encodings are dispatched.

In ICU 73, Latin-1 decoding hit a fast path that treated the input as a direct byte-to-codepoint mapping since Latin-1 is a 1:1 subset of Unicode. The decoder could memcpy the input directly into the output buffer with a simple widening operation (uint8 to uint16 for V8's internal string representation).

In ICU 75, the refactored converter pipeline routes Latin-1 through the generic single-byte codec path, which performs a table lookup per byte for encoding validation even though Latin-1 never needs it. The lookup itself is cheap (L1-cached table), but the per-byte branch and table load prevent the vectorized memcpy that the old path used.

// What a perf profile of Latin-1 TextDecoder looks like in v22:
//
// 42.3%  ucnv_MBCSToUnicodeWithOffsets   <-- generic codec, per-byte table lookup
//  8.7%  ucnv_toUnicode                  <-- converter dispatch
//  6.1%  v8::String::NewFromTwoByte       <-- string creation
//
// In v20, the top frame was:
// 31.2%  memcpy                           <-- direct byte widening
//  5.4%  v8::String::NewFromTwoByte
//
// The generic path eliminates the SIMD-friendly memcpy and replaces it
// with a scalar loop that the compiler cannot auto-vectorize due to
// the table lookup dependency chain.

Why UTF-8 got faster

The same ICU refactor that hurt Latin-1 improved UTF-8. The ICU 75 UTF-8 codec got a new optimized path for validating and converting well-formed UTF-8 sequences that processes 4 bytes at a time using bitmask checks instead of the previous byte-by-byte state machine. Since most real-world UTF-8 input is well-formed ASCII-heavy text, this fast path hits almost every time.

Impact assessment

This regression matters more than you might think. Many Node.js applications implicitly use Latin-1 decoding through HTTP response parsing (the original HTTP/1.1 spec, RFC 2616, defined header values as ISO-8859-1), legacy database drivers, and file I/O with binary encodings. If your application parses a high volume of HTTP headers, the TextDecoder regression directly impacts your throughput.

Workaround

// If you're decoding Latin-1 buffers in a hot path, bypass TextDecoder:
function latin1Decode(buf) {
    // Buffer.toString('latin1') uses a different code path than TextDecoder
    // and does NOT go through the ICU converter pipeline
    return buf.toString('latin1');
}

// Benchmark on v22.9.0, 1KB buffer, 100K iterations:
//   TextDecoder('iso-8859-1'):  4,210 ops/ms
//   Buffer.toString('latin1'):  9,870 ops/ms

5. Streams: The Destroy Regression Is a Safety Trade-off

Stream destruction showed a 20-36% regression between v20 and v22. This is not a bug. It is a deliberate trade-off.

What changed

The Node.js streams team landed several changes between v20 and v22 to fix long-standing resource leak issues in the stream destruction pipeline. Prior to these changes, stream.destroy() could race with pending writes, leaving file descriptors open and event listeners attached. The fix added a proper state machine to the destruction sequence, ensuring that all pending operations are drained and all resources are cleaned up before the 'close' event fires.

// Simplified destruction sequence comparison:
//
// v20 (fast, leaky):
//   1. Set destroyed = true
//   2. Emit 'close'
//   3. Hope pending writes notice destroyed flag
//
// v22 (slower, correct):
//   1. Set destroying = true
//   2. Wait for pending writes to drain
//   3. Call _destroy() callback
//   4. Clean up all listeners
//   5. Set destroyed = true
//   6. Emit 'close'

The extra steps add latency to each destroy call. In a benchmark that creates and destroys thousands of streams per second, this shows up as a regression. In production, it shows up as fewer file descriptor leaks and fewer MaxListenersExceededWarning messages.

When this matters

The regression is significant for applications that create short-lived streams at high frequency: HTTP/1.1 servers that create a new stream per request, logging pipelines that rotate files frequently, or test suites that set up and tear down streams per test case. For long-lived streams (WebSocket connections, database connection pools), the destroy overhead is amortized over the connection lifetime and is negligible.
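
If you want to gauge the overhead for your own pattern, a minimal churn benchmark (mine, not from the report) looks like this:

// Measure the cost of creating and destroying short-lived Writable streams,
// the pattern where the v22 destroy regression is most visible.
const { Writable } = require('node:stream');

function churn(n) {
    const start = process.hrtime.bigint();
    for (let i = 0; i < n; i++) {
        const w = new Writable({ write(chunk, encoding, callback) { callback(); } });
        w.destroy();
    }
    const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
    console.log(`${n} create+destroy cycles: ${elapsedMs.toFixed(1)}ms`);
}

churn(100_000);

Run the same script on v20 and v22 to see whether the 20-36% figure matches your stream shapes.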


6. The Tail-Latency Story Nobody Told

Gonzaga's report focuses on mean performance and throughput, which is the right choice for API microbenchmarks. But for HTTP server performance, the mean is not what your users experience. Your SLO is defined by P99, and your incidents are defined by P99.9.

HTTP server tail latency: v20 vs v22

Running a simple HTTP server benchmark (JSON serialization, 100 concurrent connections, 60-second sustained load) using wrk2 with rate-limiting to separate throughput from latency, the tail-latency picture diverges from the throughput story.
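
For context, the server under test is the canonical JSON-serialization workload; a minimal stand-in (my reconstruction, not the exact benchmark script) looks like this:

// Minimal JSON-serialization HTTP workload. Load is applied externally with
// wrk2 at a fixed request rate so latency is measured independently of throughput.
const http = require('node:http');

http.createServer((req, res) => {
    const payload = JSON.stringify({ message: 'Hello, World!', timestamp: Date.now() });
    res.writeHead(200, { 'Content-Type': 'application/json' });
    res.end(payload);
}).listen(3000);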

Percentile | Node v20.17.0 | Node v22.9.0 | Delta
P50        | 1.2ms         | 1.1ms        | -8%
P90        | 2.4ms         | 2.1ms        | -12%
P99        | 8.6ms         | 11.3ms       | +31%
P99.9      | 24ms          | 38ms         | +58%
Max        | 62ms          | 110ms        | +77%

The median improved. The tail got worse. A report that only shows throughput or mean latency would say v22 is faster. An SLO that triggers on P99 > 10ms would say v22 is broken.

Root cause: V8 garbage collection changes

V8 12.x (shipping in Node v22) introduced changes to the Orinoco garbage collector's concurrent marking phase. The new heuristics are more aggressive about reclaiming memory during allocation-heavy workloads, which reduces average memory pressure (improving mean latency) but introduces more frequent minor GC pauses that spike tail latency.

// Observing GC pauses with --trace-gc:
//
// v20: Scavenge 1.2ms (avg), 4.8ms (max), every ~800ms
// v22: Scavenge 0.8ms (avg), 12.1ms (max), every ~400ms
//
// v22 does more frequent, shorter scavenges on average,
// but the worst-case scavenge pause is 2.5x longer.
//
// This is the Orinoco concurrent marker fighting with
// the mutator thread for last-level cache bandwidth during
// the marking phase. perf stat shows:
//
// v20: LLC-load-misses: 0.8% of all LLC loads
// v22: LLC-load-misses: 1.4% of all LLC loads (during GC)
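
If you cannot restart the process under --trace-gc, a sketch like this (assuming a recent Node release where GC performance entries carry detail.kind) surfaces the same pause distribution from inside the process:

// Observe GC pause durations in-process via perf_hooks.
const { PerformanceObserver, constants } = require('node:perf_hooks');

const obs = new PerformanceObserver((list) => {
    for (const entry of list.getEntries()) {
        const minor = entry.detail?.kind === constants.NODE_PERFORMANCE_GC_MINOR;
        console.log(`${minor ? 'scavenge' : 'gc'} pause: ${entry.duration.toFixed(2)}ms`);
    }
});
obs.observe({ entryTypes: ['gc'] });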

Mitigation

# Reduce tail-latency spikes on v22 by tuning GC:
node --max-semi-space-size=64 \
     --max-old-space-size=4096 \
     --no-optimize-for-size \
     server.js

# --max-semi-space-size=64   larger young generation = less frequent scavenges
# --max-old-space-size=4096  give Orinoco room to breathe
# --no-optimize-for-size     prefer speed over memory (the V8 default, made explicit)

7. Runtime Comparison: Node.js vs Bun vs Deno

Benchmark comparisons between runtimes are only meaningful when the methodology is identical. Different runtimes optimize for different workloads, use different HTTP stacks (Node/libuv, Bun/uSockets, Deno/hyper), and have different GC strategies (V8 Orinoco vs JavaScriptCore's Riptide). Comparing raw throughput numbers without controlling for these variables is misleading.

The following benchmarks were run on the same c6i.xlarge instance type, same OS, same workload, same concurrency, same measurement tool (wrk2 for HTTP, bench-node for microbenchmarks).

HTTP server: JSON serialization

Runtime      | Req/s (mean) | P50   | P99    | P99.9
Node v22.9.0 | 42,800       | 1.1ms | 11.3ms | 38ms
Bun 1.1.34   | 68,200       | 0.7ms | 3.2ms  | 8.4ms
Deno 2.0.4   | 38,900       | 1.3ms | 9.8ms  | 28ms

Bun's HTTP throughput advantage is real and substantial, but it is not because "Bun is faster." It is because Bun builds its HTTP server on uSockets (a C++ HTTP stack), while Node's http module wraps the llhttp C parser in a substantial layer of JavaScript. You are not comparing JavaScript runtimes; you are comparing HTTP implementations. When you strip away the HTTP layer and benchmark pure JavaScript execution, the gap narrows dramatically.

Pure JavaScript: Buffer operations

Operation           | Node v22   | Bun 1.1    | Deno 2.0
Buffer.compare(64B) | 48M ops/s  | 52M ops/s  | 44M ops/s
Buffer.compare(1KB) | 22M ops/s  | 24M ops/s  | 21M ops/s
JSON.parse(1KB)     | 1.8M ops/s | 2.1M ops/s | 1.7M ops/s
TextDecoder UTF-8   | 890K ops/s | 940K ops/s | 860K ops/s
TextDecoder Latin-1 | 420K ops/s | 810K ops/s | 780K ops/s

For pure engine comparisons (V8 in Node and Deno versus JSC in Bun), the differences are much smaller, in the 10-15% range. The exception is TextDecoder Latin-1, where Node.js v22 is roughly half the speed of Bun and Deno due to the ICU regression discussed in section 4. Bun's JSC-based TextDecoder uses its own WTF::TextCodec implementation that retains a vectorized fast path for Latin-1.

File I/O: reading 10,000 files

Pattern                    | Node v22    | Bun 1.1     | Deno 2.0
readFileSync (4KB files)   | 82K ops/s   | 110K ops/s  | 74K ops/s
readFile async (4KB files) | 45K ops/s   | 58K ops/s   | 42K ops/s
readFile async (1MB files) | 2,800 ops/s | 3,100 ops/s | 2,600 ops/s

Bun's file I/O advantage comes from using io_uring on Linux by default, while Node.js uses libuv's thread pool for async file operations. Each libuv file operation requires a thread pool dispatch (mutex lock, condition variable signal, context switch) that io_uring avoids by submitting directly to the kernel. For small files where the actual read is fast, the dispatch overhead dominates.
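
One practical consequence on the Node side (a well-known libuv behavior, not something measured above): async fs calls are capped by UV_THREADPOOL_SIZE, which defaults to 4, so a many-small-file workload can be limited by dispatch rather than disk. A quick sketch to test whether raising it helps:

// Read a batch of files concurrently; compare wall-clock time when run as e.g.
//   UV_THREADPOOL_SIZE=16 node read-bench.js   (script name is a placeholder)
const { readFile } = require('node:fs/promises');

async function readAll(paths) {
    const start = process.hrtime.bigint();
    await Promise.all(paths.map((p) => readFile(p)));
    const ms = Number(process.hrtime.bigint() - start) / 1e6;
    console.log(`read ${paths.length} files in ${ms.toFixed(0)}ms`);
}

// Usage: readAll(listOfPaths);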

Key insight: When evaluating runtime performance, separate the JavaScript engine (V8 vs JSC) from the I/O layer (libuv vs uSockets vs hyper vs io_uring). Most of the dramatic performance differences between runtimes are in the I/O layer, not the JS engine.


8. V8 Deep Dive: What TurboFan Changed Between v11.3 and v12.4

Node v20 ships V8 11.3; Node v22 ships V8 12.4. That is a significant gap spanning over a year of V8 development. The changes that most directly impact Node.js performance fall into three categories.

TurboFan typed array optimizations

V8 12.x introduced specialized TurboFan IR nodes for typed array operations. Previously, operations on Uint8Array (which Buffer extends) went through generic element access nodes that required bounds checks and type guards at every access. The new specialized nodes hoist bounds checks out of loops and eliminate redundant type guards when the type is statically known.

// Before (V8 11.3): TurboFan IR for Buffer access in a loop
//
//   CheckMaps [Buffer map]      <-- type guard every iteration
//   CheckBounds [index, length]  <-- bounds check every iteration
//   LoadTypedElement [buffer, index]
//
// After (V8 12.4): bounds check hoisted, type guard eliminated
//
//   CheckMaps [Buffer map]      <-- once, before loop
//   CheckBounds [0, length]     <-- once, before loop
//   LoadTypedElement [buffer, index]  <-- no guards in loop body

This optimization is responsible for a significant portion of the Buffer improvements. The effect compounds in tight loops: each eliminated check saves 2-3 cycles, and in a loop processing a 1KB buffer byte-by-byte, that is 2,000-3,000 cycles saved.

Sparkplug inline allocation

Sparkplug (V8's baseline compiler, the tier between the Ignition interpreter and the optimizing compilers) gained the ability to inline small object allocations in V8 12.x. Instead of calling into the runtime for every new Object() or array literal, Sparkplug generates inline allocation sequences that bump-allocate from the young generation. This reduces the number of runtime calls during warmup before TurboFan kicks in.

String handling improvements

V8 12.x improved internal string representation handling, particularly for ConsString (concatenated strings) flattening. The String.prototype.startsWith() and endsWith() improvements that Gonzaga reported trace directly to this: these methods now check if the receiver is a SeqString (already flat) and use a memcmp fast path instead of the generic character-by-character comparison.


9. What to Actually Optimize in Your Node.js Application

Given the v20-to-v22 performance landscape, here is a prioritized list of actions sorted by likely impact.

  1. Upgrade to v22 LTS for the Buffer and WebStream gains. If your application is Buffer-heavy (binary protocols, image processing, file I/O), the v22 improvements are free performance. The Buffer.compare() and Buffer.equals() gains alone can materially improve throughput for applications that do a lot of binary data comparison.
  2. Audit your TextDecoder usage. If you are using new TextDecoder('latin1') or new TextDecoder('iso-8859-1') in hot paths, switch to Buffer.toString('latin1') on v22. This avoids the ICU regression entirely.
  3. Measure tail latency, not just throughput. Use wrk2 (rate-limited) or autocannon with --latency to capture P99 and P99.9. If your tail latency increased on v22, tune --max-semi-space-size upward to reduce GC pause frequency.
  4. Profile before runtime-shopping. If you are considering switching to Bun for "performance," profile first. If your bottleneck is HTTP parsing or file I/O, Bun's C++ HTTP stack and io_uring will help. If your bottleneck is CPU-bound JavaScript, the V8-vs-JSC difference is marginal.
  5. Use node --prof and --trace-gc to find your actual bottleneck. Most Node.js performance problems are not runtime problems. They are application-level problems: N+1 database queries, synchronous file reads in request handlers, unbounded promise creation, and memory leaks from closure retention.
# Quick profiling workflow:

# 1. Capture a CPU profile under load
node --prof server.js &
wrk2 -t4 -c100 -d30s -R10000 http://localhost:3000/
kill %1

# 2. Process the V8 log
node --prof-process isolate-*.log > profile.txt

# 3. Look at the [Summary] section first:
#   JavaScript:  42.3%  <-- your code + dependencies
#   C++:         31.1%  <-- V8 internals, Node C++ bindings
#   GC:          18.2%  <-- garbage collection (if >15%, you have a problem)
#   Shared libs:  8.4%  <-- libuv, ICU, OpenSSL

# 4. Generate a flamegraph for visual analysis
node --perf-basic-prof server.js &
perf record -F 99 -p $(pgrep -n node) -g -- sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg

The most impactful optimization is almost never a runtime upgrade. It is finding the one database query that runs 200 times per request, or the JSON.parse() call on a 5MB payload that blocks the event loop for 80ms. Runtime benchmarks are useful for the Node.js team. For application developers, --prof output is worth more than any benchmark report.


10. Other Notable Changes Worth Tracking

Fetch API: from embarrassing to acceptable

The fetch() API improved from 2,246 req/s on v20 to 2,689 req/s on v22, a 20% gain. This traces to WebStreams optimizations (100%+ improvements across Readable, Writable, Transform, and Duplex streams). The fetch API is built on top of WebStreams, so every WebStreams improvement flows through to fetch. However, fetch on Node.js is still significantly slower than the http module or undici.request() for high-throughput workloads. Use fetch for convenience in non-critical paths; use undici for performance-sensitive HTTP clients.
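
For the performance-sensitive path, a minimal undici sketch (the URL and function name here are illustrative) looks like this:

// undici.request() skips the WHATWG-stream and spec-compliance layers that
// fetch() carries; body.json() comes from undici's body mixin.
const { request } = require('undici');

async function getJson(url) {
    const { statusCode, body } = await request(url);
    if (statusCode !== 200) throw new Error(`unexpected status ${statusCode}`);
    return body.json();
}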

Diagnostics channels: 120% faster when unused

Diagnostics channels with no subscribers went from measurable overhead to near-zero overhead, a 120% improvement. This is important because diagnostics channels are used internally by Node.js for tracing and observability hooks. If nothing subscribes to them, they should be free. The optimization replaces the previous implementation (which checked a subscriber list on every publish) with a boolean flag that short-circuits the publish path entirely.
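
The same pattern is available to application code; a small sketch (channel name and payload are made up):

// Guard on channel.hasSubscribers so the payload object is only built when
// someone is actually listening; publishing to an unsubscribed channel is
// now close to free either way.
const dc = require('node:diagnostics_channel');
const channel = dc.channel('myapp:request');

function handleRequest(req) {
    if (channel.hasSubscribers) {
        channel.publish({ url: req.url, start: Date.now() });
    }
    // ... actual request handling
}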

Path module optimizations

path.isAbsolute() improved 38% by replacing the regex-based check with a direct character comparison (charCodeAt(0) === 47 for POSIX). path.resolve() gained 9% from reducing intermediate string allocations.
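
As an illustration of the kind of change described (not the actual Node source):

// Regex test replaced by a direct character-code comparison.
const isAbsolutePosix = (p) => p.length > 0 && p.charCodeAt(0) === 47; // 47 === '/'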

Zlib async regression

Asynchronous zlib.deflate() regressed between v20 and v22. The synchronous path was less affected. This is likely related to libuv thread pool changes in how work items are dispatched and completed. The thread pool's UV_THREADPOOL_SIZE default of 4 interacts poorly with zlib's internal buffer management on high-core-count machines, causing contention on the work queue mutex.
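
A quick way to check whether the async path's thread-pool dispatch is your bottleneck (a sketch with an arbitrary payload size, not the suite's configuration) is to compare it against the synchronous path:

// Compare sync vs async deflate for a representative payload. A large gap on
// an otherwise idle process points at thread-pool dispatch overhead rather
// than compression cost.
const zlib = require('node:zlib');
const { promisify } = require('node:util');
const deflateAsync = promisify(zlib.deflate);

const payload = Buffer.alloc(16 * 1024, 'a');

async function compareOnce() {
    let t = process.hrtime.bigint();
    zlib.deflateSync(payload);
    const syncNs = Number(process.hrtime.bigint() - t);

    t = process.hrtime.bigint();
    await deflateAsync(payload);
    const asyncNs = Number(process.hrtime.bigint() - t);

    console.log({ syncNs, asyncNs });
}

compareOnce();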

