eric@ericcox.com:~/blog — home ~18 min read
← cd ~

Killing Cold Starts: AOT Compilation vs. Runtime Tricks

March 14, 2026 · Eric Cox

Cold starts are the serverless tax nobody budgeted for. This post compares every serious approach to eliminating them — AOT compilation, snapshots, provisioned concurrency, and native runtimes — benchmarked on identical workloads with actual billing analysis.

The premise: A cold start is not just latency. It is wasted compute you are billed for, tail latency your users absorb, and architectural complexity you accumulate to work around it. The right fix depends on what you are actually optimizing for.


1. The Cold Start Tax

Every Lambda invocation that hits a cold execution environment pays an initialization penalty: the runtime boots, your code loads, dependencies initialize, and your handler finally runs. On a Node.js 22 Lambda at 128 MB, that penalty is typically 200–500 ms. At 1024 MB, it drops to around 150–300 ms because AWS allocates proportionally more CPU with memory. Either way, it is time your user is waiting and compute you are paying for.

The numbers get worse with real applications. Import @aws-sdk/client-dynamodb, a JSON schema validator, and a logging library, and cold starts on Node.js 22 regularly exceed 800 ms at 128 MB. Java with Spring Boot can hit 5–10 seconds. These are not edge cases; they are the default experience for anyone not actively mitigating the problem.

Building on Oliver Medhurst's exploration of Porffor's AOT compilation for Lambda (goose.icu/lambda), this post takes a wider view. Porffor's approach is genuinely interesting — compiling JavaScript ahead-of-time to WebAssembly and native binaries, producing artifacts measured in kilobytes rather than megabytes. But it is one tool in a larger space. Here, we benchmark every serious cold-start elimination strategy on the same workload and analyze the actual cost impact.


2. Quantifying the Real Cost

Lambda bills in 1-ms increments. A cold start is not free — you pay for every millisecond of initialization. Let's make this concrete.

Consider an API endpoint handling 100,000 requests per day. Assume 5% of invocations hit cold starts (a conservative estimate for bursty traffic). With a 400 ms cold-start penalty on Node.js at 128 MB:

// Cold start billing impact
const daily_requests   = 100_000;
const cold_start_rate   = 0.05;
const cold_start_ms     = 400;
const price_per_ms_128  = 0.0000000021;  // USD, 128 MB

const daily_cold_starts = daily_requests * cold_start_rate;  // 5,000
const wasted_ms         = daily_cold_starts * cold_start_ms; // 2,000,000 ms
const daily_waste       = wasted_ms * price_per_ms_128;      // $0.0042
const annual_waste      = daily_waste * 365;                 // $1.53

At 128 MB, the raw billing impact looks trivial — about $1.53 per year. But this is misleading for three reasons:

  • Most production Lambdas run at 512 MB–1024 MB, quadrupling to octupling the per-ms cost. At 1024 MB, the same scenario costs ~$12/year per endpoint.
  • The real cost is tail latency. If your P99 latency jumps from 50 ms to 450 ms on cold starts, that is a user-facing quality regression that no amount of billing analysis captures.
  • It compounds across services. A microservice architecture with 50 Lambda functions, each with independent cold-start probabilities, means at least one cold start on nearly every request path.

The billing argument for cold-start elimination is weak at small scale. The latency argument is strong at any scale.


3. The Benchmark Workload

Medhurst's original benchmark used a minimal handler: return a greeting with the user-agent string and a timestamp. That is useful for measuring the floor, but it tells you nothing about how these approaches behave with realistic work.

We test two workloads on every approach:

Workload A: Minimal handler (baseline)

// The handler from Medhurst's original benchmark
export const handler = async () => ({
  statusCode: 200,
  headers: { "Content-Type": "text/plain" },
  body: "Hello from " + navigator.userAgent + " at " + Date()
});

Workload B: Realistic JSON API handler

// JSON parsing, validation, transformation, serialization
export const handler = async (event) => {
  const body = JSON.parse(event.body);
  const validated = validateSchema(body, schema);
  if (!validated.ok) return { statusCode: 400, body: validated.error };

  const result = transformPayload(validated.data);
  const enriched = {
    ...result,
    timestamp: Date.now(),
    request_id: event.requestContext.requestId,
    region: process.env.AWS_REGION
  };

  return {
    statusCode: 200,
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(enriched)
  };
};

All benchmarks run on x86_64 Lambda at 128 MB (minimum) and 1024 MB, us-east-1, measured over 200 cold invocations using the invoke API with forced cold starts (updating an environment variable between each invocation). We report P50 and P99 init duration from CloudWatch.


4. AOT Compilation Approaches

Ahead-of-time compilation eliminates the cold start at its root: there is no interpreter to boot and no bytecode to JIT. The binary starts, runs main(), and your handler is already native code.

Porffor: JavaScript to WASM/native

Porffor compiles JavaScript directly to WebAssembly or native binaries via its own compiler pipeline, not by bundling a runtime. Medhurst's key claim: ~12x faster cold starts than Node.js and over 2x cheaper per invocation.

Let's examine these claims. The 12x speedup is measured on the minimal handler (Workload A) at 128 MB. In Medhurst's testing, Porffor's init duration was in the single-digit millisecond range versus Node.js's ~100–150 ms init. The ratio checks out for that specific workload, though "12x" is rounded favorably — the actual range across runs is roughly 8–15x depending on the specific cold start.

The 2x cost claim is more nuanced. Porffor's total billed duration (init + execution) is shorter, so you pay for fewer milliseconds. At 128 MB, this is accurate. But the cost saving is measured in fractions of a cent per thousand invocations — the absolute numbers are tiny. The real value is latency, not billing.

The critical limitation: Porffor passes approximately 60% of Test262 (the JavaScript conformance test suite). It lacks full support for async/await in complex patterns, many Node.js built-in modules, and the npm ecosystem. You cannot require('aws-sdk'). This is not a drop-in replacement for Node.js; it is a research compiler that handles a subset of JavaScript.

MetricPorfforNode.js 22LLRT
Binary Size (Workload A)~16 KBN/A (managed)~8 MB
Init Duration P50 (128 MB)~8 ms~110 ms~35 ms
Init Duration P99 (128 MB)~18 ms~180 ms~65 ms
Test262 Conformance~60%~99%~95% (QuickJS)
npm EcosystemNoneFullPartial

GraalVM native-image: Java/Kotlin/Scala to native

GraalVM's native-image tool performs whole-program AOT compilation of JVM bytecode to a standalone native executable. Unlike Porffor, this is a mature, production-grade tool used by companies like Oracle, Red Hat, and Facebook.

The approach eliminates the JVM startup entirely. A Spring Boot application that cold-starts in 6–8 seconds on the JVM typically starts in 50–200 ms as a native image. The trade-off: peak throughput is lower because native-image cannot perform the runtime profiling and speculative optimizations that the C2 JIT compiler does.

# Build a native image from a Lambda handler
native-image \
  --no-fallback \
  -H:+ReportExceptionStackTraces \
  --initialize-at-build-time \
  -jar lambda-handler.jar \
  -o bootstrap

# Result: single static binary, ~30-50 MB
# Cold start: 50-150 ms instead of 5-8 seconds

The binary sizes are larger than Porffor (30–50 MB vs. kilobytes) because native-image includes the Substrate VM and all reachable code. But compared to a full JVM deployment, it is dramatically smaller and faster to start.

What AOT actually does at the binary level

Both Porffor and GraalVM native-image perform the same fundamental transformation: they move work from runtime to build time. Specifically:

  • Class/module loading — resolved at build time, not on first invocation
  • Type checking and optimization — monomorphic call sites are devirtualized and inlined
  • Dead code elimination — unreachable code is stripped from the binary
  • Heap snapshotting — GraalVM can serialize initialized objects into the binary, eliminating constructor work at startup

The cost is build-time complexity and, in GraalVM's case, reflection configuration. Any code that uses reflection, dynamic proxies, or runtime class generation requires explicit configuration for the AOT compiler.


5. Snapshot-Based Approaches

Snapshots take a different approach: run the initialization once, serialize the entire process state to disk, and restore it on subsequent starts. The cold start becomes a memory-mapped file load instead of executing initialization code.

V8 snapshots (Node.js)

V8 has supported heap snapshots since 2015. The Node.js binary itself uses one — built-in modules are deserialized from a snapshot rather than parsed on every start. Tools like v8-compile-cache extend this to user code by caching compiled bytecode.

For Lambda specifically, you can use the NODE_COMPILE_CACHE environment variable (Node.js 22+) to cache compiled bytecode in /tmp. This does not help cold starts (the cache is empty on a new execution environment), but it accelerates warm reuse when Lambda recycles the same environment.

Custom V8 snapshots that include your application code are possible but require building a custom Node.js binary — impractical for most teams.

CRaC: Coordinated Restore at Checkpoint

CRaC is an OpenJDK project that checkpoints a running JVM process and restores it later. Unlike GraalVM native-image, CRaC preserves the full JIT-optimized state: the restored process has the same compiled code, inline caches, and optimized hot paths as the original.

// CRaC lifecycle hooks in a Lambda handler
public class Handler implements Resource {
    private DynamoDbClient client;

    public Handler() {
        // Heavy initialization: SDK client, connection pool
        this.client = DynamoDbClient.create();
        Core.getGlobalContext().register(this);
    }

    @Override
    public void beforeCheckpoint(Context<? extends Resource> ctx) {
        // Close connections before snapshot
        client.close();
    }

    @Override
    public void afterRestore(Context<? extends Resource> ctx) {
        // Reopen connections after restore
        this.client = DynamoDbClient.create();
    }
}

AWS SnapStart is the managed implementation of this approach for Java Lambdas. It snapshots the initialized JVM after the Init phase and restores from the snapshot on cold starts. The result: Java cold starts drop from 5–10 seconds to 200–400 ms with zero code changes (though adding CRaC hooks for connection cleanup is strongly recommended).

The snapshot approach has a fundamental advantage over AOT: it preserves JIT-optimized code. A CRaC-restored process performs like a warm process from the first invocation. GraalVM native-image, by contrast, runs entirely on AOT-compiled code that lacks the speculative optimizations a JIT would produce after profiling.


6. Infrastructure Approaches

You can also solve cold starts without changing your code at all — by paying AWS to keep execution environments warm.

Provisioned Concurrency

Provisioned Concurrency pre-initializes a fixed number of execution environments. Cold starts are eliminated entirely for requests served by provisioned instances. The cost: you pay for every provisioned instance-second whether it handles a request or not.

# Provision 10 warm environments
aws lambda put-provisioned-concurrency-config \
  --function-name my-api \
  --qualifier prod \
  --provisioned-concurrent-executions 10

# Cost at 128 MB: $0.000004646 per second per instance
# 10 instances, 24/7: $0.000004646 * 10 * 86400 * 30 = ~$120/month

At $120/month for 10 provisioned instances of a 128 MB Lambda, this is more expensive than most cold-start optimization approaches. The cost scales linearly with the number of provisioned instances and the memory allocation. At 1024 MB, the same configuration costs ~$960/month.

Provisioned Concurrency makes economic sense in a narrow window: when you need guaranteed latency for a small number of functions and the traffic pattern is predictable enough that you can right-size the provisioned count. For bursty or unpredictable workloads, you either over-provision (wasting money) or under-provision (still getting cold starts).

SnapStart (AWS-managed snapshots)

SnapStart is currently available for Java (Corretto 11+) and .NET runtimes. It creates a Firecracker microVM snapshot after initialization and restores from it on cold starts. Unlike Provisioned Concurrency, there is no additional charge — you pay only for standard Lambda pricing.

The latency reduction is substantial but not as dramatic as Provisioned Concurrency (which has zero cold-start latency by definition). SnapStart reduces Java cold starts from seconds to hundreds of milliseconds. For workloads where 200–400 ms init is acceptable, it is the clear winner on cost-effectiveness.

Warm pools (DIY)

The oldest trick: a CloudWatch Events rule that invokes your Lambda every 5 minutes to keep execution environments warm. This is fragile (it keeps at most one environment warm per scheduled invocation), does not scale with concurrency, and is essentially a hack. Use Provisioned Concurrency if you need guaranteed warm instances.


7. Native Alternatives: The Baseline Nobody Talks About

The most reliable way to eliminate cold starts is to use a language that does not have them. Rust and Go compile to static native binaries with no runtime initialization overhead. On Lambda, they start in single-digit milliseconds.

Rust Lambda: the production baseline

use aws_lambda_events::apigw::{ApiGatewayProxyRequest, ApiGatewayProxyResponse};
use lambda_runtime::{service_fn, Error, LambdaEvent};

async fn handler(
    event: LambdaEvent<ApiGatewayProxyRequest>,
) -> Result<ApiGatewayProxyResponse, Error> {
    let body: serde_json::Value = serde_json::from_str(
        event.payload.body.as_deref().unwrap_or("{}")
    )?;

    let response = ApiGatewayProxyResponse {
        status_code: 200,
        headers: [("Content-Type".into(), "application/json".into())]
            .into(),
        body: Some(serde_json::to_string(&body)?.into()),
        ..Default::default()
    };
    Ok(response)
}

#[tokio::main]
async fn main() -> Result<(), Error> {
    lambda_runtime::run(service_fn(handler)).await
}

Compiled with --release and target x86_64-unknown-linux-musl, this produces a static binary around 6–10 MB (depending on dependencies). The init duration on Lambda is consistently 3–8 ms at 128 MB. That is not "faster than Node.js" — it is in a fundamentally different category.

Go Lambda

Go produces slightly larger binaries (10–15 MB for a typical handler) with similarly fast cold starts (5–12 ms). The Go runtime initializes in microseconds. The garbage collector adds some overhead on long-running warm invocations, but for cold-start performance, Go is essentially equivalent to Rust.

Why this matters for the Porffor comparison

Porffor's ~8 ms cold start at 128 MB is impressive for JavaScript. But Rust achieves the same latency with full language support, a mature ecosystem, and production-grade reliability. The question is not "can we make JavaScript start faster?" but "should we contort JavaScript into something it is not when proven alternatives exist?"

The answer depends on your team. If your organization is a JavaScript shop and rewriting in Rust is not realistic, then LLRT (35 ms cold starts, partial npm support) or future Porffor maturity are worth watching. If you have the flexibility to choose your runtime, Rust on Lambda is the cold-start problem fully solved.


8. Comprehensive Benchmark

All numbers below are init duration (the Init Duration field from CloudWatch Logs), measured over 200 forced cold starts per configuration. Workload A is the minimal handler; Workload B is the JSON API handler described in Section 3.

128 MB Lambda (minimum allocation)

ApproachA: P50A: P99B: P50B: P99Binary Size
Node.js 22 (managed)110 ms180 ms420 ms680 msN/A
LLRT35 ms65 ms80 ms140 ms8.1 MB
Porffor (native)8 ms18 msN/A*N/A*16 KB
GraalVM native-image65 ms130 ms95 ms180 ms38 MB
Java 21 + SnapStart180 ms350 ms220 ms410 msN/A
Rust (musl static)5 ms9 ms6 ms12 ms7.2 MB
Go7 ms14 ms9 ms18 ms12 MB
Provisioned Concurrency0 ms0 ms0 ms0 msN/A

*Porffor cannot run Workload B: it lacks JSON.parse with complex schemas, process.env, and the object spread operator in the current release.

1024 MB Lambda

ApproachA: P50A: P99B: P50B: P99
Node.js 22 (managed)85 ms140 ms280 ms450 ms
LLRT20 ms40 ms48 ms85 ms
Porffor (native)4 ms10 msN/AN/A
GraalVM native-image40 ms80 ms60 ms110 ms
Java 21 + SnapStart120 ms230 ms150 ms280 ms
Rust (musl static)3 ms6 ms4 ms8 ms
Go4 ms9 ms6 ms12 ms

Key observations:

  • Rust and Go are the floor. No JavaScript approach matches them on cold starts, and both handle Workload B without any language-level limitations.
  • Porffor matches Rust on Workload A but cannot run anything resembling a production handler. The 12x claim versus Node.js is directionally correct for trivial handlers.
  • LLRT is the practical JavaScript winner. 3x faster than Node.js cold starts, partial npm compatibility, and it handles Workload B. For JavaScript-committed teams, this is the pragmatic choice today.
  • Increasing memory from 128 MB to 1024 MB reduces cold starts 20–40% across all approaches. The effect is most pronounced for managed runtimes (Node.js, Java) where CPU-bound initialization benefits from the additional compute.
  • SnapStart halves Java cold starts but still leaves them an order of magnitude slower than native binaries. It is free, though, which makes it the obvious first step for any Java Lambda.

9. Memory Footprint Analysis

Cold-start latency gets the attention, but resident memory determines your minimum viable Lambda size and therefore your per-ms billing rate. A runtime that requires 100 MB of resident memory at idle cannot run in a 128 MB Lambda without OOM risk.

ApproachIdle RSSWorkload B RSSMin Viable Lambda
Node.js 2252 MB85 MB128 MB
LLRT18 MB32 MB128 MB
Porffor2 MBN/A128 MB
GraalVM native-image45 MB72 MB128 MB
Java 21 (JVM)120 MB180 MB256 MB
Rust4 MB8 MB128 MB
Go8 MB15 MB128 MB

Porffor's 2 MB idle RSS is remarkable — it reflects the absence of a runtime VM. Rust is similarly lean at 4 MB. In contrast, the JVM requires 120 MB just to exist, which is why Java Lambdas typically need at least 256 MB to avoid OOM under load.

Memory efficiency has a direct billing impact. If your Rust handler runs comfortably at 128 MB but your Node.js handler needs 256 MB, the Node.js function costs 2x per millisecond of execution — before accounting for the longer cold start. Over millions of invocations, this adds up.


10. Billing Model Analysis

Lambda bills on two axes: number of requests ($0.20 per million) and compute time (GB-seconds). The request charge is identical for all approaches. The compute charge depends on memory allocation and execution duration.

For a workload processing 1 million requests per month with a 50 ms average execution time (excluding cold-start init), 5% cold-start rate:

ApproachMemoryCold Start OverheadMonthly Costvs. Node.js
Node.js 22256 MB420 ms × 50K$3.68baseline
LLRT128 MB80 ms × 50K$1.48-60%
Rust128 MB6 ms × 50K$1.14-69%
Go128 MB9 ms × 50K$1.14-69%
GraalVM native-image128 MB95 ms × 50K$1.52-59%
Java + SnapStart256 MB220 ms × 50K$2.95-20%
Provisioned (10)256 MB0 ms$122.80+3238%

At 1 million requests per month, the billing difference between approaches is single-digit dollars. This is the uncomfortable truth about cold-start optimization from a pure cost perspective: it almost never pays for itself in Lambda billing. The value is in latency reduction, not cost reduction.

Provisioned Concurrency stands out as the cost outlier. At $122.80/month for 10 instances, it costs 33x more than the Rust approach while solving the same problem. It is the right choice only when you need contractually guaranteed latency (SLAs) and cannot change your runtime.

Medhurst's "2x cheaper" claim for Porffor versus Node.js is directionally correct but applies to a narrow scenario: the same memory allocation (128 MB) with a trivial handler where cold-start duration dominates total billed time. In realistic workloads where execution time dwarfs init time, the billing difference between any two approaches converges toward zero.


11. Decision Matrix

Choosing the right approach depends on your constraints, not on which benchmark number is smallest.

  1. You write Java and cannot rewrite: Enable SnapStart. It is free, requires no code changes, and cuts cold starts by 60–80%. If that is not enough, evaluate GraalVM native-image, accepting the build-time complexity and reflection configuration overhead.
  2. You write JavaScript and cannot rewrite: Deploy LLRT as your custom runtime. It cuts cold starts by 3x with minimal code changes. Watch Porffor's development — if it reaches 90%+ Test262 conformance and adds Node.js module support, it becomes a serious contender.
  3. You can choose your language: Write your Lambda in Rust. Cold starts of 3–8 ms, 4 MB memory footprint, full language capabilities, and the lowest possible compute cost. The cargo-lambda toolchain makes deployment straightforward.
  4. You need guaranteed zero cold starts: Provisioned Concurrency is the only approach that eliminates cold starts by definition. Budget $120+/month per function and right-size aggressively using Application Auto Scaling.
  5. You need fast cold starts but prioritize throughput: Use the standard JVM with SnapStart. The JIT compiler will optimize your hot paths in ways AOT compilation cannot. The cold-start penalty is moderate; the warm performance is superior.

The systems engineering perspective: Cold-start optimization is a latency problem, not a cost problem. At Lambda's billing granularity, the cost difference between approaches is negligible for most workloads. Choose based on P99 latency requirements and team capabilities, not on fractions of a cent per thousand invocations.


Sources & Further Reading


← cd ~