
Performance Tips for C++ Developers (2026 Edition)

March 14, 2026 · Eric Cox

An updated, fact-checked guide to writing faster C++ on modern hardware and modern compilers. Every tip here is backed by how CPUs actually work — no myths, no cargo-culting.

The single most important rule: measure first. Every tip below can help or hurt depending on your workload. Use a profiler before changing anything. Use a benchmark after.


1. Memory Layout Is King

On modern hardware, a cache miss to main memory costs 100-300 CPU cycles. An L1 cache hit costs ~4 cycles. How your data is arranged in memory matters more than almost any other optimization.

Prefer contiguous data

// Slow: pointer chasing, poor cache locality
std::list<Widget> widgets;

// Fast: contiguous memory, prefetcher-friendly
std::vector<Widget> widgets;

std::vector should be your default container. Even for insertions and deletions in the middle, it often beats std::list up to surprisingly large sizes because sequential memory access is that much faster than pointer chasing.

Think about data layout

When you iterate over objects but only touch a few fields, an array-of-structs layout wastes cache lines loading fields you never read.

// Array of Structs: every iteration loads the full struct into cache
struct Particle {
    glm::vec3 position;
    glm::vec3 velocity;
    float mass;
    int id;
    std::string name;      // you never read this in the hot loop
    uint64_t created_at;  // or this
};
std::vector<Particle> particles;

// Struct of Arrays: only the data you need is in cache
struct Particles {
    std::vector<glm::vec3> positions;
    std::vector<glm::vec3> velocities;
    std::vector<float> masses;
    std::vector<int> ids;
    std::vector<std::string> names;
    std::vector<uint64_t> created_at;
};

If your hot loop only touches positions and velocities, the SoA layout means every cache line is full of data you actually use.
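
A sketch of that hot loop over the SoA layout above (integrate and dt are hypothetical names, not from any library):

void integrate(Particles& p, float dt) {
    // Only positions and velocities stream through the cache;
    // names, ids, and timestamps never get loaded.
    for (std::size_t i = 0; i < p.positions.size(); ++i) {
        p.positions[i] += p.velocities[i] * dt;
    }
}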

Align to cache lines

// C++11 alignas
struct alignas(64) CacheAlignedData {
    std::array<float, 16> values;
};

False sharing occurs when two threads write to different variables that happen to share a cache line. Aligning per-thread data to 64-byte boundaries eliminates this.
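
A minimal sketch of the per-thread case, assuming one worker thread per slot — without alignas(64), adjacent counters could share a cache line and every increment would invalidate the other thread's copy:

struct alignas(64) PerThreadCounter {
    std::uint64_t value = 0;   // each counter gets its own cache line
};

std::array<PerThreadCounter, 4> counters;
// thread i only ever writes counters[i].value — no false sharing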


2. Avoid Unnecessary Copies

C++ passes by value by default. For heap-owning types like std::string or std::vector, every copy means an allocation plus a memcpy you probably didn't want.

Use const references for read-only parameters

// Copies the entire string on every call
void process(std::string str);

// Zero-copy: just passes a pointer
void process(const std::string& str);

Use std::string_view for non-owning string access

// Accepts string literals, std::string, or substrings — no allocation
void process(std::string_view sv);

std::string_view avoids copies and avoids the overhead of constructing a std::string from a literal. Prefer it over const std::string& for function parameters where you don't need ownership.
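
Call-site sketch for the declaration above — none of these allocate:

std::string owned = "hello world";
process("a literal");                            // no temporary std::string built
process(owned);                                  // views the existing buffer
process(std::string_view{owned}.substr(0, 5));   // a substring, still no copy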

Use move semantics when transferring ownership

// Sink parameter: the caller is done with this data
void store_name(std::string name) {
    m_name = std::move(name);
}

// Caller side:
store_name(std::move(temp_name));  // no copy, just pointer swap

Move semantics turn what would be an allocation + memcpy into a pointer swap. This is one of the biggest performance wins in modern C++.

Use std::span for non-owning array access (C++20)

// Works with vector, array, C arrays — zero overhead
void process_data(std::span<const float> data);
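
Call-site sketch for process_data as declared above — vector, std::array, and plain C arrays all convert without copying:

std::vector<float> v = {1.0f, 2.0f, 3.0f};
std::array<float, 3> a = {1.0f, 2.0f, 3.0f};
float c[3] = {1.0f, 2.0f, 3.0f};

process_data(v);
process_data(a);
process_data(c);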

3. Choose the Right Algorithm and Data Structure

No amount of micro-optimization saves a bad algorithm. An O(n) lookup called in a loop becomes O(n×m). Changing it to O(1) with a hash map dwarfs any other optimization you could make.

// O(n) lookup per query
std::vector<std::pair<std::string, int>> lookup_table;
auto it = std::ranges::find_if(lookup_table, [&](auto& p) {
    return p.first == key;
});

// O(1) amortized lookup per query
std::unordered_map<std::string, int> lookup_table;
auto it = lookup_table.find(key);

Know your containers:

Container               Random Access   Insert/Delete   Lookup       Cache Friendly
std::vector             O(1)            O(n)            O(n)         Excellent
std::array              O(1)            N/A             O(n)         Excellent
std::unordered_map      N/A             O(1) avg        O(1) avg     Poor
std::map                N/A             O(log n)        O(log n)     Poor
std::flat_map (C++23)   N/A             O(n)            O(log n)     Excellent

C++23's std::flat_map stores keys and values in sorted contiguous arrays, giving you ordered-map semantics with vector-like cache performance. For read-heavy workloads with infrequent inserts, it can dramatically outperform std::map.
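
A minimal sketch, assuming your standard library already ships <flat_map> (support is still rolling out):

#include <flat_map>
#include <string>

std::flat_map<std::string, int> prices{
    {"apple", 3}, {"banana", 1}, {"cherry", 5}
};

// Lookup is a binary search over a contiguous, sorted key array
if (auto it = prices.find("banana"); it != prices.end()) {
    int p = it->second;
}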


4. Compile-Time Computation

Work that runs at compile time has zero runtime cost. Modern C++ has massively expanded what you can compute at compile time.

// C++23: constexpr works with most of the standard library now
constexpr auto compute_table() {
    std::array<int, 256> table{};
    for (int i = 0; i < 256; ++i) {
        table[i] = (i * i) % 256;
    }
    return table;
}

// This entire table is baked into the binary at compile time
constexpr auto lookup = compute_table();

// consteval guarantees compile-time execution — build error if it can't
consteval int factorial(int n) {
    int result = 1;
    for (int i = 2; i <= n; ++i) result *= i;
    return result;
}

static_assert(factorial(10) == 3628800);

Constexpr dynamic allocation (within the evaluation), std::string, std::vector, and most <algorithm> functions arrived in C++20; C++23 extends this to std::unique_ptr and more of the standard library. If a computation doesn't depend on runtime input, see if it can be constexpr.
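
For example, a constexpr function can build a std::vector internally, as long as the allocation doesn't outlive the evaluation (a quick sketch, C++20 or later):

#include <vector>

constexpr int sum_of_squares(int n) {
    std::vector<int> squares;                 // constexpr allocation
    for (int i = 1; i <= n; ++i) squares.push_back(i * i);
    int total = 0;
    for (int s : squares) total += s;
    return total;                             // freed before evaluation ends
}
static_assert(sum_of_squares(3) == 14);       // 1 + 4 + 9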


5. Virtual Functions: The Nuanced Truth

Virtual calls go through a vtable — an indirect function pointer lookup. The cost of that indirection itself is small on modern CPUs with good branch predictors. The real cost is that the compiler cannot inline a virtual call, which blocks downstream optimizations like constant folding, dead code elimination, and loop optimization.

When it matters

// Hot loop calling a virtual function millions of times:
// the lost inlining opportunity is significant
for (auto& shape : shapes) {
    shape->area();  // compiler can't inline this
}

Alternatives when performance is critical

// Static polymorphism via CRTP — fully inlinable
template<typename Derived>
struct ShapeBase {
    float area() const {
        return static_cast<const Derived*>(this)->area_impl();
    }
};
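
// A hypothetical concrete type for the CRTP base above — the call to
// area() resolves statically and can be fully inlined
struct Circle : ShapeBase<Circle> {
    float radius = 1.0f;
    float area_impl() const { return 3.14159265f * radius * radius; }
};
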
// std::variant — closed set of types, no heap allocation, fully inlinable
using Shape = std::variant<Circle, Rectangle, Triangle>;

float total = 0;
for (auto& shape : shapes) {
    total += std::visit([](auto& s) { return s.area(); }, shape);
}

When virtual functions are fine

If the call isn't in a hot loop, virtual dispatch is perfectly fast. Don't contort your design to avoid virtual functions in cold code paths. Profile first.


6. SIMD and Vectorization

Modern CPUs can process 4, 8, or 16 values in a single instruction using SIMD (Single Instruction, Multiple Data).

Help the auto-vectorizer

// Vectorizable: simple loop, no dependencies between iterations
void scale(float* __restrict__ out, const float* __restrict__ in,
          float factor, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        out[i] = in[i] * factor;
    }
}

__restrict__ (a GCC/Clang extension; MSVC spells it __restrict) promises the compiler that out and in don't alias (point to overlapping memory), enabling vectorization that would otherwise be unsafe.
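
If the auto-vectorizer still refuses, OpenMP's simd pragma asks for vectorization explicitly — a sketch assuming GCC or Clang with -fopenmp-simd (which honors the pragma without pulling in the OpenMP runtime):

float sum = 0.0f;
#pragma omp simd reduction(+:sum)
for (size_t i = 0; i < n; ++i) {
    sum += in[i] * in[i];
}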

Check compiler output

Use Compiler Explorer (godbolt.org) to verify your loops are actually vectorized. Look for instructions like vmulps, vaddps (AVX) or fmla (ARM NEON) in the output.


7. Compiler Flags: What Actually Helps

The safe defaults

Flag            Effect
-O2             Good general optimization. The right default for production.
-flto           Link-time optimization. Inlines and optimizes across translation units. Meaningful wins, no downside except longer builds.
-march=native   Use all instruction sets your CPU supports. Only safe if the binary runs on the machine it was compiled on.

Use with caution

Flag             Caveat
-O3              Enables aggressive vectorization and inlining that increases code size. Can cause instruction cache pressure. Benchmark against -O2 — it's not always faster.
-funroll-loops   Can help tight numerical loops. Can hurt everything else by bloating code. Apply per-file if benchmarks justify it.
-ffast-math      Breaks IEEE 754 compliance. Reorders floating-point ops, changes results. Never use in financial or scientific code. Useful for games and audio.

Profile-guided optimization (PGO)

PGO is one of the most effective and underused compiler optimizations:

# Step 1: Build instrumented binary
g++ -O2 -fprofile-generate -o app_instrumented main.cpp

# Step 2: Run it with representative workload
./app_instrumented < typical_input.txt

# Step 3: Rebuild using the collected profile
g++ -O2 -fprofile-use -o app_optimized main.cpp

PGO gives the compiler real data about which branches are taken, which functions are hot, and how large loops typically are. It often produces 10-20% speedups on real workloads.


8. Memory Allocation

Heap allocation (new, malloc) is expensive. Each call may involve a syscall, lock contention in multi-threaded programs, and fragmentation over time.

Use stack allocation when possible

// Heap allocation: ~50-100ns
auto data = std::make_unique<std::array<float, 64>>();

// Stack allocation: ~0ns (just a stack pointer adjustment)
std::array<float, 64> data{};

Use arena/pool allocators for many small allocations

// C++17 PMR (Polymorphic Memory Resource)
#include <array>
#include <memory_resource>
#include <string>
#include <vector>

// 10KB stack buffer — all allocations come from here, no malloc calls
std::array<std::byte, 10240> buffer;
std::pmr::monotonic_buffer_resource pool{buffer.data(), buffer.size()};
std::pmr::vector<std::pmr::string> strings{&pool};

// These allocations are near-instant: just bump a pointer
for (int i = 0; i < 100; ++i) {
    strings.emplace_back("example string");
}

Reserve capacity for vectors

std::vector<int> v;
v.reserve(1000);  // one allocation instead of ~10 reallocations
for (int i = 0; i < 1000; ++i) {
    v.push_back(i);
}

9. Branch Prediction and Branchless Code

Modern CPUs predict branch outcomes and speculatively execute the predicted path. Mispredictions cost ~15-20 cycles as the pipeline is flushed.

Help the predictor with hints (C++20)

if (value > 0) [[likely]] {
    // fast path: compiler arranges code layout for this case
    process(value);
} else [[unlikely]] {
    // slow path: error handling, logging, etc.
    handle_error();
}

Go branchless in hot paths

// Branchy: pipeline stall on unpredictable data
for (auto x : data) {
    if (x > threshold) sum += x;
}

// Branchless: no misprediction possible
for (auto x : data) {
    sum += x * (x > threshold);
}

This matters most when the branch is unpredictable (e.g., random data). If the branch is highly predictable (e.g., error checks that almost never fire), the branchy version is fine.


10. Profiling: Measure Everything

Never optimize without profiling. Your intuition about where time is spent is almost always wrong.

Tool               Platform         Best For
perf               Linux            Sampling profiler, cache miss analysis, branch mispredictions
Intel VTune        Linux/Windows    Deep microarchitectural analysis
Tracy              Cross-platform   Frame-based profiling for real-time applications
Superluminal       Windows          Low-overhead sampling profiler
Valgrind           Linux            Instruction-level profiling (slow but precise)
Google Benchmark   Cross-platform   Microbenchmarking individual functions

Microbenchmarking with Google Benchmark

#include <benchmark/benchmark.h>
#include <vector>

static void BM_VectorPush(benchmark::State& state) {
    for (auto _ : state) {
        std::vector<int> v;
        for (int i = 0; i < 1000; ++i)
            v.push_back(i);
        benchmark::DoNotOptimize(v);
    }
}
BENCHMARK(BM_VectorPush);

static void BM_VectorReserved(benchmark::State& state) {
    for (auto _ : state) {
        std::vector<int> v;
        v.reserve(1000);
        for (int i = 0; i < 1000; ++i)
            v.push_back(i);
        benchmark::DoNotOptimize(v);
    }
}
BENCHMARK(BM_VectorReserved);

BENCHMARK_MAIN();

Always use benchmark::DoNotOptimize to prevent the compiler from eliding the work you're trying to measure.


11. Modern C++ Features That Help Performance

std::expected (C++23) over exceptions for expected failure paths

// Exceptions are zero-cost on the happy path but extremely
// expensive when thrown. For errors that happen regularly,
// use std::expected instead.
std::expected<Config, ParseError> parse_config(std::string_view input);

auto result = parse_config(raw);
if (!result) {
    log_error(result.error());
    return default_config();
}
use_config(*result);

Ranges and views for lazy evaluation (C++20/23)

// No intermediate allocations — everything is lazily evaluated
auto result = data
    | std::views::filter([](int x) { return x > 0; })
    | std::views::transform([](int x) { return x * x; })
    | std::views::take(100);

std::mdspan for multidimensional array access (C++23)

// Zero-overhead multidimensional view over contiguous data
std::vector<float> raw(rows * cols);
std::mdspan matrix(raw.data(), rows, cols);
float val = matrix[i, j];  // compiles to a single multiply + add

Summary

In rough order of impact:

  1. Pick the right algorithm and data structure — nothing else matters if this is wrong
  2. Fix your memory layout — cache misses dominate on modern hardware
  3. Avoid unnecessary copies — use moves, views, and references
  4. Reduce allocations — reserve, use arenas, prefer the stack
  5. Enable LTO and PGO — free performance from the compiler
  6. Help the auto-vectorizer — simple loops, __restrict__, check the output
  7. Profile before and after — intuition is unreliable; data is not

Don't optimize cold code. Don't add complexity for theoretical gains. Measure, change, measure again.

