Performance Tips for C++ Developers (2026 Edition)
An updated, fact-checked guide to writing faster C++ on modern hardware and modern compilers. Every tip here is backed by how CPUs actually work — no myths, no cargo-culting.
The single most important rule: measure first. Every tip below can help or hurt depending on your workload. Use a profiler before changing anything. Use a benchmark after.
1. Memory Layout Is King
On modern hardware, a cache miss to main memory costs 100-300 CPU cycles. An L1 cache hit costs ~4 cycles. How your data is arranged in memory matters more than almost any other optimization.
Prefer contiguous data
```cpp
// Slow: pointer chasing, poor cache locality
std::list<Widget> widgets;

// Fast: contiguous memory, prefetcher-friendly
std::vector<Widget> widgets;
```
std::vector should be your default container. Even for insertions and deletions in the middle, it often beats std::list up to surprisingly large sizes because sequential memory access is that much faster than pointer chasing.
Think about data layout
When you iterate over objects but only touch a few fields, an array-of-structs layout wastes cache lines loading fields you never read.
```cpp
// Array of Structs: every iteration loads the full struct into cache
struct Particle {
    glm::vec3 position;
    glm::vec3 velocity;
    float mass;
    int id;
    std::string name;     // you never read this in the hot loop
    uint64_t created_at;  // or this
};
std::vector<Particle> particles;

// Struct of Arrays: only the data you need is in cache
struct Particles {
    std::vector<glm::vec3> positions;
    std::vector<glm::vec3> velocities;
    std::vector<float> masses;
    std::vector<int> ids;
    std::vector<std::string> names;
    std::vector<uint64_t> created_at;
};
```
If your hot loop only touches positions and velocities, the SoA layout means every cache line is full of data you actually use.
Align to cache lines
```cpp
// C++11 alignas
struct alignas(64) CacheAlignedData {
    std::array<float, 16> values;
};
```
False sharing occurs when two threads write to different variables that happen to share a cache line. Aligning per-thread data to 64-byte boundaries eliminates this.
2. Avoid Unnecessary Copies
C++ passes by value by default. For any type that owns heap memory — strings, vectors, maps — this means an allocation and a deep copy you probably didn't want.
Use const references for read-only parameters
```cpp
// Copies the entire string on every call
void process(std::string str);

// Zero-copy: just passes a pointer
void process(const std::string& str);
```
Use std::string_view for non-owning string access
```cpp
// Accepts string literals, std::string, or substrings — no allocation
void process(std::string_view sv);
```
std::string_view avoids copies and avoids the overhead of constructing a std::string from a literal. Prefer it over const std::string& for function parameters where you don't need ownership.
Use move semantics when transferring ownership
```cpp
// Sink parameter: the caller is done with this data
void store_name(std::string name) {
    m_name = std::move(name);
}

// Caller side:
store_name(std::move(temp_name));  // no copy, just pointer swap
```
Move semantics turn what would be an allocation + memcpy into a pointer swap. This is one of the biggest performance wins in modern C++.
Use std::span for non-owning array access (C++20)
```cpp
// Works with vector, array, C arrays — zero overhead
void process_data(std::span<const float> data);
```
3. Choose the Right Algorithm and Data Structure
No amount of micro-optimization saves a bad algorithm. An O(n) lookup called in a loop becomes O(n×m). Changing it to O(1) with a hash map dwarfs any other optimization you could make.
```cpp
// O(n) lookup per query
std::vector<std::pair<std::string, int>> lookup_table;
auto it = std::ranges::find_if(lookup_table,
    [&](auto& p) { return p.first == key; });

// O(1) amortized lookup per query
std::unordered_map<std::string, int> lookup_table;
auto it = lookup_table.find(key);
```
Know your containers:
| Container | Random Access | Insert/Delete | Lookup | Cache Friendly |
|---|---|---|---|---|
| std::vector | O(1) | O(n) | O(n) | Excellent |
| std::array | O(1) | N/A | O(n) | Excellent |
| std::unordered_map | N/A | O(1) avg | O(1) avg | Poor |
| std::map | N/A | O(log n) | O(log n) | Poor |
| std::flat_map (C++23) | N/A | O(n) | O(log n) | Excellent |
C++23's std::flat_map stores keys and values in sorted contiguous arrays, giving you ordered-map semantics with vector-like cache performance. For read-heavy workloads with infrequent inserts, it can dramatically outperform std::map.
4. Compile-Time Computation
Work that runs at compile time has zero runtime cost. Modern C++ has massively expanded what you can compute at compile time.
```cpp
// C++23: constexpr works with most of the standard library now
constexpr auto compute_table() {
    std::array<int, 256> table{};
    for (int i = 0; i < 256; ++i) {
        table[i] = (i * i) % 256;
    }
    return table;
}

// This entire table is baked into the binary at compile time
constexpr auto lookup = compute_table();
```
```cpp
// consteval guarantees compile-time execution — build error if it can't
consteval int factorial(int n) {
    int result = 1;
    for (int i = 2; i <= n; ++i) result *= i;
    return result;
}
static_assert(factorial(10) == 3628800);
```
Since C++20, constexpr supports dynamic memory allocation (within the evaluation), std::string, std::vector, and most <algorithm> functions; C++23 extends this to std::unique_ptr and more of the library. If a computation doesn't depend on runtime input, see if it can be constexpr.
5. Virtual Functions: The Nuanced Truth
Virtual calls go through a vtable — an indirect function pointer lookup. The cost of that indirection itself is small on modern CPUs with good branch predictors. The real cost is that the compiler cannot inline a virtual call, which blocks downstream optimizations like constant folding, dead code elimination, and loop optimization.
When it matters
```cpp
// Hot loop calling a virtual function millions of times:
// the lost inlining opportunity is significant
for (auto& shape : shapes) {
    shape->area();  // compiler can't inline this
}
```
Alternatives when performance is critical
```cpp
// Static polymorphism via CRTP — fully inlinable
template <typename Derived>
struct ShapeBase {
    float area() const {
        return static_cast<const Derived*>(this)->area_impl();
    }
};
```
```cpp
// std::variant — closed set of types, no heap allocation, fully inlinable
using Shape = std::variant<Circle, Rectangle, Triangle>;

float total = 0;
for (auto& shape : shapes) {
    total += std::visit([](auto& s) { return s.area(); }, shape);
}
```
When virtual functions are fine
If the call isn't in a hot loop, virtual dispatch is perfectly fast. Don't contort your design to avoid virtual functions in cold code paths. Profile first.
6. SIMD and Vectorization
Modern CPUs can process 4, 8, or 16 values in a single instruction using SIMD (Single Instruction, Multiple Data).
Help the auto-vectorizer
```cpp
// Vectorizable: simple loop, no dependencies between iterations
void scale(float* __restrict__ out, const float* __restrict__ in,
           float factor, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        out[i] = in[i] * factor;
    }
}
```
The __restrict__ qualifier (a GCC/Clang extension; MSVC spells it __restrict — C's restrict keyword is not part of standard C++) tells the compiler that out and in don't alias (point to overlapping memory), enabling vectorization that would otherwise be unsafe.
Check compiler output
Use Compiler Explorer (godbolt.org) to verify your loops are actually vectorized. Look for instructions like vmulps, vaddps (AVX) or fmla (ARM NEON) in the output.
7. Compiler Flags: What Actually Helps
The safe defaults
| Flag | Effect |
|---|---|
| -O2 | Good general optimization. The right default for production. |
| -flto | Link-time optimization. Inlines and optimizes across translation units. Meaningful wins, no downside except longer builds. |
| -march=native | Use all instruction sets your CPU supports. Only safe if the binary runs on the machine it was compiled on. |
Use with caution
| Flag | Caveat |
|---|---|
| -O3 | Enables aggressive vectorization and inlining that increases code size. Can cause instruction cache pressure. Benchmark against -O2 — it's not always faster. |
| -funroll-loops | Can help tight numerical loops. Can hurt everything else by bloating code. Apply per-file if benchmarks justify it. |
| -ffast-math | Breaks IEEE 754 compliance. Reorders floating-point ops, changes results. Never use in financial or scientific code. Useful for games and audio. |
Profile-guided optimization (PGO)
PGO is one of the most effective and underused compiler optimizations:
```sh
# Step 1: Build instrumented binary
g++ -O2 -fprofile-generate -o app_instrumented main.cpp

# Step 2: Run it with representative workload
./app_instrumented < typical_input.txt

# Step 3: Rebuild using the collected profile
g++ -O2 -fprofile-use -o app_optimized main.cpp
```
PGO gives the compiler real data about which branches are taken, which functions are hot, and how large loops typically are. On real workloads it commonly produces speedups in the 10-20% range.
8. Memory Allocation
Heap allocation (new, malloc) is expensive. Each call may involve a syscall, lock contention in multi-threaded programs, and fragmentation over time.
Use stack allocation when possible
```cpp
// Heap allocation: ~50-100ns
auto data = std::make_unique<std::array<float, 64>>();

// Stack allocation: ~0ns (just a stack pointer adjustment)
std::array<float, 64> data{};
```
Use arena/pool allocators for many small allocations
```cpp
// C++17 PMR (Polymorphic Memory Resource)
#include <memory_resource>

// 10KB stack buffer — all allocations come from here, no malloc calls
std::array<std::byte, 10240> buffer;
std::pmr::monotonic_buffer_resource pool{buffer.data(), buffer.size()};
std::pmr::vector<std::pmr::string> strings{&pool};

// These allocations are near-instant: just bump a pointer
for (int i = 0; i < 100; ++i) {
    strings.emplace_back("example string");
}
```
Reserve capacity for vectors
```cpp
std::vector<int> v;
v.reserve(1000);  // one allocation instead of ~10 reallocations
for (int i = 0; i < 1000; ++i) {
    v.push_back(i);
}
```
9. Branch Prediction and Branchless Code
Modern CPUs predict branch outcomes and speculatively execute the predicted path. Mispredictions cost ~15-20 cycles as the pipeline is flushed.
Help the predictor with hints (C++20)
```cpp
if (value > 0) [[likely]] {
    // fast path: compiler arranges code layout for this case
    process(value);
} else [[unlikely]] {
    // slow path: error handling, logging, etc.
    handle_error();
}
```
Go branchless in hot paths
```cpp
// Branchy: pipeline stall on unpredictable data
for (auto x : data) {
    if (x > threshold) sum += x;
}

// Branchless: no misprediction possible
for (auto x : data) {
    sum += x * (x > threshold);
}
```
This matters most when the branch is unpredictable (e.g., random data). If the branch is highly predictable (e.g., error checks that almost never fire), the branchy version is fine.
10. Profiling: Measure Everything
Never optimize without profiling. Your intuition about where time is spent is almost always wrong.
| Tool | Platform | Best For |
|---|---|---|
| perf | Linux | Sampling profiler, cache miss analysis, branch mispredictions |
| Intel VTune | Linux/Windows | Deep microarchitectural analysis |
| Tracy | Cross-platform | Frame-based profiling for real-time applications |
| Superluminal | Windows | Low-overhead sampling profiler |
| Valgrind | Linux | Instruction-level profiling (slow but precise) |
| Google Benchmark | Cross-platform | Microbenchmarking individual functions |
Microbenchmarking with Google Benchmark
```cpp
#include <benchmark/benchmark.h>
#include <vector>

static void BM_VectorPush(benchmark::State& state) {
    for (auto _ : state) {
        std::vector<int> v;
        for (int i = 0; i < 1000; ++i) v.push_back(i);
        benchmark::DoNotOptimize(v);
    }
}
BENCHMARK(BM_VectorPush);

static void BM_VectorReserved(benchmark::State& state) {
    for (auto _ : state) {
        std::vector<int> v;
        v.reserve(1000);
        for (int i = 0; i < 1000; ++i) v.push_back(i);
        benchmark::DoNotOptimize(v);
    }
}
BENCHMARK(BM_VectorReserved);

BENCHMARK_MAIN();
```
Always use benchmark::DoNotOptimize to prevent the compiler from eliding the work you're trying to measure.
11. Modern C++ Features That Help Performance
std::expected (C++23) over exceptions for expected failure paths
```cpp
// Exceptions are zero-cost on the happy path but extremely
// expensive when thrown. For errors that happen regularly,
// use std::expected instead.
std::expected<Config, ParseError> parse_config(std::string_view input);

auto result = parse_config(raw);
if (!result) {
    log_error(result.error());
    return default_config();
}
use_config(*result);
```
Ranges and views for lazy evaluation (C++20/23)
```cpp
// No intermediate allocations — everything is lazily evaluated
auto result = data
    | std::views::filter([](int x) { return x > 0; })
    | std::views::transform([](int x) { return x * x; })
    | std::views::take(100);
```
std::mdspan for multidimensional array access (C++23)
```cpp
// Zero-overhead multidimensional view over contiguous data
std::vector<float> raw(rows * cols);
std::mdspan matrix(raw.data(), rows, cols);
float val = matrix[i, j];  // compiles to a single multiply + add
```
Summary
In rough order of impact:
- Pick the right algorithm and data structure — nothing else matters if this is wrong
- Fix your memory layout — cache misses dominate on modern hardware
- Avoid unnecessary copies — use moves, views, and references
- Reduce allocations — reserve, use arenas, prefer the stack
- Enable LTO and PGO — free performance from the compiler
- Help the auto-vectorizer — simple loops, __restrict__, check the output
- Profile before and after — intuition is unreliable; data is not
Don't optimize cold code. Don't add complexity for theoretical gains. Measure, change, measure again.