Compiler Engineering Notes

Compiler (ARM + ML deployment focus)

This page is focused on practical compiler knowledge used in performance work: AArch64 code generation, vectorization and memory behavior, and ONNX to TensorRT deployment. It drops toy labs and keeps the material in the shape engineers actually debug.

  • CPU ISA + ABI: AArch64 + AAPCS64
  • Vector paths: NEON (128b) and SVE (VLA + predicates)
  • Middle-end core: SSA IR, LICM, DCE
  • ML runtime path: ONNX -> TensorRT engine

ARM foundations that affect correctness and speed

If function boundaries are wrong, the code can compile and still fail in production. If loop locality is wrong, vector code can still be slow.

AArch64 ISA and AAPCS64 ABI

AArch64 is the Arm 64-bit instruction set. AAPCS64 is the calling convention contract. The backend and linker both depend on this contract being correct.

  • Integer args: first 8 in x0 to x7.
  • FP/SIMD args: first 8 in v0 to v7.
  • Return values: typically x0 or v0.
  • Stack rule: stack pointer must be 16-byte aligned at call boundaries.
// AAPCS64-sensitive boundary
extern long dot(const float* a, const float* b, long n);
// caller/callee must agree on register + stack contract

NEON/SVE vector paths + cache hierarchy

NEON uses fixed 128-bit vectors. SVE is vector-length agnostic and predicate-driven, so one loop structure can scale across different SVE widths.

Vectorization alone is not enough. A loop can be vectorized and still be memory-bound if it thrashes caches.

  • NEON: deterministic lanes and simpler unroll planning.
  • SVE: predicated tail handling without scalar cleanup loops.
  • Cache-aware transforms: tile loops so the active working set stays in L1/L2.
  • NEON path: fixed 128-bit lanes; explicit lane count and a simple tail strategy.
  • SVE path: vector-length agnostic; predicated loop body and tail share one form.
  • Cache path (what decides runtime): L1 -> L2 -> LLC -> DRAM; tiling and stride shape decide hit rate and bandwidth pressure.

Rule of thumb: keep hot tiles in L1/L2 first, then worry about wider vectors.

SSA IR, LICM, and DCE

SSA means each temporary is assigned once. That one rule makes data flow explicit, which is why most optimizer pipelines are built around SSA-form IR.

SSA in one minute

# non-SSA
x = a + b
x = x * c
return x

# SSA
%t0 = add a, b
%t1 = mul %t0, c
ret %t1

At control-flow joins, SSA uses a phi node to pick the incoming value from each predecessor block.

Pass behavior (concise)

  • LICM: hoist loop-invariant computations out of loops to reduce repeated work.
  • DCE: remove instructions whose results are never used and have no side effects.
  • Kernel fusion (ML): combine adjacent ops (for example Conv+BN+ReLU) to reduce memory traffic and launches.
Input IR:

loop:
  %k = mul n, 4        ; loop-invariant, recomputed every iteration
  %v = load A[i]
  %u = add %v, c
  %dead = add 0, 0

The invariant %k sits in the hot path by placement, and %dead is unused.

After LICM:

preheader:
  %k = mul n, 4
loop:
  %v = load A[i]
  %u = add %v, c
  %dead = add 0, 0

The invariant multiply moved out of the loop.

After DCE:

preheader:
  %k = mul n, 4
loop:
  %v = load A[i]
  %u = add %v, c

The unused computation is removed. The final loop body is smaller and easier to schedule.

ML compilation path: ONNX to TensorRT

In deployment-heavy teams, the operational chain is usually: export model to ONNX, build TensorRT engine for a target GPU, benchmark, and iterate on precision and fusion.
Training framework (PyTorch / TF / JAX)
  -> ONNX graph (the exchange boundary)
  -> TensorRT builder (kernel tactics + fusion)
  -> Engine plan (target-specific binary)
  -> Runtime (execute; latency/throughput profiling)

MLIR

MLIR is a multi-level compiler framework with dialects for different abstraction levels. It is often used to express graph-level transforms and lowering in one pipeline.

XLA / HLO

XLA compiles TensorFlow and JAX programs through HLO or StableHLO. Fusion and layout assignment are central because they reduce memory movement between ops.

ONNX + TensorRT

ONNX is the interchange layer. TensorRT parses the graph, picks tactics, applies legal fusions, and emits a serialized engine for the exact target hardware.

# common deployment path
python export.py --format onnx --out model.onnx
trtexec --onnx=model.onnx --saveEngine=model.plan --fp16
trtexec --loadEngine=model.plan --shapes=input:1x3x224x224

GPU terms that still matter in ML deployment

  • NVVM: NVIDIA's LLVM-based device compiler IR/toolchain layer that feeds PTX generation for CUDA device code. Where you care: compiler diagnostics and backend behavior.
  • PTX: virtual ISA for NVIDIA GPUs, not final machine code. Where you care: inspecting generated code, portability, and compile paths.
  • SASS: final GPU machine instructions for a specific architecture (for example, one SM generation). Where you care: low-level profiling and micro-optimization.
  • fatbin: container bundling multiple compiled GPU targets (cubin/SASS) and often a PTX fallback. Where you care: binary distribution across mixed GPU fleets.
  • Warp scheduling: SM warp schedulers select ready warps each cycle, hiding latency by switching warps on stalls. Where you care: occupancy, stall reasons, and achieved throughput.
  • HBM: High Bandwidth Memory, stacked DRAM with very high aggregate bandwidth. Where you care: memory-bound kernels and batching strategy decisions.
  • Kernel fusion: merging adjacent operations into fewer kernels to reduce memory round-trips and launch overhead. Where you care: TensorRT and XLA graph optimization outcomes.

GCC/G++ flags: why they help and what they do to AArch64 assembly

If optimization flags feel like magic, inspect assembly. Every speedup claim here ties back to code shape changes you can see in the generated instructions.

Mental model: flag enables compiler passes, passes change IR and layout, and that rewrite becomes different assembly and different runtime behavior.

Flag to pass to assembly to runtime

Compile flags (-O3, -flto, -mcpu, ...)
  -> IR passes (inlining, vectorizer, LICM, DCE)
  -> AArch64 assembly (calls, loads/stores, branches, SIMD)
  -> Runtime behavior (cycles, cache misses, branch stalls)

Baseline kernel used in the examples

void saxpy(float* y, const float* x, float a, int n) {
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

We keep the same loop and only change flags. That way, assembly diffs show exactly what each option buys you.

1) -O levels: debug shape vs throughput shape

-O0 (debug shape)

Why it helps: keeps source-to-assembly mapping simple for stepping and breakpoints.

Assembly translation: more stack traffic and fewer register optimizations.

stp x29, x30, [sp, #-48]!
mov x29, sp
...
ldr s0, [x1, x3, lsl #2]
str s0, [sp, #28]      // spill
ldr s0, [sp, #28]      // reload
bl helper              // call often remains

When it backfires: numbers from this build do not represent production speed.

-O2 (balanced release baseline)

Why it helps: enables strong scalar optimizations without extreme code growth.

Assembly translation: fewer spills, tighter loop body, better scheduling.

mov x3, #0
.Lloop:
ldr s0, [x1, x3, lsl #2]
ldr s1, [x0, x3, lsl #2]
fmadd s1, s0, s2, s1
str s1, [x0, x3, lsl #2]
add x3, x3, #1
cmp x3, x2
b.ne .Lloop

When it backfires: hidden UB can appear only in optimized builds.

-O3 (more aggressive loop transforms)

Why it helps: pushes harder on vectorization/unrolling for compute-heavy hot loops.

Assembly translation: often emits NEON loop body plus scalar tail.

dup v2.4s, v0.s[0]     // broadcast a (passed in v0 per AAPCS64)
.Lvec:
ld1 {v0.4s}, [x1], #16
ld1 {v1.4s}, [x0], #16
fmla v1.4s, v0.4s, v2.4s
st1 {v1.4s}, [x0], #16
subs x2, x2, #4
b.gt .Lvec

When it backfires: bigger code can hurt I-cache and reduce gains on branchy code.

-Ofast / -ffast-math

Why it helps: allows algebraic rewrites that strict IEEE mode blocks.

Assembly translation: more fusion and reassociation opportunities.

// strict math
fmul s0, s1, s2
fadd s0, s0, s3

// fast-math path
fmadd s0, s1, s2, s3

When it backfires: NaN/infinity/rounding behavior can change. Use only if numerics allow it.

2) Cross-file and profile-guided optimizations

-flto (link-time optimization)

Why it helps: compiler sees across translation units and can inline across file boundaries.

Assembly translation: direct call can disappear and turn into straight-line ops.

// no LTO
bl _Z12update_blockPfPKf

// with LTO
ld1 {v0.4s}, [x1]
ld1 {v1.4s}, [x0]
fmla v1.4s, v0.4s, v2.4s
st1 {v1.4s}, [x0]

When it backfires: longer links and larger binaries on huge codebases.

-fprofile-generate / -fprofile-use (PGO)

Why it helps: hot path data drives block ordering and branch prediction friendly layout.

Assembly translation: hot path tends to become fall-through; cold blocks move away.

cbz w0, .Lcold
// hot path falls through
...
b .Ldone
.Lcold:
...

When it backfires: stale or unrepresentative training runs can produce slower code.

3) -mcpu, -march, -mtune on ARM

What each flag controls

  • -march: ISA features allowed (for example SVE availability).
  • -mcpu: target core plus tuning defaults; most practical single switch for fixed hardware.
  • -mtune: scheduling model without changing ISA baseline.

Assembly translation

If SVE is enabled in -march, GCC can emit predicate-driven vector loops:

mov x3, #0
whilelt p0.s, x3, x2
.Lsveloop:
ld1w {z0.s}, p0/z, [x1, x3, lsl #2]
ld1w {z1.s}, p0/z, [x0, x3, lsl #2]
fmla z1.s, p0/m, z0.s, z2.s
st1w {z1.s}, p0, [x0, x3, lsl #2]
incw x3
whilelt p0.s, x3, x2
b.any .Lsveloop

Avoid -march=native for distributable binaries; it locks output to build host features.

4) Debuggability flag that affects code shape

-fno-omit-frame-pointer

Why it helps: stable unwind chains for profilers and postmortem stacks.

stp x29, x30, [sp, #-32]!
mov x29, sp
...
ldp x29, x30, [sp], #32
ret

What you trade away

Assembly translation: one extra reserved register and extra prologue/epilogue work.

When it backfires: tiny runtime cost in very hot functions, but often worth it in production diagnostics.

5) Inspect it yourself (recommended workflow)

# emit assembly with comments
g++ -O2 -mcpu=cortex-a72 -fverbose-asm -S saxpy.cpp -o saxpy_O2.s
g++ -O3 -mcpu=cortex-a72 -fverbose-asm -S saxpy.cpp -o saxpy_O3.s

# inspect final machine code
g++ -O3 -mcpu=cortex-a72 saxpy.cpp -o saxpy
objdump -d -M no-aliases saxpy | less

# vectorizer report (what loop transformed and why)
g++ -O3 -mcpu=cortex-a72 -fopt-info-vec-optimized -fopt-info-vec-missed saxpy.cpp -c

# optional: dump optimizer decisions on SSA/IR pipeline
g++ -O3 -fdump-tree-all saxpy.cpp -c

Practical optimization checklist

  1. Validate ABI/calling convention assumptions before speed work. Wrong ABI can look like random runtime corruption.
  2. Measure memory traffic first. If DRAM dominates, vector width changes alone will not move the needle much.
  3. Check SSA-level opportunities: invariants, dead values, and redundant loads.
  4. For ML deployment, treat ONNX export quality and TensorRT tactic/fusion reports as first-class debugging inputs.
  5. Only dive into PTX/SASS when profiler data shows kernel-level bottlenecks that higher-level fusion/layout changes cannot solve.