Compiler

Compiler

Two tracks at equal depth. Host compilation for AArch64 codegen: loop passes, vectorization, GCC flag behavior. ML compilation for the path from a PyTorch CNN or LLM to a deployable artifact on a generic edge SoC.

Track 1C/C++ compilation optimization
Host Compiler CoreGCC pass effects on AArch64
Track 2ML compilation for edge SoCs
ML Compiler PathCNN + LLM → PyTorch → edge engine

C/C++ Compilation Optimization

Vectorization, loop passes, pointer aliasing, then GCC flags.

Writing code the auto-vectorizer can use

NEON and SVE prose covers what the hardware can do. This section covers what the source code has to look like for -O3 to actually emit a vectorized loop.

  • restrict on every pointer the loop touches. Removes the may-alias gate. See the aliasing section for the exact contract.
  • Alignment hints. alignas(16) or alignas(32) on the buffer, plus __builtin_assume_aligned(p, 16) at the loop entry. Lets the compiler pick aligned ld1 over unaligned ldur.
  • Vectorization pragmas. #pragma omp simd needs -fopenmp-simd (or -fopenmp) on GCC and Clang to take effect; without the flag the pragma is silently ignored. #pragma clang loop vectorize(enable) works without flags on Clang.
  • Contiguous unit-stride access. Strided or gather patterns force scalar fallback or expensive gather instructions.
  • Trip-count hints. Peel by a multiple of the vector width with if (n % 8 == 0) or __builtin_assume(n % 8 == 0) so the vectorizer skips the scalar tail.
  • What blocks vectorization: function calls inside the loop (unless the callee is __attribute__((const))), early break/return, dependent reductions written without reduction(+:x).

Before: scalar emit

C input:

void axpy(float *a, float *b, float *c,
          float k, int n) {
    for (int i = 0; i < n; ++i)
        c[i] = a[i] * k + b[i];
}

Three plain pointers, no alignment guarantee, no pragma. The compiler assumes possible aliasing and emits scalar fmadd per iteration.

After: 4-lane SIMD emit

C input:

void axpy(float * restrict a,
          float * restrict b,
          float * restrict c,
          float k, int n) {
    a = __builtin_assume_aligned(a, 16);
    b = __builtin_assume_aligned(b, 16);
    c = __builtin_assume_aligned(c, 16);
    #pragma omp simd
    for (int i = 0; i < n; ++i)
        c[i] = a[i] * k + b[i];
}

k arrives in s0 per AAPCS64. The compiler broadcasts it once with dup v3.4s, v0.s[0], then the body runs ld1 {v0.4s}, ld1 {v1.4s}, fmla v1.4s, v0.4s, v3.4s, st1 {v1.4s} over four float lanes per iteration, with a peeled scalar tail for the remainder.

C/C++ loop optimization (LICM + DCE)

Focus on which loop instructions disappeared and why.

Invariant code motion in plain terms

If a value doesn't change inside a loop, compute it once before the loop. Saves instructions every iteration and opens scheduling room.

// before LICM
for (int i = 0; i < n; ++i) {
    int stride = cols * 4;   // invariant but recomputed
    dst[i] = src[i * stride] + bias;
}

// after LICM
int stride = cols * 4;
for (int i = 0; i < n; ++i) {
    dst[i] = src[i * stride] + bias;
}

On memory-bound loops it's no miracle, but it removes useless ALU work and shrinks the hot loop body.

DCE in practical terms

Dead Code Elimination removes computations whose results are never used and have no side effects.

  • What is removed: unused temporaries, redundant arithmetic chains, dead stores.
  • Why it helps: smaller loop body, less scheduler pressure, clearer vectorization opportunities.
  • Developer signal: cleaner post-pass IR usually means cleaner assembly too.
// dead path example
int t0 = x * 4;
int t1 = t0 + 1;   // removed if never consumed
sum += x;
Stage IR snapshot What changed
Input IR %k = mul n, 4
loop: %v = load A[i]
%u = add %v, c
%dead = add 0, 0
Invariant %k sits in the hot loop and %dead has no consumer.
After LICM preheader: %k = mul n, 4
loop: %v = load A[i]
%u = add %v, c
%dead = add 0, 0
Invariant multiply moved out of loop. The multiply now runs once instead of trip_count times.
After DCE preheader: %k = mul n, 4
loop: %v = load A[i]
%u = add %v, c
Unused computation removed. Final loop body is smaller, so schedulers and vectorizers have less clutter.
LICM shape change Before loop: k = cols * 4 v = load A[i * k] out = v + bias After preheader: k = cols * 4 loop: v = load A[i * k] out = v + bias one fewer ALU op each trip

The preheader runs once before the loop. LICM hoists invariant math there so the loop body stops repeating it.

DCE shape change Before t0 = x * 4 t1 = t0 + 1 (dead) sum += x extra useless instruction After t0 = x * 4 sum += x dead instruction removed smaller loop body

A dead instruction burns cycles without changing the output. After DCE the loop body is tighter and the scheduler has more room.

Pointer aliasing and restrict

When two pointer parameters might overlap, the compiler reloads after every store and cannot vectorize. restrict (C99) and __restrict__ (GCC and Clang C++) tell the compiler that this pointer is the only path the function uses to reach that memory. Loads can then stay in registers across stores, hoist out of loops, and fold into SIMD lanes. Violating the contract is undefined behavior, and the compiler does not check.

Without restrict

C input:

void add(int *a,
         int *b,
         int *c, int n) {
    for (int i = 0; i < n; ++i)
        a[i] = b[i] + c[i];
}

The compiler must assume a may alias b or c. A store to a[i] could change a future b or c read, so loads cannot be hoisted, kept in registers, or fused into a SIMD load. Auto-vectorization gives up.

With restrict

C input:

void add(int * restrict a,
         const int * restrict b,
         const int * restrict c, int n) {
    for (int i = 0; i < n; ++i)
        a[i] = b[i] + c[i];
}

b and c loads fold into a single SIMD ld1 per iteration block, the store does not invalidate the register cache, and the loop emits four-lane add + st1 on AArch64.

Same rule applies in C++ on hot loops over arrays of POD: mark every pointer the loop reads or writes with __restrict__. The optimizer cannot infer it from const alone.

GCC/G++ flags: C → AArch64 codegen diffs

Each card pairs a C snippet with its AArch64 codegen and the optimization delta. Trust the emitted shape; flag names are only hints. Assembly is representative. Exact registers and tail-cleanup vary by compiler version.

AArch64 quick legend

  • sN vs vN.4s: sN is one float32 scalar; vN.4s is a 4-lane float32 vector.
  • ld1/st1 with #16: load/store 16 bytes and post-increment the pointer by 16.
  • cmp + b.ne: compare and branch while not equal.
  • subs + b.gt: subtract and branch while remaining count is positive.
  • whilelt p0.s, x3, x2 + b.none: build an SVE lane mask for valid elements; exit when no lanes are active.
  • fmadd/fmla: fused multiply-add. One rounded FP op.

1) O-levels on a typical C loop

-O0 (debug-first shape)

C input:

float sum_debug(const float* a, int n) {
    float s = 0.f;
    for (int i = 0; i < n; ++i) s += a[i];
    return s;
}

AArch64 output:

stp x29, x30, [sp, #-48]!
mov x29, sp
...
ldr s0, [x0, x3, lsl #2]
str s0, [sp, #28]      // spill
ldr s0, [sp, #28]      // reload
b .Lloop

Optimization delta: minimal transform pressure and larger loop body, which helps source-level debug but hurts throughput.

-O2 (release baseline)

C input:

void saxpy_o2(float* y, const float* x, float a, int n) {
    for (int i = 0; i < n; ++i) y[i] = a * x[i] + y[i];
}

AArch64 output:

fmov s2, s0            // keep incoming a in a dedicated scalar reg
mov w3, #0
.Lloop:
ldr s0, [x1, x3, lsl #2]
ldr s1, [x0, x3, lsl #2]
fmadd s1, s0, s2, s1
str s1, [x0, x3, lsl #2]
add w3, w3, #1
cmp w3, w2
b.ne .Lloop

Optimization delta: tighter scalar loop and better scheduling around load/use latency, with predictable code size.

-O3 (aggressive loop/vector transforms)

C input:

void relu_o3(float* dst, const float* src, int n) {
    for (int i = 0; i < n; ++i) dst[i] = src[i] > 0.f ? src[i] : 0.f;
}

AArch64 output:

// vector-body excerpt; full output also emits scalar tail handling
movi v2.4s, #0
.Lvec:
ld1 {v0.4s}, [x1], #16
fmax v0.4s, v0.4s, v2.4s
st1 {v0.4s}, [x0], #16
subs x2, x2, #4
b.gt .Lvec

Optimization delta: auto-vectorized body reduces loop overhead; gains depend on contiguous accesses and cache residency.

-Ofast / contraction-enabled FP math

C input:

float affine_fast(float x, float a, float b) {
    return a * x + b;
}

AArch64 output:

// contraction disabled (two rounding points)
fmul s3, s0, s1
fadd s0, s3, s2

// contraction enabled (policy + target dependent)
fmadd s0, s0, s1, s2

How this can change results: when contraction is enabled (for example -ffp-contract=fast or fast-math policy), the compiler may emit hardware FMADD. Fused execution rounds once at the end. The non-contracted path uses two instructions and rounds after each step.

// toy decimal model with 3-digit rounding
// strict: round(round(1.234 * 9.876) + (-12.18))
//      => round(12.19 + (-12.18)) = 0.01
// fused : round(1.234 * 9.876 + (-12.18))
//      => round(0.0078...) = 0.008

Optimization delta: contraction and reassociation can improve hot-loop throughput, but numeric behavior can shift. Validate before using in control or safety firmware paths.

2) Cross-file and profile-guided optimizations

-flto (link-time optimization)

C input:

// tu_a.c
float saxpy_step(float y, float x, float a) { return a * x + y; }
// tu_b.c
for (int i = 0; i < n; ++i) y[i] = saxpy_step(y[i], x[i], a);

AArch64 output:

// no LTO
bl saxpy_step

// with LTO
ldr s0, [x1, x3, lsl #2]
ldr s1, [x0, x3, lsl #2]
fmadd s1, s0, s2, s1

Optimization delta: inlines across translation units, removes call overhead, and exposes more scheduling room.

-fprofile-generate / -fprofile-use (PGO)

C input:

int decode_tag(unsigned v) {
    if (v < 240) return fast_table[v];
    return slow_path(v);
}

AArch64 output:

cmp w0, #239
b.hi .Lcold
// hot path falls through
ldr w1, [x2, x0, lsl #2]
b .Ldone
.Lcold:
bl slow_path

Optimization delta: branch layout follows recorded profile data so the hot path is shorter in front-end cycles.

3) CPU targeting on ARM

-mcpu / -mtune

C input:

void mac_core(float* y, const float* x, const float* k, int n) {
    for (int i = 0; i < n; ++i) y[i] += x[i] * k[i];
}

AArch64 output:

ld1 {v0.4s}, [x1], #16
ld1 {v1.4s}, [x2], #16
ld1 {v2.4s}, [x0]
fmla v2.4s, v0.4s, v1.4s
st1 {v2.4s}, [x0], #16

Optimization delta: the compiler changes unroll and scheduling strategy to match the selected core model.

-march (ISA enablement, SVE example)

C input:

void axpy_sve(float* y, const float* x, float a, int n) {
    for (int i = 0; i < n; ++i) y[i] = a * x[i] + y[i];
}

AArch64 output:

mov x3, #0
fmov w4, s0
dup z2.s, w4
.Lsveloop:
whilelt p0.s, x3, x2
b.none .Ldone
ld1w z0.s, p0/z, [x1, x3, lsl #2]
ld1w z1.s, p0/z, [x0, x3, lsl #2]
fmla z1.s, p0/m, z0.s, z2.s
st1w z1.s, p0, [x0, x3, lsl #2]
incw x3
b .Lsveloop
.Ldone:

Optimization delta: enabling ISA features changes the instruction family available to codegen, not just micro-tuning.

4) Frame pointer policy and debug reliability

Default omit-frame-pointer path

C input:

int checksum_fast(const int* p, int n) {
    int s = 0;
    for (int i = 0; i < n; ++i) s += p[i];
    return s;
}

AArch64 output:

sub sp, sp, #32
...
add sp, sp, #32
ret

Optimization delta: frees one register and trims prologue cost. Trade-off: some profilers and on-device crash workflows produce less stable stacks without frame pointers.

-fno-omit-frame-pointer

C input:

int checksum_traceable(const int* p, int n) {
    int s = 0;
    for (int i = 0; i < n; ++i) s += p[i];
    return s;
}

AArch64 output:

stp x29, x30, [sp, #-32]!
mov x29, sp
...
ldp x29, x30, [sp], #32
ret

Optimization delta: more predictable stacks for profilers and postmortem triage, with a small runtime and register cost in very hot loops.

Host C/C++ ends here. ML compilation for CNN and LLM workloads on a generic edge SoC starts below.

ML Compilation Track: CNN and LLM to Edge SoC

Same compiler ideas, different IR. A conv stack is compute-bound on the NPU; LLM decode is memory-bound on LPDDR. The workload sets the bottleneck, and the bottleneck sets which passes earn their keep.

ML compilation overview

ML compilation lowers a framework graph (PyTorch / TF / JAX) through a portable IR, applies graph-level rewrites (constant folding, layout transform, quantization, fusion), then emits a hardware-specific artifact for an edge SoC. The deployment lane below (ONNX → TensorRT) and the compiler lane (XLA / HLO / MLIR) are the two shapes this takes in practice.

Host compiler vs. ML compiler

Dimension Host compiler (GCC / Clang) ML compiler (TensorRT / XLA / TVM / MLIR)
Primary IR GIMPLE / LLVM IR (scalar + SIMD) ONNX / HLO / Linalg / TOSA (tensor ops with shape)
Unit of work Instruction / basic block / loop Tensor op / fused kernel / subgraph
Scheduling concern Register pressure, vector lanes, cache lines Tile size, layout, on-chip SRAM reuse, off-chip bandwidth
Key optimizations LICM, DCE, vectorization, inlining, peephole Constant fold, BN fold, layout transform, quantization, fusion
Output artifact ELF object / shared library Engine plan (TensorRT), GGUF blob (llama.cpp), .vmfb (IREE)
Dominant cost Compute throughput, branch prediction Memory movement, especially LPDDR ↔ SoC
Deployment lane: ONNX to TensorRT Framework graph PyTorch / TF / JAX ONNX portable graph TensorRT builder fusion, tactics, FP16/INT8 calibration Engine plan target binary TensorRT runtime nvinfer, on the GPU Compiler lane: XLA / HLO pass sequence High-level graph ops + tensor shapes HLO canonicalization normalize algebra and layout Fusion and tiling reduce memory traffic Lower to backend LLVM / PTX / target ISA Final kernels launch plan
Deployment lane: ONNX to TensorRT Framework graph PyTorch / TensorFlow / JAX ONNX (optional) PyTorch can also use Torch-TensorRT TensorRT builder fusion, tactics, FP16/INT8 calibration Engine plan serialized binary TensorRT runtime nvinfer, on the GPU Compiler lane: XLA on NVIDIA GPU Framework graph JAX / TensorFlow ops StableHLO / HLO XLA IR hosted in MLIR dialects Fusion + layout passes cut HBM round-trips PTX Parallel Thread Execution (virtual ISA) SASS final GPU machine code (per-arch)

MLIR

MLIR is a compiler infrastructure built around dialects (Linalg, Affine, Vector, GPU, LLVM). Pipelines progressively lower from high-level dialects down to a backend. StableHLO (XLA), IREE, and Torch-MLIR are all MLIR-based stacks.

XLA / HLO

XLA lowers TF/JAX into HLO. StableHLO is the portable interchange dialect for OpenXLA. Pass order: shape inference → canonicalization → fusion → layout assignment → backend lowering. Goal: preserve math, cut memory traffic and kernel count.

ONNX + TensorRT

ONNX is both a graph format and a runtime (onnxruntime). In the deployment lane it's the graph hand-off into TensorRT, which parses, picks tactics, applies legal fusions, and emits a serialized engine for the target GPU and precision profile.

# common deployment path (ONNX -> TensorRT)
python export.py --format onnx --out model.onnx
trtexec --onnx=model.onnx --saveEngine=model.plan --fp16
trtexec --loadEngine=model.plan --shapes=input:1x3x224x224

GPU backend terms (data-center reference)

These terms describe the NVIDIA end of the ONNX → TensorRT toolchain. They show up when the same engine builder is targeted at data-center silicon instead of an edge SoC; the bandwidth numbers below are for context, not edge devices.

NVVM
NVIDIA's LLVM-based device compiler IR. It feeds PTX generation for CUDA device code.
PTX
Virtual ISA for NVIDIA GPUs. Not final machine code; the driver lowers it to SASS at load time.
SASS
Final GPU machine code for one compute capability (sm_80 Ampere, sm_90 Hopper, sm_100 Blackwell). Inspect with cuobjdump --dump-sass.
HBM
Stacked DRAM mounted next to the GPU die. About 3.4 TB/s on H100, 4.8 TB/s on H200, 8 TB/s on B200 with HBM3e. Data-center class; an edge SoC sees LPDDR at ~50–200 GB/s instead, which is exactly why the edge compiler works so hard on bandwidth.

Edge SoC reference model

The page targets a generic edge SoC, not a named board. The diagram below abstracts the parts an ML compiler has to reason about: a few CPU cores for the host program, an NPU or small GPU for tensor compute, a slab of on-chip SRAM that the compiler tiles into, and LPDDR for everything that does not fit. On narrow screens, scroll the diagram horizontally.

Generic edge SoC (single package) CPU cluster (ARM Cortex) host program, pre / post, tokenizer, control flow 4–8 cores, ~10–30 GFLOPS/core NPU / small GPU MAC array, INT8 / FP16 matmul + conv accelerator 10–100 INT8 TOPS On-chip SRAM L2 + scratchpad compiler tiles into this ~1–10 MB, ~1 TB/s ISP / camera image in (CNN input) DMA into SoC SRAM zero-copy when possible LPDDR (off-chip DRAM) model weights, KV cache, large activations, runtime state ~50–200 GB/s. Typically 5–20× slower than on-chip SRAM and an order of magnitude more energy per access. Every round-trip here is the latency and the joules the compiler is trying to avoid.

CNN is compute-bound

A conv stack spends most of its time multiplying. The NPU is the bottleneck, not LPDDR; weights and activations are small enough to stage through on-chip SRAM. The compiler keeps the MAC array full: INT8 quantization, channels-last layout, large fused conv tiles.

LLM decode is memory-bound

During token-by-token decode, each step reads all the weights and the KV cache from LPDDR to produce one token. The NPU sits idle waiting on memory. The compiler targets bandwidth: INT4 weight-only quant, paged KV cache layout, fused attention.

CNN path: image → PyTorch → ONNX → TensorRT → edge

Image to engine, seven stages. Each stage below is clickable.

1. Image input

The model sees a [N, C, H, W] tensor. NCHW is the PyTorch default; NHWC is preferred by many NPUs and gets requested via a layout pass downstream. Preprocessing (resize, mean and std normalization, BGR↔RGB) can stay on the CPU or be folded into the graph as Resize + Sub + Mul ops so the NPU runs end-to-end from the ISP buffer.

x = torch.from_numpy(img).permute(2, 0, 1).float() / 255.0
x = (x - mean) / std            # often folded into the ONNX graph
x = x.unsqueeze(0)               # [1, 3, 224, 224]

2. PyTorch → ONNX export

Trace the nn.Module with example inputs, lower aten ops to ONNX nodes, and pick an opset_version high enough for the ops you use. Resize cleaned up in opset 11; dynamic Slice in opset 10; opset 17 is a safe default for current PyTorch. Mark dynamic_axes for any batch or spatial dimension you want flexible. Python control flow and custom ops are the two recurring export hazards.

torch.onnx.export(
    model, x, "model.onnx",
    opset_version=17,
    dynamic_axes={"input": {0: "N"}, "output": {0: "N"}},
    do_constant_folding=True,
)

3. Graph optimization

Pre-build passes on the ONNX graph. Constant folding evaluates weight-only subgraphs at compile time. BatchNorm folds into the preceding Conv: the BN's affine collapses into Conv's weight and bias, so one fewer kernel ships. Identity and Dropout get removed. Layout transform rewrites the graph to NHWC when the NPU prefers channels-last. onnxsim or TensorRT's built-in optimizer does all of this.

python -m onnxsim model.onnx model.sim.onnx
# Conv(W, b) -> BN(gamma, beta, mean, var) -> ReLU
# becomes Conv(W', b') -> ReLU, where
#   s  = gamma / sqrt(var + eps)
#   W' = W * s
#   b' = (b - mean) * s + beta

4. INT8 post-training quantization (PTQ)

Feed a small calibration set (100–500 unlabeled images) through the FP32 graph, collect activation histograms per tensor, and pick a scale and zero-point that minimize KL divergence against the float distribution. Weights are quantized per-channel, one scale per output filter; activations per-tensor. INT8 holds up on conv stacks because per-channel weight scales and ReLU-bounded activations keep the dynamic range tight in each tensor. First conv and the final classifier are usually the sensitive layers; pin them to FP16 with a precision constraint.

# First pass: build a calibration cache from images in ./calib/
polygraphy convert model.sim.onnx --int8 \
    --data-loader-script calib_loader.py \
    --calibration-cache calib.cache \
    -o /dev/null

# Second pass: build the deployment engine, consuming the cache
trtexec --onnx=model.sim.onnx --int8 \
        --calib=calib.cache --saveEngine=model.int8.plan

5. Kernel fusion

The builder walks the graph and replaces matchable subgraphs with single fused kernels: Conv + Bias + ReLU, Conv + Add (residual), depthwise + pointwise (mobilenet block). Intermediates stay in registers / shared memory instead of round-tripping to LPDDR. The TensorRT verbose log shows [LayerFusion] lines listing which patterns matched.

6. TensorRT engine build

For each layer the builder times multiple tactics (different kernel implementations) on the actual hardware at the actual shape and precision, picks the winner, and serialises the choices into a .plan file. The plan is locked to one GPU / NPU SKU and one set of input shapes (unless you set a dynamic shape profile). This step is slow (minutes) but only happens once per deployment target.

7. Runtime on the edge SoC

Deserialize the plan, allocate I/O bindings in SoC SRAM where possible, set up a CUDA / vendor stream, and DMA the camera buffer straight into the input binding (zero-copy when the ISP shares the same physical memory). One enqueueV3 per frame; the NPU runs the fused kernel sequence; output comes back as detections / class scores.

// C++ pseudocode, edge inference loop
context->setTensorAddress("input",  d_input);
context->setTensorAddress("output", d_output);
while (capture(frame)) {
    dma(frame, d_input);
    context->enqueueV3(stream);
    cudaStreamSynchronize(stream);
    postprocess(d_output);
}

LLM path: tokens → PyTorch → GGUF / TensorRT-LLM → edge

Same seven-stage shape, different stack. ONNX rarely shows up here. LLM graphs are large, the shapes are dynamic, and the KV cache makes the runtime stateful, none of which ONNX was built around. The two practical paths are HF safetensors → GGUF for CPU and NPU edge via llama.cpp, and HF safetensors → TensorRT-LLM engine for GPU edge.

1. Tokens in

The CPU tokenizer (BPE / SentencePiece) turns prompt text into a [1, S] int32 array. The runtime handles two regimes from the same engine: prefill processes the whole prompt in one big matmul-rich pass (compute-bound, NPU-friendly); decode generates one token at a time and is dominated by reading weights and KV cache (memory-bound). At the edge, batch is almost always 1.

2. HF safetensors → GGUF or TensorRT-LLM

Two real paths. GGUF (llama.cpp): convert-hf-to-gguf.py model/ packs all weights + tokenizer + config into one file that the runtime mmaps; backend (CPU AVX / Metal / Vulkan / CUDA) is chosen at load time. TensorRT-LLM: a Python builder script defines the model in TRT-LLM's Python API, calls trtllm-build with model + max sequence / batch / KV-cache config, and emits an engine. ONNX is skipped: graph too dynamic, attention too custom.

# Path A: GGUF for CPU / NPU edge
python convert-hf-to-gguf.py meta-llama/Llama-3.2-1B --outfile model.gguf
llama-quantize model.gguf model.q4_k_m.gguf q4_k_m

# Path B: TensorRT-LLM for GPU edge
trtllm-build --checkpoint_dir ./hf_ckpt \
             --output_dir ./engine \
             --max_input_len 2048 --max_seq_len 4096 \
             --gemm_plugin float16 --use_paged_context_fmha enable

3. INT4 weight-only quantization

Decode is bandwidth-starved, so weights drop to INT4 or INT8 while activations stay FP16. GPTQ minimizes per-layer reconstruction error using calibration data and an approximate second-order Hessian (OBS-style). AWQ identifies the small fraction of weight channels that the activation outliers route through and scales them up before quantization, with an inverse scale on the activations so the math cancels out. Group size 64 or 128 sets the granularity (one scale per group). 4× fewer bytes per weight read from LPDDR. On Llama-class chat models at INT4 group 128, perplexity delta is typically under one point on common evals.

# AWQ via llm-awq, then export to TRT-LLM
python -m awq.entry --model_path meta-llama/Llama-3.2-1B \
                    --w_bit 4 --q_group_size 128 \
                    --run_awq --dump_awq awq.pt

4. KV cache layout

Each attention layer caches K and V tensors of shape [num_heads, head_dim] for every token generated so far. Total size is 2 × n_layers × n_heads × head_dim × seq_len × bytes_per_elem, often hundreds of MB at long context and the largest single LPDDR resident on the edge SoC. The compiler picks paged KV: allocate in fixed-size blocks, reference each layer's cache through a block table. The pattern originated in vLLM and is now the default in TRT-LLM. Sliding-window attention caps the cache at the last N tokens for models trained that way (Mistral-style).

5. Attention fusion

The attention block collapses into a small number of fused kernels. Q, K, and V projections fuse into one packed matmul against the concatenated weights. Fused multi-head attention (FlashAttention-style) computes softmax(QKT)V in tiled passes without ever materializing the full S×S attention matrix. RMSNorm + matmul fuses so the normalization output stays in registers for the next projection. Rotary position applies inline to Q and K before the dot product. Each fusion replaces one LPDDR write-then-read pair with a register hop. On decode that translates directly into tokens/sec.

6. Engine / GGUF build

The builder produces the deployable artifact. With llama.cpp the GGUF file is the engine; the runtime ships the kernels and picks a backend at startup. With TRT-LLM, trtllm-build times tactics per layer at the chosen shapes, sets the KV cache config, and emits a serialized engine plus a runtime config JSON. Like CNN engines, a TRT-LLM plan is bound to one GPU SKU.

7. Prefill / decode loop on the edge SoC

Two regimes share one engine. Prefill runs one forward pass over the entire prompt, populating the KV cache. It is matmul-heavy and pegs the NPU. Decode generates one token at a time. Each step reads all the weights plus the existing KV cache and appends one row, which pegs LPDDR bandwidth and leaves the NPU mostly idle. Continuous batching is off at batch=1 edge, so the one knob that moves tokens/sec is memory bandwidth utilization. Stages 3 through 5 exist to lift it.

// llama.cpp pseudocode at the edge (current API)
llama_model  *m   = llama_model_load_from_file("model.q4_k_m.gguf", mparams);
llama_context*ctx = llama_init_from_model(m, cparams);

llama_batch batch = llama_batch_init(/*n_tokens*/ 512, /*embd*/ 0, /*n_seq_max*/ 1);

// prefill the prompt
for (int i = 0; i < n_prompt; ++i) {
    llama_batch_add(batch, prompt_tokens[i], /*pos*/ i,
                    /*seq_ids*/ {0}, /*logits*/ i == n_prompt - 1);
}
llama_decode(ctx, batch);

// decode one token at a time
while (n_generated < max_tokens) {
    int id = llama_sampler_sample(sampler, ctx, -1);
    if (llama_vocab_is_eog(llama_model_get_vocab(m), id)) break;
    llama_batch_clear(batch);
    llama_batch_add(batch, id, n_prompt + n_generated, {0}, true);
    llama_decode(ctx, batch);
    emit(id);
}

Fusion catalog: CNN and LLM patterns

Kernel fusion is a graph-level optimization, not a C/C++ loop pass. It cuts tensor round-trips to LPDDR and removes per-op launch overhead. On a memory-starved edge SoC, those are the two costs you can actually move.

What changes when kernels are fused

  • Unfused: each op writes a full intermediate tensor to global memory.
  • Fused: adjacent ops execute in one kernel while intermediates stay near compute.
  • Typical gains: fewer launches, less bandwidth pressure, lower end-to-end latency.

Practical examples: Conv+BN+ReLU, MatMul+Bias+GELU, layernorm patterns.

Why this belongs to ML compilation

Graph compilers and runtime builders (XLA, TensorRT, Torch-Inductor) make fusion decisions based on tensor shapes, layout constraints, and backend legality.

# conceptual unfused path
T1 = Conv(X, W)
T2 = BatchNorm(T1)
Y  = ReLU(T2)

# conceptual fused path
Y = FusedConvBnRelu(X, W, gamma, beta)

CNN fusion patterns

  • Conv + Bias + ReLU. The canonical one; activation collapses into the conv epilogue.
  • Conv + Add. Residual branch added inside the conv kernel; ResNet skip connection lives in registers.
  • Conv + BatchNorm folded at compile time (weights rewritten); the fused kernel only sees Conv + Bias.
  • Depthwise + Pointwise (MobileNet/EfficientNet block) fused so the depthwise output never spills to LPDDR.
# MobileNet block (logical)
y = DW3x3(x)     ; y = BN(y)  ; y = ReLU6(y)
z = PW1x1(y)     ; z = BN(z)

# compiled (one kernel)
z = FusedDwPwBnReLU6(x, Wdw, Wpw, params)

LLM fusion patterns

  • QKV projection fused into one matmul against a packed weight tensor. The input activation is read from LPDDR once instead of three times.
  • Fused multi-head attention (FlashAttention-style). Softmax(QKT)V computed in tiles, so the full N×N attention matrix never lives in LPDDR.
  • RMSNorm + matmul. Normalization output stays in registers for the following Q/K/V or FFN projection.
  • Rotary position + QK. Rope applied inline before the dot product, no separate rope kernel.
# Llama-style block (logical)
h = RMSNorm(x)
q,k,v = Wq h, Wk h, Wv h         ; rope(q,k)
a = softmax(q @ k.T / sqrt(d)) @ v
o = Wo a

# compiled (two kernels)
qkv     = FusedRmsNormQKVRope(x, Wqkv, gamma)
o       = FusedFlashAttn(qkv, Wo)

Why this is worth doing at the edge. An LPDDR access costs roughly an order of magnitude more energy than a same-size SRAM access, and 5–20× more latency. A modern inference runs tens of millions of ops per second, so any fusion that turns even a fraction of those LPDDR trips into register or SRAM hops shows up directly in the latency and joules-per-inference budget.

Unfused graph-to-kernel path Kernel 1: Conv write tensor T1 to HBM Kernel 2/3: BN + ReLU read T1, write T2, read T2 extra global memory traffic three launches ↘ 3 kernel launches, 3× HBM round-trips
Fused graph-to-kernel path Single fused kernel: Conv + BN + ReLU intermediates stay local (register/shared) one final output write to HBM fewer launches and lower bandwidth cost ↳ 1 launch, 1 HBM write, stays in registers / shared