Two tracks at equal depth. Host compilation for AArch64 codegen: loop passes, vectorization, GCC flag behavior. ML compilation for the path from a PyTorch CNN or LLM to a deployable artifact on a generic edge SoC.
Vectorization, loop passes, pointer aliasing, then GCC flags.
Writing code the auto-vectorizer can use
NEON and SVE prose covers what the hardware can do. This section covers what the source code has to look like for -O3 to actually emit a vectorized loop.
restrict on every pointer the loop touches. Removes the may-alias gate. See the aliasing section for the exact contract.
Alignment hints.alignas(16) or alignas(32) on the buffer, plus __builtin_assume_aligned(p, 16) at the loop entry. Lets the compiler pick aligned ld1 over unaligned ldur.
Vectorization pragmas.#pragma omp simd needs -fopenmp-simd (or -fopenmp) on GCC and Clang to take effect; without the flag the pragma is silently ignored. #pragma clang loop vectorize(enable) works without flags on Clang.
Contiguous unit-stride access. Strided or gather patterns force scalar fallback or expensive gather instructions.
Trip-count hints. Peel by a multiple of the vector width with if (n % 8 == 0) or __builtin_assume(n % 8 == 0) so the vectorizer skips the scalar tail.
What blocks vectorization: function calls inside the loop (unless the callee is __attribute__((const))), early break/return, dependent reductions written without reduction(+:x).
Before: scalar emit
C input:
void axpy(float *a, float *b, float *c,
float k, int n) {
for (int i = 0; i < n; ++i)
c[i] = a[i] * k + b[i];
}
Three plain pointers, no alignment guarantee, no pragma. The compiler assumes possible aliasing and emits scalar fmadd per iteration.
After: 4-lane SIMD emit
C input:
void axpy(float * restrict a,
float * restrict b,
float * restrict c,
float k, int n) {
a = __builtin_assume_aligned(a, 16);
b = __builtin_assume_aligned(b, 16);
c = __builtin_assume_aligned(c, 16);
#pragma omp simd
for (int i = 0; i < n; ++i)
c[i] = a[i] * k + b[i];
}
k arrives in s0 per AAPCS64. The compiler broadcasts it once with dup v3.4s, v0.s[0], then the body runs ld1 {v0.4s}, ld1 {v1.4s}, fmla v1.4s, v0.4s, v3.4s, st1 {v1.4s} over four float lanes per iteration, with a peeled scalar tail for the remainder.
C/C++ loop optimization (LICM + DCE)
Focus on which loop instructions disappeared and why.
Invariant code motion in plain terms
If a value doesn't change inside a loop, compute it once before the loop. Saves instructions every iteration and opens scheduling room.
// before LICM
for (int i = 0; i < n; ++i) {
int stride = cols * 4; // invariant but recomputed
dst[i] = src[i * stride] + bias;
}
// after LICM
int stride = cols * 4;
for (int i = 0; i < n; ++i) {
dst[i] = src[i * stride] + bias;
}
On memory-bound loops it's no miracle, but it removes useless ALU work and shrinks the hot loop body.
DCE in practical terms
Dead Code Elimination removes computations whose results are never used and have no side effects.
What is removed: unused temporaries, redundant arithmetic chains, dead stores.
Why it helps: smaller loop body, less scheduler pressure, clearer vectorization opportunities.
Developer signal: cleaner post-pass IR usually means cleaner assembly too.
// dead path example
int t0 = x * 4;
int t1 = t0 + 1; // removed if never consumed
sum += x;
Stage
IR snapshot
What changed
Input IR
%k = mul n, 4 loop: %v = load A[i] %u = add %v, c %dead = add 0, 0
Invariant %k sits in the hot loop and %dead has no consumer.
After LICM
preheader: %k = mul n, 4 loop: %v = load A[i] %u = add %v, c %dead = add 0, 0
Invariant multiply moved out of loop.
The multiply now runs once instead of trip_count times.
After DCE
preheader: %k = mul n, 4 loop: %v = load A[i] %u = add %v, c
Unused computation removed. Final loop body is smaller, so schedulers and vectorizers have less clutter.
The preheader runs once before the loop. LICM hoists invariant math there so the loop body stops repeating it.
A dead instruction burns cycles without changing the output. After DCE the loop body is tighter and the scheduler has more room.
Pointer aliasing and restrict
When two pointer parameters might overlap, the compiler reloads after every store and cannot vectorize. restrict (C99) and __restrict__ (GCC and Clang C++) tell the compiler that this pointer is the only path the function uses to reach that memory. Loads can then stay in registers across stores, hoist out of loops, and fold into SIMD lanes. Violating the contract is undefined behavior, and the compiler does not check.
Without restrict
C input:
void add(int *a,
int *b,
int *c, int n) {
for (int i = 0; i < n; ++i)
a[i] = b[i] + c[i];
}
The compiler must assume a may alias b or c. A store to a[i] could change a future b or c read, so loads cannot be hoisted, kept in registers, or fused into a SIMD load. Auto-vectorization gives up.
With restrict
C input:
void add(int * restrict a,
const int * restrict b,
const int * restrict c, int n) {
for (int i = 0; i < n; ++i)
a[i] = b[i] + c[i];
}
b and c loads fold into a single SIMD ld1 per iteration block, the store does not invalidate the register cache, and the loop emits four-lane add + st1 on AArch64.
Same rule applies in C++ on hot loops over arrays of POD: mark every pointer the loop reads or writes with __restrict__. The optimizer cannot infer it from const alone.
GCC/G++ flags: C → AArch64 codegen diffs
Each card pairs a C snippet with its AArch64 codegen and the optimization delta. Trust the emitted shape; flag names are only hints. Assembly is representative. Exact registers and tail-cleanup vary by compiler version.
AArch64 quick legend
sN vs vN.4s:sN is one float32 scalar; vN.4s is a 4-lane float32 vector.
ld1/st1 with #16: load/store 16 bytes and post-increment the pointer by 16.
cmp + b.ne: compare and branch while not equal.
subs + b.gt: subtract and branch while remaining count is positive.
whilelt p0.s, x3, x2 + b.none: build an SVE lane mask for valid elements; exit when no lanes are active.
fmadd/fmla: fused multiply-add. One rounded FP op.
1) O-levels on a typical C loop
-O0 (debug-first shape)
C input:
float sum_debug(const float* a, int n) {
float s = 0.f;
for (int i = 0; i < n; ++i) s += a[i];
return s;
}
x0: base pointer a. x3: loop index. s0: temporary scalar load. x29/x30: frame pointer and return address.
Line-by-line
stp x29, x30 and mov x29, sp create a full debug-friendly frame. ldr s0 fetches one float. str/ldr spill+reload shows lower optimization at -O0. b .Lloop jumps back to iterate.
Optimization delta: minimal transform pressure and larger loop body, which helps source-level debug but hurts throughput.
-O2 (release baseline)
C input:
void saxpy_o2(float* y, const float* x, float a, int n) {
for (int i = 0; i < n; ++i) y[i] = a * x[i] + y[i];
}
x0: pointer y. x1: pointer x. w2: n. w3: loop index i. s0 initially carries argument a, then s2 keeps that scalar for the loop. s0/s1 are reused for per-element values.
Line-by-line
fmov s2, s0 preserves incoming scalar a. ldr s0 loads x[i] and ldr s1 loads y[i]. fmadd s1, s0, s2, s1 computes x[i] * a + y[i] in one step. str s1 stores updated y[i]. cmp + b.ne runs the loop until i == n.
Optimization delta: tighter scalar loop and better scheduling around load/use latency, with predictable code size.
-O3 (aggressive loop/vector transforms)
C input:
void relu_o3(float* dst, const float* src, int n) {
for (int i = 0; i < n; ++i) dst[i] = src[i] > 0.f ? src[i] : 0.f;
}
x1: source pointer. x0: destination pointer. x2: remaining element count. v0.4s: 4-lane float chunk of src. v2.4s: vector filled with zeros.
Line-by-line
movi v2.4s, #0 builds a zero vector. ld1 {v0.4s}, [x1], #16 loads 4 floats and moves x1 forward by 16 bytes. fmax applies ReLU by clamping negative lanes to zero. st1 {v0.4s}, [x0], #16 stores 4 outputs and advances x0. subs x2, x2, #4 consumes 4 elements. b.gt .Lvec repeats this vector body while chunks remain; a separate tail path handles n % 4.
Optimization delta: auto-vectorized body reduces loop overhead; gains depend on contiguous accesses and cache residency.
-Ofast / contraction-enabled FP math
C input:
float affine_fast(float x, float a, float b) {
return a * x + b;
}
How this can change results: when contraction is enabled (for example -ffp-contract=fast or fast-math policy), the compiler may emit hardware FMADD. Fused execution rounds once at the end. The non-contracted path uses two instructions and rounds after each step.
Register map
s0: x on entry and return value on exit. s1: a. s2: b. s3: temporary product in strict mode.
Line-by-line
Non-contracted path computes product first (fmul) and then adds bias (fadd), so there are two floating-point operations and two rounding points. Contracted path uses one fused instruction (fmadd), so rounding happens once at the end. Throughput can improve, but exact numeric bits can change.
Optimization delta: contraction and reassociation can improve hot-loop throughput, but numeric behavior can shift. Validate before using in control or safety firmware paths.
2) Cross-file and profile-guided optimizations
-flto (link-time optimization)
C input:
// tu_a.c
float saxpy_step(float y, float x, float a) { return a * x + y; }
// tu_b.c
for (int i = 0; i < n; ++i) y[i] = saxpy_step(y[i], x[i], a);
x0: y pointer. x1: x pointer. x3: index. s2: scalar a. s0/s1: loaded elements.
Line-by-line
bl saxpy_step is the non-LTO path that keeps a function call in the loop. With LTO, the call is replaced by inlined loads and fmadd, so the loop body avoids call overhead and exposes more scheduling room.
Optimization delta: inlines across translation units, removes call overhead, and exposes more scheduling room.
-fprofile-generate / -fprofile-use (PGO)
C input:
int decode_tag(unsigned v) {
if (v < 240) return fast_table[v];
return slow_path(v);
}
AArch64 output:
cmp w0, #239
b.hi .Lcold
// hot path falls through
ldr w1, [x2, x0, lsl #2]
b .Ldone
.Lcold:
bl slow_path
Register map
w0: input v. x2: fast_table base pointer. w1: loaded fast-path value. .Lcold: cold branch label.
Line-by-line
cmp w0, #239 checks fast-path range. b.hi .Lcold jumps only when value exceeds hot threshold. Otherwise execution falls through to ldr from lookup table. bl slow_path is kept on the cold path so common traffic stays on a shorter branch path.
Optimization delta: branch layout follows recorded profile data so the hot path is shorter in front-end cycles.
3) CPU targeting on ARM
-mcpu / -mtune
C input:
void mac_core(float* y, const float* x, const float* k, int n) {
for (int i = 0; i < n; ++i) y[i] += x[i] * k[i];
}
x1: pointer x. x2: pointer k. x0: pointer y. v0.4s/v1.4s: 4-element vectors from x/k. v2.4s: current y values.
Line-by-line
Two ld1 instructions fetch 4 floats from x and k. ld1 {v2.4s} loads current output chunk. fmla performs element-wise multiply-add on four lanes. st1 stores updated chunk and advances y pointer by 16 bytes.
Optimization delta: the compiler changes unroll and scheduling strategy to match the selected core model.
-march (ISA enablement, SVE example)
C input:
void axpy_sve(float* y, const float* x, float a, int n) {
for (int i = 0; i < n; ++i) y[i] = a * x[i] + y[i];
}
x3: current index. x2: total count n. p0: active-lane predicate mask. w4: scalar a bit pattern copied from s0. z0: x vector. z1: y vector. z2: broadcast scalar a.
Line-by-line
fmov copies scalar a from FP argument register s0 into a GP register, and dup broadcasts it into z2.s. whilelt creates the active-lane mask for elements where i < n. b.none exits when no lanes remain. ld1w ... p0/z loads only active lanes. fmla ... p0/m updates only active lanes. st1w ... p0 stores active lanes only. incw advances by one SVE vector chunk, then control returns to rebuild the predicate.
Optimization delta: enabling ISA features changes the instruction family available to codegen, not just micro-tuning.
4) Frame pointer policy and debug reliability
Default omit-frame-pointer path
C input:
int checksum_fast(const int* p, int n) {
int s = 0;
for (int i = 0; i < n; ++i) s += p[i];
return s;
}
AArch64 output:
sub sp, sp, #32
...
add sp, sp, #32
ret
Register map
sp: stack pointer only. x29 frame pointer is not used as a stable chain in this shape.
Line-by-line
sub sp reserves stack space. Work happens in the omitted body. add sp releases stack space. ret returns. This is compact. With unwind metadata (for example DWARF CFI), many tools can still unwind correctly, but frame-pointer chains are often easier for low-overhead sampling and field triage.
Optimization delta: frees one register and trims prologue cost. Trade-off: some profilers and on-device crash workflows produce less stable stacks without frame pointers.
-fno-omit-frame-pointer
C input:
int checksum_traceable(const int* p, int n) {
int s = 0;
for (int i = 0; i < n; ++i) s += p[i];
return s;
}
stp x29, x30 saves previous frame pointer and return address while allocating stack. mov x29, sp creates a stable frame anchor. ldp x29, x30 restores both on exit. ret then returns with an explicit frame chain.
Optimization delta: more predictable stacks for profilers and postmortem triage, with a small runtime and register cost in very hot loops.
Host C/C++ ends here. ML compilation for CNN and LLM workloads on a generic edge SoC starts below.
ML Compilation Track: CNN and LLM to Edge SoC
Same compiler ideas, different IR. A conv stack is compute-bound on the NPU; LLM decode is memory-bound on LPDDR. The workload sets the bottleneck, and the bottleneck sets which passes earn their keep.
ML compilation overview
ML compilation lowers a framework graph (PyTorch / TF / JAX) through a portable IR, applies graph-level rewrites (constant folding, layout transform, quantization, fusion), then emits a hardware-specific artifact for an edge SoC. The deployment lane below (ONNX → TensorRT) and the compiler lane (XLA / HLO / MLIR) are the two shapes this takes in practice.
Engine plan (TensorRT), GGUF blob (llama.cpp), .vmfb (IREE)
Dominant cost
Compute throughput, branch prediction
Memory movement, especially LPDDR ↔ SoC
ONNX block
ONNX is the interchange checkpoint. It keeps model semantics but not final kernel scheduling decisions.
TensorRT builder block
This is where kernel tactics, precision choices, and legal fusion are selected for the target GPU.
HLO canonicalization block
HLO rewrites expression forms into a normalized shape so later passes can match and optimize them reliably.
Fusion and tiling block
This is the high-impact stage for bandwidth. It decides how much intermediate tensor traffic can be eliminated.
MLIR
MLIR is a compiler infrastructure built around dialects (Linalg, Affine, Vector, GPU, LLVM). Pipelines progressively lower from high-level dialects down to a backend. StableHLO (XLA), IREE, and Torch-MLIR are all MLIR-based stacks.
XLA / HLO
XLA lowers TF/JAX into HLO. StableHLO is the portable interchange dialect for OpenXLA. Pass order: shape inference → canonicalization → fusion → layout assignment → backend lowering. Goal: preserve math, cut memory traffic and kernel count.
ONNX + TensorRT
ONNX is both a graph format and a runtime (onnxruntime). In the deployment lane it's the graph hand-off into TensorRT, which parses, picks tactics, applies legal fusions, and emits a serialized engine for the target GPU and precision profile.
These terms describe the NVIDIA end of the ONNX → TensorRT toolchain. They show up when the same engine builder is targeted at data-center silicon instead of an edge SoC; the bandwidth numbers below are for context, not edge devices.
NVVM
NVIDIA's LLVM-based device compiler IR. It feeds PTX generation for CUDA device code.
PTX
Virtual ISA for NVIDIA GPUs. Not final machine code; the driver lowers it to SASS at load time.
SASS
Final GPU machine code for one compute capability (sm_80 Ampere, sm_90 Hopper, sm_100 Blackwell). Inspect with cuobjdump --dump-sass.
HBM
Stacked DRAM mounted next to the GPU die. About 3.4 TB/s on H100, 4.8 TB/s on H200, 8 TB/s on B200 with HBM3e. Data-center class; an edge SoC sees LPDDR at ~50–200 GB/s instead, which is exactly why the edge compiler works so hard on bandwidth.
Edge SoC reference model
The page targets a generic edge SoC, not a named board. The diagram below abstracts the parts an ML compiler has to reason about: a few CPU cores for the host program, an NPU or small GPU for tensor compute, a slab of on-chip SRAM that the compiler tiles into, and LPDDR for everything that does not fit. On narrow screens, scroll the diagram horizontally.
CNN is compute-bound
A conv stack spends most of its time multiplying. The NPU is the bottleneck, not LPDDR; weights and activations are small enough to stage through on-chip SRAM. The compiler keeps the MAC array full: INT8 quantization, channels-last layout, large fused conv tiles.
LLM decode is memory-bound
During token-by-token decode, each step reads all the weights and the KV cache from LPDDR to produce one token. The NPU sits idle waiting on memory. The compiler targets bandwidth: INT4 weight-only quant, paged KV cache layout, fused attention.
Image to engine, seven stages. Each stage below is clickable.
1. Image input
The model sees a [N, C, H, W] tensor. NCHW is the PyTorch default; NHWC is preferred by many NPUs and gets requested via a layout pass downstream. Preprocessing (resize, mean and std normalization, BGR↔RGB) can stay on the CPU or be folded into the graph as Resize + Sub + Mul ops so the NPU runs end-to-end from the ISP buffer.
x = torch.from_numpy(img).permute(2, 0, 1).float() / 255.0
x = (x - mean) / std # often folded into the ONNX graph
x = x.unsqueeze(0) # [1, 3, 224, 224]
2. PyTorch → ONNX export
Trace the nn.Module with example inputs, lower aten ops to ONNX nodes, and pick an opset_version high enough for the ops you use. Resize cleaned up in opset 11; dynamic Slice in opset 10; opset 17 is a safe default for current PyTorch. Mark dynamic_axes for any batch or spatial dimension you want flexible. Python control flow and custom ops are the two recurring export hazards.
Pre-build passes on the ONNX graph. Constant folding evaluates weight-only subgraphs at compile time. BatchNorm folds into the preceding Conv: the BN's affine collapses into Conv's weight and bias, so one fewer kernel ships. Identity and Dropout get removed. Layout transform rewrites the graph to NHWC when the NPU prefers channels-last. onnxsim or TensorRT's built-in optimizer does all of this.
python -m onnxsim model.onnx model.sim.onnx
# Conv(W, b) -> BN(gamma, beta, mean, var) -> ReLU
# becomes Conv(W', b') -> ReLU, where
# s = gamma / sqrt(var + eps)
# W' = W * s
# b' = (b - mean) * s + beta
4. INT8 post-training quantization (PTQ)
Feed a small calibration set (100–500 unlabeled images) through the FP32 graph, collect activation histograms per tensor, and pick a scale and zero-point that minimize KL divergence against the float distribution. Weights are quantized per-channel, one scale per output filter; activations per-tensor. INT8 holds up on conv stacks because per-channel weight scales and ReLU-bounded activations keep the dynamic range tight in each tensor. First conv and the final classifier are usually the sensitive layers; pin them to FP16 with a precision constraint.
# First pass: build a calibration cache from images in ./calib/
polygraphy convert model.sim.onnx --int8 \
--data-loader-script calib_loader.py \
--calibration-cache calib.cache \
-o /dev/null
# Second pass: build the deployment engine, consuming the cache
trtexec --onnx=model.sim.onnx --int8 \
--calib=calib.cache --saveEngine=model.int8.plan
5. Kernel fusion
The builder walks the graph and replaces matchable subgraphs with single fused kernels: Conv + Bias + ReLU, Conv + Add (residual), depthwise + pointwise (mobilenet block). Intermediates stay in registers / shared memory instead of round-tripping to LPDDR. The TensorRT verbose log shows [LayerFusion] lines listing which patterns matched.
6. TensorRT engine build
For each layer the builder times multiple tactics (different kernel implementations) on the actual hardware at the actual shape and precision, picks the winner, and serialises the choices into a .plan file. The plan is locked to one GPU / NPU SKU and one set of input shapes (unless you set a dynamic shape profile). This step is slow (minutes) but only happens once per deployment target.
7. Runtime on the edge SoC
Deserialize the plan, allocate I/O bindings in SoC SRAM where possible, set up a CUDA / vendor stream, and DMA the camera buffer straight into the input binding (zero-copy when the ISP shares the same physical memory). One enqueueV3 per frame; the NPU runs the fused kernel sequence; output comes back as detections / class scores.
// C++ pseudocode, edge inference loop
context->setTensorAddress("input", d_input);
context->setTensorAddress("output", d_output);
while (capture(frame)) {
dma(frame, d_input);
context->enqueueV3(stream);
cudaStreamSynchronize(stream);
postprocess(d_output);
}
Same seven-stage shape, different stack. ONNX rarely shows up here. LLM graphs are large, the shapes are dynamic, and the KV cache makes the runtime stateful, none of which ONNX was built around. The two practical paths are HF safetensors → GGUF for CPU and NPU edge via llama.cpp, and HF safetensors → TensorRT-LLM engine for GPU edge.
1. Tokens in
The CPU tokenizer (BPE / SentencePiece) turns prompt text into a [1, S] int32 array. The runtime handles two regimes from the same engine: prefill processes the whole prompt in one big matmul-rich pass (compute-bound, NPU-friendly); decode generates one token at a time and is dominated by reading weights and KV cache (memory-bound). At the edge, batch is almost always 1.
2. HF safetensors → GGUF or TensorRT-LLM
Two real paths. GGUF (llama.cpp):convert-hf-to-gguf.py model/ packs all weights + tokenizer + config into one file that the runtime mmaps; backend (CPU AVX / Metal / Vulkan / CUDA) is chosen at load time. TensorRT-LLM: a Python builder script defines the model in TRT-LLM's Python API, calls trtllm-build with model + max sequence / batch / KV-cache config, and emits an engine. ONNX is skipped: graph too dynamic, attention too custom.
Decode is bandwidth-starved, so weights drop to INT4 or INT8 while activations stay FP16. GPTQ minimizes per-layer reconstruction error using calibration data and an approximate second-order Hessian (OBS-style). AWQ identifies the small fraction of weight channels that the activation outliers route through and scales them up before quantization, with an inverse scale on the activations so the math cancels out. Group size 64 or 128 sets the granularity (one scale per group). 4× fewer bytes per weight read from LPDDR. On Llama-class chat models at INT4 group 128, perplexity delta is typically under one point on common evals.
# AWQ via llm-awq, then export to TRT-LLM
python -m awq.entry --model_path meta-llama/Llama-3.2-1B \
--w_bit 4 --q_group_size 128 \
--run_awq --dump_awq awq.pt
4. KV cache layout
Each attention layer caches K and V tensors of shape [num_heads, head_dim] for every token generated so far. Total size is 2 × n_layers × n_heads × head_dim × seq_len × bytes_per_elem, often hundreds of MB at long context and the largest single LPDDR resident on the edge SoC. The compiler picks paged KV: allocate in fixed-size blocks, reference each layer's cache through a block table. The pattern originated in vLLM and is now the default in TRT-LLM. Sliding-window attention caps the cache at the last N tokens for models trained that way (Mistral-style).
5. Attention fusion
The attention block collapses into a small number of fused kernels. Q, K, and V projections fuse into one packed matmul against the concatenated weights. Fused multi-head attention (FlashAttention-style) computes softmax(QKT)V in tiled passes without ever materializing the full S×S attention matrix. RMSNorm + matmul fuses so the normalization output stays in registers for the next projection. Rotary position applies inline to Q and K before the dot product. Each fusion replaces one LPDDR write-then-read pair with a register hop. On decode that translates directly into tokens/sec.
6. Engine / GGUF build
The builder produces the deployable artifact. With llama.cpp the GGUF file is the engine; the runtime ships the kernels and picks a backend at startup. With TRT-LLM, trtllm-build times tactics per layer at the chosen shapes, sets the KV cache config, and emits a serialized engine plus a runtime config JSON. Like CNN engines, a TRT-LLM plan is bound to one GPU SKU.
7. Prefill / decode loop on the edge SoC
Two regimes share one engine. Prefill runs one forward pass over the entire prompt, populating the KV cache. It is matmul-heavy and pegs the NPU. Decode generates one token at a time. Each step reads all the weights plus the existing KV cache and appends one row, which pegs LPDDR bandwidth and leaves the NPU mostly idle. Continuous batching is off at batch=1 edge, so the one knob that moves tokens/sec is memory bandwidth utilization. Stages 3 through 5 exist to lift it.
// llama.cpp pseudocode at the edge (current API)
llama_model *m = llama_model_load_from_file("model.q4_k_m.gguf", mparams);
llama_context*ctx = llama_init_from_model(m, cparams);
llama_batch batch = llama_batch_init(/*n_tokens*/ 512, /*embd*/ 0, /*n_seq_max*/ 1);
// prefill the prompt
for (int i = 0; i < n_prompt; ++i) {
llama_batch_add(batch, prompt_tokens[i], /*pos*/ i,
/*seq_ids*/ {0}, /*logits*/ i == n_prompt - 1);
}
llama_decode(ctx, batch);
// decode one token at a time
while (n_generated < max_tokens) {
int id = llama_sampler_sample(sampler, ctx, -1);
if (llama_vocab_is_eog(llama_model_get_vocab(m), id)) break;
llama_batch_clear(batch);
llama_batch_add(batch, id, n_prompt + n_generated, {0}, true);
llama_decode(ctx, batch);
emit(id);
}
Fusion catalog: CNN and LLM patterns
Kernel fusion is a graph-level optimization, not a C/C++ loop pass. It cuts tensor round-trips to LPDDR and removes per-op launch overhead. On a memory-starved edge SoC, those are the two costs you can actually move.
What changes when kernels are fused
Unfused: each op writes a full intermediate tensor to global memory.
Fused: adjacent ops execute in one kernel while intermediates stay near compute.
Typical gains: fewer launches, less bandwidth pressure, lower end-to-end latency.
Graph compilers and runtime builders (XLA, TensorRT, Torch-Inductor) make fusion decisions based on tensor shapes, layout constraints, and backend legality.
Conv + Bias + ReLU. The canonical one; activation collapses into the conv epilogue.
Conv + Add. Residual branch added inside the conv kernel; ResNet skip connection lives in registers.
Conv + BatchNorm folded at compile time (weights rewritten); the fused kernel only sees Conv + Bias.
Depthwise + Pointwise (MobileNet/EfficientNet block) fused so the depthwise output never spills to LPDDR.
# MobileNet block (logical)
y = DW3x3(x) ; y = BN(y) ; y = ReLU6(y)
z = PW1x1(y) ; z = BN(z)
# compiled (one kernel)
z = FusedDwPwBnReLU6(x, Wdw, Wpw, params)
LLM fusion patterns
QKV projection fused into one matmul against a packed weight tensor. The input activation is read from LPDDR once instead of three times.
Fused multi-head attention (FlashAttention-style). Softmax(QKT)V computed in tiles, so the full N×N attention matrix never lives in LPDDR.
RMSNorm + matmul. Normalization output stays in registers for the following Q/K/V or FFN projection.
Rotary position + QK. Rope applied inline before the dot product, no separate rope kernel.
# Llama-style block (logical)
h = RMSNorm(x)
q,k,v = Wq h, Wk h, Wv h ; rope(q,k)
a = softmax(q @ k.T / sqrt(d)) @ v
o = Wo a
# compiled (two kernels)
qkv = FusedRmsNormQKVRope(x, Wqkv, gamma)
o = FusedFlashAttn(qkv, Wo)
Why this is worth doing at the edge. An LPDDR access costs roughly an order of magnitude more energy than a same-size SRAM access, and 5–20× more latency. A modern inference runs tens of millions of ops per second, so any fusion that turns even a fraction of those LPDDR trips into register or SRAM hops shows up directly in the latency and joules-per-inference budget.
Unfused blocks
Each block is a separate kernel boundary. Intermediate tensors are written and reread from global memory.
Fused block
Multiple graph ops are compiled into one kernel, minimizing intermediate memory traffic.
Why it matters
On large models, memory movement often dominates compute. Fusion attacks that bottleneck directly.
Where to inspect
Use compiler/runtime reports (XLA HLO dumps, TensorRT layer fusion logs) to verify fusion actually happened.