Compute

OaComputeGraph

CPU-side DAG that tracks per-buffer read/write dependencies, inserts minimal barriers, and supports compile-once replay-many execution.
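The barrier-minimizing idea can be sketched independently of the library. The following is a minimal illustration, not OaComputeGraph's implementation: the names `Access`, `BufferUse`, `Node`, and `PlanBarriers` are hypothetical, and the sketch assumes a single global barrier that orders all earlier work. It walks nodes in submission order and requests a barrier only when a node touches a buffer with a pending read-after-write, write-after-write, or write-after-read hazard.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical types for illustration only; not the OaComputeGraph API.
enum class Access { Read, Write };

struct BufferUse {
    int buffer;     // opaque buffer id
    Access access;  // how the node touches the buffer
};

struct Node {
    std::vector<BufferUse> uses;
};

// Returns, for each node in submission order, whether a barrier must precede it.
// Assumption: one global barrier completes all earlier reads and writes.
std::vector<bool> PlanBarriers(const std::vector<Node>& nodes) {
    struct Pending {
        bool write = false;  // written since the last barrier
        bool read = false;   // read since the last barrier
    };
    std::unordered_map<int, Pending> pending;
    std::vector<bool> barrierBefore(nodes.size(), false);

    for (std::size_t i = 0; i < nodes.size(); ++i) {
        bool hazard = false;
        for (const BufferUse& u : nodes[i].uses) {
            auto it = pending.find(u.buffer);
            if (it == pending.end()) continue;  // untouched since last barrier
            if (it->second.write) hazard = true;  // read-after-write or write-after-write
            if (u.access == Access::Write && it->second.read) hazard = true;  // write-after-read
        }
        if (hazard) {
            barrierBefore[i] = true;
            pending.clear();  // the barrier resolves every outstanding access
        }
        // Record this node's accesses for the nodes that follow.
        for (const BufferUse& u : nodes[i].uses) {
            Pending& p = pending[u.buffer];
            if (u.access == Access::Write) p.write = true;
            else p.read = true;
        }
    }
    return barrierBefore;
}
```

For three nodes where only the second consumes the first node's output and the third touches unrelated buffers, this plan requests one barrier rather than one between every pair of dispatches. Read-after-read never triggers a barrier.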

Dynamic Execution

For one-shot or changing topologies:

Dynamic.cpp

OaComputeGraph graph;
graph.Add("rmsnorm", bufs1, access1, &push1, sizeof(push1), groups1);
graph.Add("matmul",  bufs2, access2, &push2, sizeof(push2), groups2);
graph.Add("silu",    bufs3, access3, &push3, sizeof(push3), groups3);
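auto status = graph.Execute(rt);  // topo-sort, barrier insertion, submit, wait
// Inspect status before continuing; the Compile + Replay example uses OA_RETURN_IF_ERROR for this.
graph.Reset();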
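The comment on Execute() mentions a topological sort. One common way to obtain such an order, shown here as an assumption rather than the library's actual algorithm, is to derive edges from per-buffer accesses and then run Kahn's algorithm. All type and function names below (`BuildEdges`, `TopoOrder`, and the small structs) are hypothetical.

```cpp
#include <queue>
#include <unordered_map>
#include <vector>

// Hypothetical types for illustration only; not the OaComputeGraph API.
enum class Access { Read, Write };
struct BufferUse { int buffer; Access access; };
struct Node { std::vector<BufferUse> uses; };

// Builds successor lists from insertion order. A node depends on the last
// writer of each buffer it touches; a writer also depends on every reader
// that has used the buffer since that last write.
std::vector<std::vector<int>> BuildEdges(const std::vector<Node>& nodes) {
    struct Track {
        int lastWriter = -1;
        std::vector<int> readers;  // readers since lastWriter
    };
    std::unordered_map<int, Track> track;
    const int n = static_cast<int>(nodes.size());
    std::vector<std::vector<int>> succ(nodes.size());

    for (int i = 0; i < n; ++i) {
        for (const BufferUse& u : nodes[i].uses) {
            Track& t = track[u.buffer];
            if (t.lastWriter >= 0 && t.lastWriter != i) {
                succ[t.lastWriter].push_back(i);
            }
            if (u.access == Access::Write) {
                for (int r : t.readers) {
                    if (r != i) succ[r].push_back(i);
                }
                t.readers.clear();
                t.lastWriter = i;
            } else {
                t.readers.push_back(i);
            }
        }
    }
    return succ;
}

// Kahn's algorithm: repeatedly emit nodes whose dependencies are all emitted.
std::vector<int> TopoOrder(const std::vector<std::vector<int>>& succ) {
    std::vector<int> indegree(succ.size(), 0);
    for (const auto& out : succ) {
        for (int v : out) ++indegree[v];
    }
    std::queue<int> ready;
    for (int i = 0; i < static_cast<int>(succ.size()); ++i) {
        if (indegree[i] == 0) ready.push(i);
    }
    std::vector<int> order;
    while (!ready.empty()) {
        int u = ready.front();
        ready.pop();
        order.push_back(u);
        for (int v : succ[u]) {
            if (--indegree[v] == 0) ready.push(v);
        }
    }
    return order;
}
```

Because edges in this sketch always point from an earlier node to a later one, insertion order is already a valid schedule; the explicit sort matters when a scheduler wants to regroup independent nodes.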

Compile + Replay

Use this for static topologies (ML training, inference) where the graph is identical every step. Compile once, then replay thousands of times without re-recording commands on the CPU.

Compiled.cpp

// Init: compile once
OaComputeGraph graph;  // must outlive the hot path, since Replay() is called on it later
graph.Add("rmsnorm", bufs, access, &push, sizeof(push), groups);
// ... add all nodes ...
OA_RETURN_IF_ERROR(graph.Compile(rt));  // records secondary command buffer

// Hot path: replay every step (no CPU re-recording)
OA_RETURN_IF_ERROR(graph.Replay(rt));   // replays recorded command buffer

Measured Performance

Intel HD 620 iGPU (2017 laptop, 24 EUs):

Metric	Result
Replay speedup	2x to 4.56x vs. Execute()
Memory aliasing	71-92% VRAM savings
Barrier elimination	60-70% fewer barriers
Per-replay cost	~17 µs per dispatch

Memory Aliasing

The graph knows every buffer's exact lifetime, from its first write to its last read. Buffers whose lifetimes do not overlap can share the same device memory. On transient activations, this measured 71-92% VRAM savings.
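Lifetime-based aliasing can be illustrated with a small first-fit planner. This is a sketch under stated assumptions, not the library's allocator: `Lifetime`, `AliasPlan`, `PlanAliasing`, and `TotalBytes` are hypothetical names, lifetimes are given as node indices, and a slot is reused only when its previous tenant's last read comes strictly before the new buffer's first write.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical description of a transient buffer.
struct Lifetime {
    std::size_t bytes;
    int firstWrite;  // index of the first node that writes the buffer
    int lastRead;    // index of the last node that reads the buffer
};

struct AliasPlan {
    std::vector<int> slotOf;             // shared slot chosen for each buffer
    std::vector<std::size_t> slotBytes;  // capacity needed by each slot
};

// First-fit assignment: visit buffers by first write and reuse the first
// slot whose current tenant is already dead.
AliasPlan PlanAliasing(const std::vector<Lifetime>& bufs) {
    std::vector<int> order(bufs.size());
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(), [&](int a, int b) {
        return bufs[a].firstWrite < bufs[b].firstWrite;
    });

    AliasPlan plan;
    plan.slotOf.assign(bufs.size(), -1);
    std::vector<int> busyUntil;  // last read of each slot's current tenant

    for (int b : order) {
        int chosen = -1;
        for (std::size_t s = 0; s < busyUntil.size(); ++s) {
            if (busyUntil[s] < bufs[b].firstWrite) {
                chosen = static_cast<int>(s);
                break;
            }
        }
        if (chosen < 0) {  // every slot is still live: open a new one
            chosen = static_cast<int>(busyUntil.size());
            busyUntil.push_back(0);
            plan.slotBytes.push_back(0);
        }
        busyUntil[chosen] = bufs[b].lastRead;
        plan.slotBytes[chosen] = std::max(plan.slotBytes[chosen], bufs[b].bytes);
        plan.slotOf[b] = chosen;
    }
    return plan;
}

// Total device memory required by the plan.
std::size_t TotalBytes(const AliasPlan& plan) {
    return std::accumulate(plan.slotBytes.begin(), plan.slotBytes.end(),
                           std::size_t{0});
}
```

As an illustrative input (not a measurement), four buffers of 1024, 2048, 1024, and 512 bytes with lifetimes [0,1], [1,2], [2,3], and [3,4] fit into two slots totaling 3072 bytes, compared with 4608 bytes when each buffer gets its own allocation. First-fit is simple but not optimal; a production allocator would also need to respect alignment and memory-type constraints.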