Compute
OaComputeGraph
CPU-side DAG that tracks per-buffer read/write dependencies, inserts minimal barriers, and supports compile-once replay-many execution.
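To make the dependency tracking concrete, here is a minimal sketch of how read/write hazards per buffer can decide where barriers go. The types `Access`, `BufferUse`, `Node` and the function `PlanBarriers` are hypothetical names for illustration, not part of the OaComputeGraph API, and the sketch assumes a single full barrier that synchronizes all earlier work rather than whatever finer-grained scheme the library uses.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical types for illustration; not the OaComputeGraph API.
enum class Access { Read, Write };

struct BufferUse {
    int buffer;     // buffer id
    Access access;  // how the node touches the buffer
};

struct Node {
    std::vector<BufferUse> uses;
};

// Nodes are given in submission order. Element i of the result is true
// when a barrier is required before node i.
std::vector<bool> PlanBarriers(const std::vector<Node>& nodes) {
    // Accesses to each buffer since the last barrier.
    struct State { bool pendingWrite = false; bool pendingRead = false; };
    std::unordered_map<int, State> state;
    std::vector<bool> barrierBefore(nodes.size(), false);

    for (std::size_t i = 0; i < nodes.size(); ++i) {
        bool hazard = false;
        for (const BufferUse& u : nodes[i].uses) {
            auto it = state.find(u.buffer);
            if (it == state.end()) continue;
            // Read-after-write or write-after-write on an unsynchronized write.
            if (it->second.pendingWrite) hazard = true;
            // Write-after-read: an earlier node may still be reading.
            if (u.access == Access::Write && it->second.pendingRead) hazard = true;
        }
        if (hazard) {
            barrierBefore[i] = true;
            state.clear();  // assumed: a full barrier synchronizes everything so far
        }
        // Record this node's accesses as pending until the next barrier.
        for (const BufferUse& u : nodes[i].uses) {
            State& s = state[u.buffer];
            if (u.access == Access::Write) s.pendingWrite = true;
            else s.pendingRead = true;
        }
    }
    return barrierBefore;
}
```

Two nodes that only read the same buffer never trigger a barrier between them, which is where the savings over a barrier-after-every-dispatch approach come from.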
Dynamic Execution
For one-shot or changing topologies:
Dynamic.cpp
```cpp
OaComputeGraph graph;
graph.Add("rmsnorm", bufs1, access1, &push1, sizeof(push1), groups1);
graph.Add("matmul", bufs2, access2, &push2, sizeof(push2), groups2);
graph.Add("silu", bufs3, access3, &push3, sizeof(push3), groups3);
auto status = graph.Execute(rt); // topo-sort, barrier insertion, submit, wait
graph.Reset();
```
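The topo-sort step mentioned in the `Execute()` comment can be illustrated with Kahn's algorithm. This is a generic sketch, not the library's implementation: `TopoSort` and its adjacency-list input are assumptions made for the example, with `edges[i]` listing the nodes that depend on node `i`.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// Kahn's algorithm. edges[i] lists the nodes that must run after node i.
// Returns one valid execution order, or an empty vector if there is a cycle.
std::vector<int> TopoSort(const std::vector<std::vector<int>>& edges) {
    const std::size_t n = edges.size();

    // Count unmet dependencies per node.
    std::vector<int> indegree(n, 0);
    for (const auto& outs : edges)
        for (int to : outs) ++indegree[to];

    // Nodes with no dependencies are ready immediately.
    std::queue<int> ready;
    for (std::size_t i = 0; i < n; ++i)
        if (indegree[i] == 0) ready.push(static_cast<int>(i));

    std::vector<int> order;
    while (!ready.empty()) {
        int node = ready.front();
        ready.pop();
        order.push_back(node);
        // Finishing this node may unblock its dependents.
        for (int to : edges[node])
            if (--indegree[to] == 0) ready.push(to);
    }

    if (order.size() != n) order.clear();  // some nodes never became ready: cycle
    return order;
}
```

With dependency edges derived from buffer reads and writes, the returned order is one in which every node runs after the nodes that produce its inputs.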
Compile + Replay
For static topologies (ML training, inference), where the graph is identical every step: compile once, then replay thousands of times with zero CPU recording overhead.
Compiled.cpp
```cpp
// Init: compile once
graph.Add("rmsnorm", bufs, access, &push, sizeof(push), groups);
// ... add all nodes ...
OA_RETURN_IF_ERROR(graph.Compile(rt)); // records secondary command buffer

// Hot path: replay every step (no CPU re-recording)
OA_RETURN_IF_ERROR(graph.Replay(rt)); // replays recorded command buffer
```
Measured Performance
Intel HD 620 iGPU (2017 laptop, 24 EUs):
| Metric | Result |
|---|---|
| Replay speedup | 2-4.56x vs Execute() |
| Memory aliasing | 71-92% VRAM savings |
| Barrier elimination | 60-70% fewer barriers |
| Per-replay cost | ~17 us/dispatch |
Memory Aliasing
The graph knows every buffer's exact lifetime (first-write to last-read). Buffers that don't overlap in time share the same device memory. Measured 71-92% VRAM savings on transient activations.
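The lifetime-based sharing described above can be sketched as a greedy interval assignment. This is a simplified illustration under stated assumptions, not the library's allocator: `Lifetime`, `AliasPlan`, `PlanAliasing`, and `TotalBytes` are hypothetical names, lifetimes are expressed as node indices in execution order, and each shared slot is sized to the largest buffer placed in it.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical description of a transient buffer (not the library's types).
struct Lifetime {
    int firstWrite;    // step that first writes the buffer
    int lastRead;      // step that last reads it (inclusive)
    std::size_t size;  // bytes
};

struct AliasPlan {
    std::vector<int> slotOf;            // memory slot chosen for each buffer
    std::vector<std::size_t> slotSize;  // bytes reserved for each slot
};

AliasPlan PlanAliasing(const std::vector<Lifetime>& bufs) {
    // Visit buffers in order of first write.
    std::vector<int> order(bufs.size());
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(), [&](int a, int b) {
        return bufs[a].firstWrite < bufs[b].firstWrite;
    });

    AliasPlan plan;
    plan.slotOf.assign(bufs.size(), -1);
    std::vector<int> busyUntil;  // last step at which each slot is still in use

    for (int b : order) {
        // Reuse the first slot whose previous occupant is already dead.
        int chosen = -1;
        for (std::size_t s = 0; s < busyUntil.size(); ++s) {
            if (busyUntil[s] < bufs[b].firstWrite) {
                chosen = static_cast<int>(s);
                break;
            }
        }
        if (chosen < 0) {  // every slot is live: open a new one
            chosen = static_cast<int>(busyUntil.size());
            busyUntil.push_back(0);
            plan.slotSize.push_back(0);
        }
        busyUntil[chosen] = bufs[b].lastRead;
        plan.slotSize[chosen] = std::max(plan.slotSize[chosen], bufs[b].size);
        plan.slotOf[b] = chosen;
    }
    return plan;
}

// Total device memory the plan reserves.
std::size_t TotalBytes(const AliasPlan& plan) {
    return std::accumulate(plan.slotSize.begin(), plan.slotSize.end(),
                           std::size_t{0});
}
```

For a chain of four equally sized activations where each one is consumed by the next node, this assignment needs two slots instead of four. The savings in a real graph depend on its topology and buffer sizes, so this toy case says nothing about the measured figures quoted above.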