Compute

OaEngine

The assembled compute context. One object owns the GPU device, memory allocator, pipeline registry, bindless descriptor heap, and stream pool. Create it once. Everything flows through it.

Create

Every binary starts by creating OaEngine. It auto-registers as the process-wide global context, so OaTensor and OaModule dispatch to the GPU immediately, with no manual setup.

Main.cpp

#include <oa/runtime/engine.h>

int main(int argc, char** argv) {
    auto rt = OaEngine::Create({.AppName = "MyApp"}).Unwrap();
    // Global context is set: tensors and modules dispatch to GPU.
    // Configure shader search paths (debug only):
    rt.AddShaderSearchPath("spirv");
    RunApp(rt, argc, argv);
    rt.Destroy();
}

What It Owns

Component          | Purpose
OaDevice           | Physical + logical GPU device, SAM detection
OaAllocator        | GPU memory allocator
OaPipelineRegistry | Compute pipeline cache, shared_mutex protected
OaBindlessHeap     | 64K-slot global descriptor set
Stream pool        | Persistent async compute streams (free-list stack)
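
The table notes that the pipeline cache is shared_mutex protected. As an illustration of that read-mostly locking pattern, here is a minimal sketch. PipelineSketch, PipelineCacheSketch, GetOrCreate, and BuildCount are hypothetical names invented for this example; they are not OaPipelineRegistry's actual interface, and the real registry may be organized differently.

```cpp
#include <memory>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a compiled compute pipeline.
struct PipelineSketch {
    std::string Name;
};

// Hypothetical cache; illustrates shared_mutex use only.
class PipelineCacheSketch {
public:
    // Returns the cached pipeline for `name`, building it on first use.
    std::shared_ptr<PipelineSketch> GetOrCreate(const std::string& name) {
        {
            // Fast path: any number of readers may hold the shared lock.
            std::shared_lock<std::shared_mutex> read(mutex_);
            auto it = cache_.find(name);
            if (it != cache_.end()) return it->second;
        }
        // Slow path: take the exclusive lock and re-check, because another
        // thread may have inserted the entry between the two locks.
        std::unique_lock<std::shared_mutex> write(mutex_);
        auto it = cache_.find(name);
        if (it != cache_.end()) return it->second;
        auto created = std::make_shared<PipelineSketch>(PipelineSketch{name});
        ++buildCount_;
        cache_.emplace(name, created);
        return created;
    }

    // Number of pipelines actually built (cache misses).
    int BuildCount() const {
        std::shared_lock<std::shared_mutex> read(mutex_);
        return buildCount_;
    }

private:
    mutable std::shared_mutex mutex_;
    std::unordered_map<std::string, std::shared_ptr<PipelineSketch>> cache_;
    int buildCount_ = 0;
};
```

Lookups of an already-built pipeline take only the shared lock, so concurrent dispatch threads do not block each other. The exclusive lock is taken only on a cache miss.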

Compute Streams

Each stream owns a persistent command pool, command buffer, and timeline semaphore. The engine manages a pool of streams via AcquireStream / ReleaseStream.
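
To make the free-list idea concrete, the following sketch shows one way a pool of persistent streams could be handed out and returned in LIFO order. StreamSketch and StreamPoolSketch are hypothetical stand-ins, not the engine's types. Returning nullptr when the pool is exhausted and guarding the list with a std::mutex are both assumptions of this example; the source does not say how the engine handles either.

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a stream. A real stream would own its
// command pool, command buffer, and timeline semaphore.
struct StreamSketch {
    int Id = 0;
};

class StreamPoolSketch {
public:
    // Creates all streams up front so they persist for the pool's lifetime.
    explicit StreamPoolSketch(std::size_t count) {
        streams_.reserve(count);
        free_.reserve(count);
        for (std::size_t i = 0; i < count; ++i) {
            streams_.push_back(std::make_unique<StreamSketch>());
            streams_.back()->Id = static_cast<int>(i);
            free_.push_back(streams_.back().get());
        }
    }

    // Pops the most recently released stream (LIFO).
    // Assumption: returns nullptr when no stream is free.
    StreamSketch* Acquire() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (free_.empty()) return nullptr;
        StreamSketch* stream = free_.back();
        free_.pop_back();
        return stream;
    }

    // Pushes the stream back onto the free-list stack for reuse.
    void Release(StreamSketch* stream) {
        std::lock_guard<std::mutex> lock(mutex_);
        free_.push_back(stream);
    }

    std::size_t FreeCount() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return free_.size();
    }

private:
    mutable std::mutex mutex_;
    std::vector<std::unique_ptr<StreamSketch>> streams_;  // owns the streams
    std::vector<StreamSketch*> free_;                      // free-list stack
};
```

A stack hands back the most recently used stream first, which tends to keep a small working set of streams hot instead of cycling through the whole pool.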

Streams.cpp

// Batched dispatch
auto* stream = rt.AcquireStream();
stream->Begin();
stream->Record(rt, "rmsnorm", bufs1, &push1, sizeof(push1), groups1);
stream->Record(rt, "silu", bufs2, &push2, sizeof(push2), groups2);
stream->SubmitAndWait(rt);
rt.ReleaseStream(stream);
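
Because the example pairs AcquireStream with ReleaseStream by hand, an early return or exception between the two calls would leave the stream checked out. One possible way to guard against that is a small RAII wrapper, sketched below. ScopedStreamSketch is a hypothetical helper and not part of the engine. It assumes only what the example above shows: AcquireStream() returns a pointer and ReleaseStream() accepts it.

```cpp
#include <utility>

// Hypothetical RAII guard. Works with any engine type that exposes
// AcquireStream() returning a pointer and ReleaseStream(pointer).
template <class Engine>
class ScopedStreamSketch {
public:
    using StreamPtr = decltype(std::declval<Engine&>().AcquireStream());

    explicit ScopedStreamSketch(Engine& engine)
        : engine_(engine), stream_(engine.AcquireStream()) {}

    // Returns the stream to the pool when the guard leaves scope.
    ~ScopedStreamSketch() {
        if (stream_) engine_.ReleaseStream(stream_);
    }

    // Non-copyable: exactly one guard releases each stream.
    ScopedStreamSketch(const ScopedStreamSketch&) = delete;
    ScopedStreamSketch& operator=(const ScopedStreamSketch&) = delete;

    StreamPtr operator->() const { return stream_; }
    StreamPtr Get() const { return stream_; }

private:
    Engine& engine_;
    StreamPtr stream_;
};
```

With such a guard, the batched dispatch above would construct the guard from rt, call Begin, Record, and SubmitAndWait through operator->, and omit the explicit ReleaseStream call.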

Synchronization

All stream synchronization uses timeline semaphores. Pipeline barriers are inserted automatically and kept to a minimum: one is recorded only where a read-after-write hazard exists. Queue submissions are serialized via a std::mutex.
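
The read-after-write rule can be illustrated with a small CPU-side tracker. This is a sketch under stated assumptions, not the engine's implementation: DispatchSketch, HazardTrackerSketch, and NeedsBarrier are hypothetical names, buffers are identified by plain integer IDs, and a barrier is assumed to make all earlier writes visible at once.

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical description of one dispatch: the buffer IDs it reads
// and the buffer IDs it writes.
struct DispatchSketch {
    std::vector<std::uint32_t> Reads;
    std::vector<std::uint32_t> Writes;
};

class HazardTrackerSketch {
public:
    // Returns true if a barrier should be recorded before this dispatch,
    // i.e. it reads a buffer written since the last barrier (RAW hazard).
    bool NeedsBarrier(const DispatchSketch& dispatch) {
        bool hazard = false;
        for (std::uint32_t id : dispatch.Reads) {
            if (pendingWrites_.count(id) != 0) {
                hazard = true;
                break;
            }
        }
        if (hazard) {
            // Assumption: one barrier flushes every pending write.
            pendingWrites_.clear();
        }
        // This dispatch's own writes become pending for later readers.
        for (std::uint32_t id : dispatch.Writes) {
            pendingWrites_.insert(id);
        }
        return hazard;
    }

private:
    std::unordered_set<std::uint32_t> pendingWrites_;
};
```

In this model, two consecutive dispatches that touch unrelated buffers record no barrier between them, while a dispatch that consumes the previous dispatch's output does. The sketch tracks only read-after-write hazards, matching the text, and deliberately ignores write-after-write and write-after-read ordering.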