Architecture
OaAdapter — Model Interface
Two routes to build models. One interface to run them all. From PyTorch-style prototyping to zero-overhead production dispatch, the adapter bridges any architecture to any application.
Two Routes, One Stack
Every model in Realm can be built two ways. Both produce the same .oam checkpoint, run on the same GPU compute engine, and plug into the same train/chat/serve applications.
| Aspect | Easy Route (Level 1) | Production Route (Level 2) |
|---|---|---|
| API | OaTensor + nn.Module + autograd | OaModule::Dispatch + batched command buffers |
| Style | PyTorch / Keras | Manual GPU programming |
| Dispatch | One GPU submit per op | ~30 ops in one submit |
| Memory | Automatic (autograd tape) | All buffers allocated in Init, reused |
| Gradients | Automatic (loss.Backward()) | Manual backward shaders |
| Best for | Prototyping, research, small models | Production training, inference serving |
| Example | TinyLlm | OaLlm, OaGpt |
The Easy Route
Define an OaModule with layers from nn.h. Forward returns a tensor. Autograd handles the rest. If you know PyTorch, you know this.
Easy.cpp
    // Level 1: The Easy Route
    // Define layers, forward, backward, step. Just like PyTorch.
    class TinyLlm : public OaModule {
    public:
        TinyLlm(OaI32 D, OaI32 DFF, OaI32 NL) {
            Embed_ = OaMakeShared<OaEmbedding>(256, D);
            for (OaI32 i = 0; i < NL; ++i) {
                Norm_.push_back(OaMakeShared<OaRMSNorm>(D));
                Up_.push_back(OaMakeShared<OaLinear>(D, DFF));
                Down_.push_back(OaMakeShared<OaLinear>(DFF, D));
            }
            Head_ = OaMakeShared<OaLinear>(D, 256);
        }

        OaTensor Forward(const OaTensor& x) override {
            auto h = Embed_->Forward(x);
            for (OaI32 i = 0; i < (OaI32)Norm_.size(); ++i) {
                auto r = Norm_[i]->Forward(h);
                r = Up_[i]->Forward(r).SiLU();
                h = h + Down_[i]->Forward(r);
            }
            return Head_->Forward(h);
        }

        // Member declarations (Embed_, Norm_, Up_, Down_, Head_) are not shown.
    };

    // Training loop: 6 lines.
    // `input` and `target` are batch tensors prepared elsewhere (not shown).
    auto rt = OaEngine::Create({.AppName = "Train"}).Unwrap();
    TinyLlm model(64, 256, 2);
    OaAdamW optimizer(model.Parameters(), 3e-4f);

    for (int step = 0; step < 1000; ++step) {
        auto loss = OaCrossEntropyLoss(model.Forward(input), target);
        loss.Backward();
        optimizer.Step();
        optimizer.ZeroGrad();
    }
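To make the "autograd tape" idea concrete, here is a minimal scalar sketch of what a call like loss.Backward() does conceptually: each operation appends a node that remembers its inputs and local derivatives, and the backward pass walks the tape in reverse. The `Tape` class and all of its members are hypothetical and written only for illustration; Realm's autograd operates on OaTensor and its internals are not described here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scalar tape for illustration only; not the Realm autograd.
class Tape {
public:
    // Records an input value and returns its index on the tape.
    std::size_t Leaf(double v) { return Push(v, kNone, kNone, 0.0, 0.0); }

    // z = a + b, with dz/da = 1 and dz/db = 1.
    std::size_t Add(std::size_t a, std::size_t b) {
        return Push(Val(a) + Val(b), a, b, 1.0, 1.0);
    }

    // z = a * b, with dz/da = b and dz/db = a.
    std::size_t Mul(std::size_t a, std::size_t b) {
        return Push(Val(a) * Val(b), a, b, Val(b), Val(a));
    }

    double Val(std::size_t i) const {
        assert(i < Nodes_.size());
        return Nodes_[i].Value;
    }

    double Grad(std::size_t i) const {
        assert(i < Nodes_.size());
        return Nodes_[i].Grad;
    }

    // Reverse sweep: seed d(out)/d(out) = 1, then push gradients to inputs.
    void Backward(std::size_t out) {
        assert(out < Nodes_.size());
        for (auto& n : Nodes_) n.Grad = 0.0;
        Nodes_[out].Grad = 1.0;
        for (std::size_t i = out + 1; i-- > 0;) {
            const Node n = Nodes_[i];  // copy: inputs always have smaller indices
            if (n.A != kNone) Nodes_[n.A].Grad += n.DA * n.Grad;
            if (n.B != kNone) Nodes_[n.B].Grad += n.DB * n.Grad;
        }
    }

private:
    static constexpr std::size_t kNone = static_cast<std::size_t>(-1);

    struct Node {
        double Value;
        double Grad;
        std::size_t A, B;  // input indices, or kNone for leaves
        double DA, DB;     // local derivatives with respect to A and B
    };

    std::size_t Push(double v, std::size_t a, std::size_t b, double da, double db) {
        Nodes_.push_back(Node{v, 0.0, a, b, da, db});
        return Nodes_.size() - 1;
    }

    std::vector<Node> Nodes_;
};
```

The tape grows with every operation, which is the memory cost the comparison table attributes to the Easy Route. The Production Route avoids it by writing the backward pass by hand.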
The Production Route
For maximum throughput: allocate all GPU buffers once in Init(), batch all dispatches into a single command buffer, submit once per step. No per-op overhead. No autograd tape. Manual backward shaders.
Production.cpp
    // Level 2: The Production Route
    // Manual GPU dispatch. Buffers allocated once, reused every step.
    // ~30 dispatches batched into a single command buffer submission.
    struct OaLlm : public OaModule {
        OaStatus Init(OaEngine& rt, const OaLlmConfig& cfg) {
            LoadShaders(rt, shaderTable);  // embedded SPIR-V
            for (auto& layer : Layers_)
                layer.Init(rt, cfg);       // Alloc all buffers
            InitWeights();                 // Fill via MappedPtr
            return OaStatus::Ok();
        }

        OaStatus Forward(OaEngine& rt) {
            BeginBatch(rt);
            Dispatch(rt, "byte_embed", ...);
            for (auto& layer : Layers_) {
                Dispatch(rt, "rmsnorm", ...);
                Dispatch(rt, "matmul", ...);
                Dispatch(rt, "silu", ...);
                // ... ~30 dispatches total
            }
            return FlushDispatches(rt);    // one submit, one wait
        }

        OaStatus Backward(OaEngine& rt) { /* mirror of Forward */ }
        OaStatus Step(OaEngine& rt, OaI64 step, OaF32 lr) { /* AdamW */ }
    };
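The difference between "one GPU submit per op" and "many ops in one submit" can be illustrated without a GPU. The sketch below uses a mock queue that only counts submissions. `MockQueue`, `MockBatch`, `RunPerOp`, and `BuildOps` are hypothetical names invented for this example and are not part of the Realm API. The Begin/Dispatch/Flush shape loosely mirrors the BeginBatch/Dispatch/FlushDispatches calls shown above, but the real signatures may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a GPU queue: it only counts work, so the two
// dispatch strategies can be compared. Not part of the Realm API.
struct MockQueue {
    std::size_t Submits = 0;  // number of queue submissions
    std::size_t OpsRun = 0;   // number of ops executed

    void Submit(const std::vector<std::string>& ops) {
        assert(!ops.empty() && "submitting an empty command buffer");
        ++Submits;
        OpsRun += ops.size();
    }
};

// Level 1 style: every op becomes its own submission.
inline void RunPerOp(MockQueue& q, const std::vector<std::string>& ops) {
    for (const auto& op : ops) q.Submit({op});
}

// Level 2 style: record every op first, then submit the whole list once.
class MockBatch {
public:
    void Begin() {
        Recorded_.clear();
        Recording_ = true;
    }

    void Dispatch(std::string op) {
        assert(Recording_ && "Dispatch called outside Begin/Flush");
        Recorded_.push_back(std::move(op));
    }

    void Flush(MockQueue& q) {
        if (!Recorded_.empty()) q.Submit(Recorded_);
        Recorded_.clear();
        Recording_ = false;
    }

private:
    std::vector<std::string> Recorded_;
    bool Recording_ = false;
};

// Builds the op list for a toy model: one embed, then three ops per layer.
inline std::vector<std::string> BuildOps(int layers) {
    std::vector<std::string> ops{"byte_embed"};
    for (int i = 0; i < layers; ++i) {
        ops.push_back("rmsnorm");
        ops.push_back("matmul");
        ops.push_back("silu");
    }
    return ops;
}
```

With two layers, `RunPerOp` produces seven submissions while `MockBatch` produces one for the same seven ops. On real hardware each submission carries fixed CPU and driver overhead, which is why the Production Route batches.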
The Adapter
Models from both routes are exposed to applications through OaAdapterLm, a unified interface for train, generate, save, and load. The application layer never touches model internals. A factory creates the right adapter from an architecture name.
Adapter.h
    // OaAdapter: The Bridge
    // Unified interface between app (train/chat) and any model architecture.
    struct OaAdapter {
        virtual OaStatus Init(OaEngine& rt) = 0;
        virtual void Destroy() = 0;
        virtual OaStringView GetArchitecture() const = 0;
    };

    struct OaAdapterLm : OaAdapter {
        virtual OaStatus Train(OaEngine& rt, const OaU8* data, ...) = 0;
        virtual OaStatus Generate(OaEngine& rt, ...) = 0;
        virtual OaStatus Save(const OaString& path) = 0;
        virtual OaStatus Load(OaEngine& rt, const OaString& path) = 0;
    };

    // Factory: architecture selection by name.
    OaUnique<OaAdapter> OaCreateAdapter(OaStringView arch);
    // Returns OaAdapterLlm for "OaLlm", OaAdapterGpt for "OaGpt", etc.
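One common way to implement a name-based factory like this is a registry that maps architecture names to constructor functions. The sketch below is self-contained and uses simplified stand-in types (`Adapter`, `LlmAdapter`, `GptAdapter`, `CreateAdapter`, `Registry`), all of them hypothetical. How OaCreateAdapter is actually implemented, and how it reports an unknown name, is not stated here; returning a null pointer is an assumption made for this example.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <string_view>
#include <unordered_map>

// Simplified stand-in for the adapter interface (hypothetical). The real
// interface also has Init/Destroy and takes an engine.
struct Adapter {
    virtual ~Adapter() = default;
    virtual std::string_view GetArchitecture() const = 0;
};

struct LlmAdapter final : Adapter {
    std::string_view GetArchitecture() const override { return "OaLlm"; }
};

struct GptAdapter final : Adapter {
    std::string_view GetArchitecture() const override { return "OaGpt"; }
};

using AdapterCtor = std::function<std::unique_ptr<Adapter>()>;

// Registry mapping architecture names to constructors. Supporting a new
// architecture means adding one entry; callers do not change.
inline const std::unordered_map<std::string, AdapterCtor>& Registry() {
    static const std::unordered_map<std::string, AdapterCtor> table = {
        {"OaLlm", []() -> std::unique_ptr<Adapter> { return std::make_unique<LlmAdapter>(); }},
        {"OaGpt", []() -> std::unique_ptr<Adapter> { return std::make_unique<GptAdapter>(); }},
    };
    return table;
}

// Assumption: an unknown name yields nullptr so the caller can report it.
inline std::unique_ptr<Adapter> CreateAdapter(std::string_view arch) {
    const auto& table = Registry();
    const auto it = table.find(std::string(arch));
    if (it == table.end()) return nullptr;
    assert(it->second && "registered constructor must be callable");
    return it->second();
}
```

A caller would check the result before use, for example `if (auto a = CreateAdapter(name)) { /* use a */ } else { /* report unknown architecture */ }`.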
Plug and Play
The train and chat binaries are architecture-agnostic. Pass --arch OaLlm or --arch OaGpt and the factory wires everything up. Same binary, any model.
Train.cpp
    // The app doesn't care which model is underneath.
    // Same train binary runs any architecture.
    struct TrainApp : OaVkApp {
        OaUnique<OaAdapter> Adapter;
        OaAdapterLm* Model = nullptr;

        OaStatus Init() override {
            Adapter = OaCreateAdapter(archName);
            Model = static_cast<OaAdapterLm*>(Adapter.get());
            return Model->Init(Rt);
        }

        OaStatus Tick() override {
            auto status = Model->Train(Rt, batch, ...);
            if (step % 500 == 0)
                Model->Save(checkpointPath);
            return status;
        }
    };

    // One binary. Any architecture. Plug and play.
    // ./train --arch OaLlm -d data.txt --d-model 384 --layers 6
    // ./train --arch OaGpt -d data.txt --d-model 768 --layers 12
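The command lines above imply a small amount of argument handling before the factory is called. A minimal sketch of that step follows, covering only the four flags that appear in the examples (`--arch`, `-d`, `--d-model`, `--layers`). `TrainOptions`, `ParsePositiveInt`, and `ParseTrainArgs` are hypothetical names. Treating `--arch` as required and rejecting unknown flags are assumptions, not documented behavior of the train binary.

```cpp
#include <cassert>
#include <charconv>
#include <cstddef>
#include <optional>
#include <string>
#include <system_error>
#include <vector>

// Hypothetical container for the flags shown in the example command lines.
struct TrainOptions {
    std::string Arch;      // value of --arch, e.g. "OaLlm"
    std::string DataPath;  // value of -d
    int DModel = 0;        // value of --d-model
    int Layers = 0;        // value of --layers
};

// Parses a strictly positive decimal integer; rejects trailing characters.
inline std::optional<int> ParsePositiveInt(const std::string& text) {
    int value = 0;
    const char* first = text.data();
    const char* last = first + text.size();
    const auto [ptr, ec] = std::from_chars(first, last, value);
    if (ec != std::errc{} || ptr != last || value <= 0) return std::nullopt;
    return value;
}

// Expects flag/value pairs (argv without the program name).
// Returns nullopt on any malformed input.
inline std::optional<TrainOptions> ParseTrainArgs(const std::vector<std::string>& args) {
    TrainOptions opts;
    for (std::size_t i = 0; i < args.size(); i += 2) {
        if (i + 1 >= args.size()) return std::nullopt;  // flag without a value
        const std::string& flag = args[i];
        const std::string& value = args[i + 1];
        if (flag == "--arch") {
            opts.Arch = value;
        } else if (flag == "-d") {
            opts.DataPath = value;
        } else if (flag == "--d-model" || flag == "--layers") {
            const auto n = ParsePositiveInt(value);
            if (!n) return std::nullopt;
            assert(*n > 0);  // guaranteed by ParsePositiveInt
            (flag == "--layers" ? opts.Layers : opts.DModel) = *n;
        } else {
            return std::nullopt;  // unknown flag
        }
    }
    if (opts.Arch.empty()) return std::nullopt;  // assumption: --arch is required
    return opts;
}
```

The parsed `Arch` string is what would be handed to the factory. Everything else in the options is model configuration that the adapter, not the application, interprets.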
Data Flow
| Layer | Component | What it does |
|---|---|---|
| Application | TrainApp, ChatApp | CLI, data loading, main loop |
| Adapter | OaAdapterLm | Train / Generate / Save / Load interface |
| Model | OaModule | Forward / Backward / Step (Level 1 or 2) |
| Engine | OaEngine | GPU dispatch, pipeline cache, memory |