Machine Learning
OaML
GPU-native training and inference on Vulkan compute. No CUDA dependency, no Python runtime, no separate inference stack. The same binary trains, checkpoints, and serves.
| Metric | Value |
|---|---|
| Samples/s, MNIST (RTX 5090) | 244K |
| Samples/s, text model (RTX 5090) | 165K |
| Test accuracy, Fashion-MNIST | 83.2% |
| Dependencies | 0 |
Welcome to the Matrix
Everything in OaML operates on OaDeviceMatrix, a GPU-resident tensor. There is no separate concept for weights, activations, or gradients at the type level: the same type flows through layers, optimizers, and loss functions.

OaFnMatrix is the stateless namespace for all math operations: activations, normalisation, loss, and element-wise ops. Layers like OaLinear and OaEmbedding are thin parameter containers that call into OaFnMatrix. Vision preprocessing, model arithmetic, and kernel dispatch all go through the same matrix path.
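The pattern described here (one matrix type, a stateless function namespace, and layers that only hold parameters) can be sketched on the CPU in plain C++. This is an illustration of the design, not OaML code: `Matrix`, `fn::Scale`, `fn::Relu`, `fn::Affine`, and `Linear` are hypothetical stand-ins, and the constant fill values for the weights and bias are arbitrary choices that make the result easy to check.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU stand-in for a GPU-resident matrix (row-major storage).
struct Matrix {
    std::size_t rows = 0, cols = 0;
    std::vector<float> data;
    Matrix(std::size_t r, std::size_t c, float fill = 0.0f)
        : rows(r), cols(c), data(r * c, fill) {}
    float& at(std::size_t r, std::size_t c) { return data[r * cols + c]; }
    float at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};

// Stateless math: free functions that take matrices and return matrices.
namespace fn {
inline Matrix Scale(const Matrix& x, float s) {
    Matrix y = x;
    for (float& v : y.data) v *= s;
    return y;
}

inline Matrix Relu(const Matrix& x) {
    Matrix y = x;
    for (float& v : y.data) v = std::max(v, 0.0f);
    return y;
}

// y = x * w + b, with x: [n, in], w: [in, out], b: [1, out].
inline Matrix Affine(const Matrix& x, const Matrix& w, const Matrix& b) {
    assert(x.cols == w.rows && b.rows == 1 && b.cols == w.cols);
    Matrix y(x.rows, w.cols);
    for (std::size_t i = 0; i < x.rows; ++i) {
        for (std::size_t j = 0; j < w.cols; ++j) {
            float acc = b.at(0, j);
            for (std::size_t k = 0; k < x.cols; ++k) acc += x.at(i, k) * w.at(k, j);
            y.at(i, j) = acc;
        }
    }
    return y;
}
}  // namespace fn

// A layer only owns parameters; all of the math lives in fn::.
struct Linear {
    Matrix weight, bias;
    Linear(std::size_t in, std::size_t out)
        : weight(in, out, 0.5f), bias(1, out, 0.25f) {}  // arbitrary constants
    Matrix Forward(const Matrix& x) const { return fn::Affine(x, weight, bias); }
};
```

With these definitions, `fn::Relu(fc.Forward(fn::Scale(x, 0.5f)))` chains input, activations, and parameters through the same `Matrix` type, which is the property the text attributes to OaDeviceMatrix.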
API surface
| Type / namespace | Role |
|---|---|
| OaModule | Parameter container. Subclass and override Forward. |
| OaLinear, OaEmbedding | Built-in layers. Register with RegisterModule. |
| OaFnMatrix::* | Stateless ops: Relu, Tanh, Scale, CrossEntropyLoss, Softmax, … |
| OaAdamW | Optimizer. Takes the parameter list from AllParameterPtrs(). |
| OaFnGrad::Backward | Reverse-mode autodiff from a scalar loss. |
| OaFnGrad::SetMode | Dynamic for prototyping, Compiled or Auto for production. |
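For readers unfamiliar with what an AdamW optimizer does on each `Step()`, the sketch below shows the standard update rule on a flat CPU vector. It is not OaAdamW's implementation: `AdamWConfig`, `AdamWState`, and `AdamWStep` are hypothetical names, and every hyperparameter other than the learning rate of 0.001 (which the tutorial passes explicitly) is an assumed common default.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Only lr = 0.001 appears in the tutorial; the other values are assumed defaults.
struct AdamWConfig {
    float lr = 0.001f;
    float beta1 = 0.9f;
    float beta2 = 0.999f;
    float eps = 1e-8f;
    float weightDecay = 0.01f;
};

struct AdamWState {
    std::vector<float> m, v;  // first and second moment estimates
    int step = 0;
    explicit AdamWState(std::size_t n) : m(n, 0.0f), v(n, 0.0f) {}
};

// One AdamW update over a flat parameter vector.
void AdamWStep(std::vector<float>& params, const std::vector<float>& grads,
               AdamWState& state, const AdamWConfig& cfg) {
    assert(params.size() == grads.size() && params.size() == state.m.size());
    state.step += 1;
    // Bias-correction terms compensate for the moments starting at zero.
    const float c1 = 1.0f - std::pow(cfg.beta1, static_cast<float>(state.step));
    const float c2 = 1.0f - std::pow(cfg.beta2, static_cast<float>(state.step));
    for (std::size_t i = 0; i < params.size(); ++i) {
        // Decoupled weight decay: shrink the weight directly, not via the gradient.
        params[i] -= cfg.lr * cfg.weightDecay * params[i];
        state.m[i] = cfg.beta1 * state.m[i] + (1.0f - cfg.beta1) * grads[i];
        state.v[i] = cfg.beta2 * state.v[i] + (1.0f - cfg.beta2) * grads[i] * grads[i];
        const float mHat = state.m[i] / c1;
        const float vHat = state.v[i] / c2;
        params[i] -= cfg.lr * mHat / (std::sqrt(vHat) + cfg.eps);
    }
}
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so each parameter moves by roughly the learning rate regardless of gradient magnitude.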
Tutorial — Image Classification
Fashion-MNIST end-to-end: load IDX binary, train 2,000 steps, evaluate on held-out test set. Verified at 83.2% test accuracy and 244K samples/s on RTX 5090 Laptop GPU (Vulkan 1.4, hardware timestamps). Full source: Tutorial/Ml/TutorialMnistClassifier.cpp.
TutorialMnistClassifier.cpp
```cpp
// Fashion-MNIST: 83.2% test accuracy, <1s training
// Tutorial/Ml/TutorialMnistClassifier.cpp
class OaMnistClassifier : public OaModule {
public:
    OaMnistClassifier() {
        Fc1_ = OaMakeSharedPtr<OaLinear>(784, 128);  // 28x28 pixels -> 128 hidden units
        Fc2_ = OaMakeSharedPtr<OaLinear>(128, 10);   // 128 hidden units -> 10 classes
        RegisterModule("fc1", Fc1_);
        RegisterModule("fc2", Fc2_);
    }

    OaDeviceMatrix Forward(const OaDeviceMatrix& x) override {
        auto h = OaFnMatrix::Scale(x, 1.0f / 255.0f);  // bytes -> [0, 1]
        h = OaFnMatrix::Relu(Fc1_->Forward(h));
        return Fc2_->Forward(h);  // raw logits
    }

private:
    OaSharedPtr<OaLinear> Fc1_, Fc2_;
};

int main() {
    auto rt = OaEngine::Create({.AppName = "Mnist"}).Unwrap();
    OaMnistClassifier model;
    OaAdamW opt(model.AllParameterPtrs(), 0.001f);
    OaFnGrad::SetMode(OaGradMode::Dynamic);

    // Data loading is omitted from this excerpt; sampler, batchX, and batchY
    // are set up in the full source.
    for (OaI32 step = 0; step < 2000; ++step) {
        sampler.NextBatch(batchX, batchY);
        auto logits = model.Forward(batchX);
        auto loss = OaFnMatrix::CrossEntropyLoss(logits, batchY);
        OaFnGrad::Backward(loss);
        opt.Step();
        opt.ZeroGrad();
    }
}
```
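The training loop feeds raw logits straight into a cross-entropy loss. The CPU sketch below shows what such a loss typically computes, under the assumption (not confirmed by the excerpt) that the loss applies softmax internally, takes one integer class index per sample, and averages over the batch. `CrossEntropyFromLogits` is a hypothetical helper, not part of OaML.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Mean softmax cross-entropy over a batch.
// logits: row-major [batch, classes]; labels: one class index per row.
float CrossEntropyFromLogits(const std::vector<float>& logits,
                             const std::vector<int>& labels,
                             std::size_t classes) {
    assert(classes > 0 && !labels.empty());
    assert(logits.size() == labels.size() * classes);
    float total = 0.0f;
    for (std::size_t i = 0; i < labels.size(); ++i) {
        assert(labels[i] >= 0 && static_cast<std::size_t>(labels[i]) < classes);
        const float* row = logits.data() + i * classes;
        // Subtract the row maximum before exponentiating to avoid overflow.
        const float maxLogit = *std::max_element(row, row + classes);
        float sumExp = 0.0f;
        for (std::size_t j = 0; j < classes; ++j) sumExp += std::exp(row[j] - maxLogit);
        // -log softmax(row)[label] = logsumexp(row) - row[label]
        total += (maxLogit + std::log(sumExp)) - row[labels[i]];
    }
    return total / static_cast<float>(labels.size());
}
```

A handy sanity check follows from this form: when all 10 logits are equal, the loss is ln 10 ≈ 2.30, so a freshly initialised 10-class classifier should start near that value.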
Tutorial — Text Generation
Character-level language model: 27-token vocabulary, 8-character context window, Embedding → Tanh MLP → logits. 300 steps, batch 32. Final loss 0.897, 165K samples/s on RTX 5090 Laptop GPU. Validates the same OaEmbedding + OaLinear + OaAdamW path used by larger LLM-shaped models. Full source: Tutorial/Ml/TutorialTextGenerationRnn.cpp.
TutorialTextGenerationRnn.cpp
```cpp
// Char-level language model: 300 steps, final loss 0.897
// Tutorial/Ml/TutorialTextGenerationRnn.cpp
class OaTextModel : public OaModule {
public:
    OaTextModel() {
        Embed_ = OaMakeSharedPtr<OaEmbedding>(27, 16);  // 27 tokens -> 16-dim vectors
        Fc1_ = OaMakeSharedPtr<OaLinear>(128, 64);      // 8 positions * 16 dims = 128 inputs
        Head_ = OaMakeSharedPtr<OaLinear>(64, 27);      // logits over the 27-token vocabulary
        RegisterModule("embed", Embed_);
        RegisterModule("fc1", Fc1_);
        RegisterModule("head", Head_);
    }

    OaDeviceMatrix Forward(const OaDeviceMatrix& x) override {
        auto e = Embed_->Forward(x);                  // [batch*8, 16]
        auto h = OaFnMatrix::Tanh(Fc1_->Forward(e));  // [batch, 64]
        return Head_->Forward(h);                     // [batch, 27]
    }

private:
    OaSharedPtr<OaEmbedding> Embed_;
    OaSharedPtr<OaLinear> Fc1_, Head_;
};
```
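The shape comments in the listing go from [batch*8, 16] after the embedding to [batch, 64] after the first linear layer, which works if the 8 embedded characters of each sample are read as one 128-wide row. The CPU sketch below illustrates that reading; it is one plausible interpretation, since the excerpt does not show how OaML performs the reinterpretation. `EmbeddingTable` and `EmbedAndFlatten` are hypothetical names, and the table is filled with a simple pattern instead of learned weights.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU embedding table: vocab rows, dim columns, row-major.
struct EmbeddingTable {
    std::size_t vocab, dim;
    std::vector<float> weight;
    EmbeddingTable(std::size_t v, std::size_t d) : vocab(v), dim(d), weight(v * d) {
        // Deterministic fill so lookups are easy to check: row t holds the value t.
        for (std::size_t t = 0; t < v; ++t)
            for (std::size_t k = 0; k < d; ++k)
                weight[t * d + k] = static_cast<float>(t);
    }
};

// Looks up each token id and writes the vectors back to back.
// tokens: [batch * context] ids; result: batch * context * dim floats, which
// reads as [batch*context, dim] or, without copying, as [batch, context*dim].
std::vector<float> EmbedAndFlatten(const EmbeddingTable& table,
                                   const std::vector<int>& tokens) {
    std::vector<float> out;
    out.reserve(tokens.size() * table.dim);
    for (int id : tokens) {
        assert(id >= 0 && static_cast<std::size_t>(id) < table.vocab);
        const float* row = table.weight.data() + static_cast<std::size_t>(id) * table.dim;
        out.insert(out.end(), row, row + table.dim);
    }
    return out;
}
```

With a 27-token vocabulary, 16-dimensional embeddings, and a context of 8, each sample occupies 8 × 16 = 128 consecutive floats, matching the input width of the first OaLinear layer in the listing.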