Machine Learning

OaML

GPU-native training and inference on Vulkan compute. No CUDA dependency, no Python runtime, no separate inference stack. The same binary trains, checkpoints, and serves.

244K

Samples/s — MNIST (RTX 5090)

165K

Samples/s — text model (RTX 5090)

83.2%

Test accuracy — Fashion-MNIST

0

Dependencies

Welcome to the Matrix

Everything in OaML operates on OaDeviceMatrix, a GPU-resident tensor. There is no separate concept for weights, activations, or gradients at the type level. The same type flows through layers, optimizers, and loss functions. OaFnMatrix is the stateless namespace for all math operations: activations, normalisation, loss, and element-wise ops. Layers like OaLinear and OaEmbedding are thin parameter containers that call into OaFnMatrix. Vision preprocessing, model arithmetic, and kernel dispatch all go through the same matrix path.
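The split between one matrix type, a stateless op namespace, and thin layers can be sketched on the CPU. This is an illustration of the design only, not OaML code: Matrix, Fn, Affine, and Linear are hypothetical stand-ins, and the constant weight fill exists only to make the sketch deterministic.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU stand-in for a GPU-resident matrix (not OaDeviceMatrix).
struct Matrix {
    std::size_t rows = 0, cols = 0;
    std::vector<float> data;  // row-major storage
    Matrix(std::size_t r, std::size_t c, float fill = 0.0f)
        : rows(r), cols(c), data(r * c, fill) {}
    float& at(std::size_t r, std::size_t c) { return data[r * cols + c]; }
    float at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};

// Stateless ops: free functions that take matrices and return matrices.
namespace Fn {
inline Matrix Scale(const Matrix& x, float s) {
    Matrix y = x;
    for (float& v : y.data) v *= s;
    return y;
}
inline Matrix Relu(const Matrix& x) {
    Matrix y = x;
    for (float& v : y.data) v = std::max(v, 0.0f);
    return y;
}
// y = x * w + b, with x [n, in], w [in, out], b [1, out].
inline Matrix Affine(const Matrix& x, const Matrix& w, const Matrix& b) {
    assert(x.cols == w.rows && b.rows == 1 && b.cols == w.cols);
    Matrix y(x.rows, w.cols);
    for (std::size_t i = 0; i < x.rows; ++i) {
        for (std::size_t j = 0; j < w.cols; ++j) {
            float acc = b.at(0, j);
            for (std::size_t k = 0; k < x.cols; ++k) acc += x.at(i, k) * w.at(k, j);
            y.at(i, j) = acc;
        }
    }
    return y;
}
}  // namespace Fn

// A layer only owns parameters; all math is delegated to Fn.
struct Linear {
    Matrix weight, bias;
    // Constant fill keeps the sketch deterministic; real layers use random init.
    Linear(std::size_t in, std::size_t out)
        : weight(in, out, 0.5f), bias(1, out, -1.0f) {}
    Matrix Forward(const Matrix& x) const { return Fn::Affine(x, weight, bias); }
};
```

Because inputs, parameters, and outputs are all the same type, a forward pass is plain function composition, for example Fn::Relu(fc.Forward(Fn::Scale(x, 0.5f))).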

API surface

Type / namespace        Role
OaModule                Parameter container. Subclass and override Forward.
OaLinear, OaEmbedding   Built-in layers. Register with RegisterModule.
OaFnMatrix::*           Stateless ops: Relu, Tanh, Scale, CrossEntropyLoss, Softmax, …
OaAdamW                 Optimizer. Takes the parameter list from AllParameterPtrs().
OaFnGrad::Backward      Reverse-mode autodiff from a scalar loss.
OaFnGrad::SetMode       Dynamic for prototyping, Compiled or Auto for production.
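For readers unfamiliar with the optimizer in the table, here is a minimal CPU sketch of the standard AdamW update (Adam with decoupled weight decay). It does not reproduce OaAdamW internals: AdamWState and AdamWStep are hypothetical names, and the hyperparameter defaults are the commonly used ones, assumed here rather than taken from OaML.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical optimizer state for one parameter vector.
struct AdamWState {
    float lr = 1e-3f, beta1 = 0.9f, beta2 = 0.999f, eps = 1e-8f, weightDecay = 0.01f;
    long step = 0;
    std::vector<float> m, v;  // first and second moment estimates
};

// One AdamW step: update the moments, correct their bias, then apply the
// adaptive step and the decoupled weight decay to the parameters in place.
inline void AdamWStep(AdamWState& s, std::vector<float>& param,
                      const std::vector<float>& grad) {
    assert(param.size() == grad.size());
    if (s.m.empty()) {
        s.m.assign(param.size(), 0.0f);
        s.v.assign(param.size(), 0.0f);
    }
    ++s.step;
    const float c1 = 1.0f - std::pow(s.beta1, static_cast<float>(s.step));
    const float c2 = 1.0f - std::pow(s.beta2, static_cast<float>(s.step));
    for (std::size_t i = 0; i < param.size(); ++i) {
        s.m[i] = s.beta1 * s.m[i] + (1.0f - s.beta1) * grad[i];
        s.v[i] = s.beta2 * s.v[i] + (1.0f - s.beta2) * grad[i] * grad[i];
        const float mHat = s.m[i] / c1;  // bias-corrected first moment
        const float vHat = s.v[i] / c2;  // bias-corrected second moment
        param[i] -= s.lr * (mHat / (std::sqrt(vHat) + s.eps) + s.weightDecay * param[i]);
    }
}
```

On the first step the bias correction makes the adaptive term close to the sign of the gradient, so each parameter moves by roughly lr regardless of gradient scale.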

Tutorial — Image Classification

Fashion-MNIST end-to-end: load IDX binary, train 2,000 steps, evaluate on held-out test set. Verified at 83.2% test accuracy and 244K samples/s on RTX 5090 Laptop GPU (Vulkan 1.4, hardware timestamps). Full source: Tutorial/Ml/TutorialMnistClassifier.cpp.
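Fashion-MNIST ships in the same IDX format as MNIST: a 4-byte magic (two zero bytes, a data-type byte, a dimension count) followed by one big-endian 32-bit size per dimension. The sketch below parses that header from an in-memory buffer. IdxHeader, ReadBe32, and ParseIdxHeader are hypothetical helpers for illustration, not the loader used by the tutorial.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical parsed form of an IDX header.
struct IdxHeader {
    std::uint8_t dtype = 0;           // 0x08 means unsigned byte
    std::vector<std::uint32_t> dims;  // e.g. {60000, 28, 28} for training images
    std::size_t dataOffset = 0;       // byte offset of the first element
};

// Read a big-endian 32-bit unsigned integer.
inline std::uint32_t ReadBe32(const std::uint8_t* p) {
    return (static_cast<std::uint32_t>(p[0]) << 24) |
           (static_cast<std::uint32_t>(p[1]) << 16) |
           (static_cast<std::uint32_t>(p[2]) << 8) |
           static_cast<std::uint32_t>(p[3]);
}

// Returns false if the buffer is too short or the magic prefix is wrong.
inline bool ParseIdxHeader(const std::vector<std::uint8_t>& bytes, IdxHeader& out) {
    if (bytes.size() < 4 || bytes[0] != 0 || bytes[1] != 0) return false;
    const std::size_t ndim = bytes[3];
    const std::size_t offset = 4 + 4 * ndim;
    if (bytes.size() < offset) return false;
    out.dtype = bytes[2];
    out.dims.clear();
    for (std::size_t i = 0; i < ndim; ++i) {
        out.dims.push_back(ReadBe32(bytes.data() + 4 + 4 * i));
    }
    out.dataOffset = offset;
    return true;
}
```

For an image file the pixel bytes start at dataOffset, and each image occupies dims[1] * dims[2] consecutive bytes.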

TutorialMnistClassifier.cpp

    // Fashion-MNIST: 83.2% test accuracy, <1s training
    // Tutorial/Ml/TutorialMnistClassifier.cpp
    class OaMnistClassifier : public OaModule {
    public:
        OaMnistClassifier() {
            // Two-layer MLP: 784 input pixels -> 128 hidden units -> 10 classes.
            Fc1_ = OaMakeSharedPtr<OaLinear>(784, 128);
            Fc2_ = OaMakeSharedPtr<OaLinear>(128, 10);
            RegisterModule("fc1", Fc1_);
            RegisterModule("fc2", Fc2_);
        }

        OaDeviceMatrix Forward(const OaDeviceMatrix& x) override {
            // Scale raw 0..255 pixel values into [0, 1].
            auto h = OaFnMatrix::Scale(x, 1.0f / 255.0f);
            h = OaFnMatrix::Relu(Fc1_->Forward(h));
            return Fc2_->Forward(h);  // logits
        }

    private:
        OaSharedPtr<OaLinear> Fc1_, Fc2_;
    };

    int main() {
        auto rt = OaEngine::Create({.AppName = "Mnist"}).Unwrap();
        OaMnistClassifier model;
        OaAdamW opt(model.AllParameterPtrs(), 0.001f);
        OaFnGrad::SetMode(OaGradMode::Dynamic);

        // Dataset loading and the setup of sampler, batchX, and batchY are
        // omitted from this excerpt; see the full source.
        for (OaI32 step = 0; step < 2000; ++step) {
            sampler.NextBatch(batchX, batchY);
            auto logits = model.Forward(batchX);
            auto loss = OaFnMatrix::CrossEntropyLoss(logits, batchY);
            OaFnGrad::Backward(loss);
            opt.Step();
            opt.ZeroGrad();
        }
    }
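The training loop above feeds raw logits to a cross-entropy loss. As a reference for what such a loss computes, here is a CPU sketch of softmax cross-entropy over integer class labels, averaged over the batch and stabilised with the log-sum-exp trick. That OaFnMatrix::CrossEntropyLoss uses exactly this reduction and label encoding is an assumption; CrossEntropyFromLogits is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// logits: row-major [batch, classes]; labels: one class index per row.
// Returns the mean negative log-likelihood of the labelled classes.
inline float CrossEntropyFromLogits(const std::vector<float>& logits,
                                    const std::vector<int>& labels,
                                    std::size_t classes) {
    assert(classes > 0 && !labels.empty());
    assert(logits.size() == labels.size() * classes);
    double total = 0.0;
    for (std::size_t i = 0; i < labels.size(); ++i) {
        assert(labels[i] >= 0 && static_cast<std::size_t>(labels[i]) < classes);
        const float* row = logits.data() + i * classes;
        // Subtracting the row maximum keeps exp() from overflowing.
        const float maxLogit = *std::max_element(row, row + classes);
        double sumExp = 0.0;
        for (std::size_t c = 0; c < classes; ++c) {
            sumExp += std::exp(static_cast<double>(row[c]) - maxLogit);
        }
        const double logSumExp = maxLogit + std::log(sumExp);
        total += logSumExp - row[labels[i]];  // -log softmax(row)[label]
    }
    return static_cast<float>(total / static_cast<double>(labels.size()));
}
```

With all-equal logits over 10 classes the loss is ln(10), about 2.30, which is the value an untrained classifier should start near.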

Tutorial — Text Generation

Character-level language model: 27-token vocabulary, 8-character context window, Embedding → Tanh MLP → logits. 300 steps, batch 32. Final loss 0.897, 165K samples/s on RTX 5090 Laptop GPU. Validates the same OaEmbedding + OaLinear + OaAdamW path used by larger LLM-shaped models. Full source: Tutorial/Ml/TutorialTextGenerationRnn.cpp.
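To make the 8-character context window concrete, the sketch below turns one word into (context, target) training pairs. The token mapping is an assumption for illustration: id 0 is used as padding and end marker, and 'a' to 'z' map to 1 to 26, which yields 27 tokens. Example, EncodeChar, and MakeExamples are hypothetical names, not part of the tutorial source.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <vector>

constexpr std::size_t kContext = 8;  // characters of history per example

// One training pair: the previous kContext token ids and the next token id.
struct Example {
    std::array<int, kContext> context;
    int target;
};

// Assumed mapping: 'a'..'z' -> 1..26, anything else -> 0 (padding / end marker).
inline int EncodeChar(char c) {
    return (c >= 'a' && c <= 'z') ? (c - 'a' + 1) : 0;
}

// Slide a window over the word, starting from an all-padding context and
// finishing with the end marker as the last target.
inline std::vector<Example> MakeExamples(const std::string& word) {
    std::vector<Example> out;
    std::array<int, kContext> ctx{};  // zero-initialised: all padding
    for (std::size_t i = 0; i <= word.size(); ++i) {
        const int target = (i < word.size()) ? EncodeChar(word[i]) : 0;
        Example ex;
        ex.context = ctx;
        ex.target = target;
        out.push_back(ex);
        // Shift the window left by one and append the token just predicted.
        for (std::size_t j = 0; j + 1 < kContext; ++j) ctx[j] = ctx[j + 1];
        ctx[kContext - 1] = target;
    }
    return out;
}
```

A word of length n produces n + 1 examples; a batch of 32 such contexts is a [32, 8] matrix of token ids.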

TutorialTextGenerationRnn.cpp

    // Char-level language model: 300 steps, final loss 0.897
    // Tutorial/Ml/TutorialTextGenerationRnn.cpp
    class OaTextModel : public OaModule {
    public:
        OaTextModel() {
            Embed_ = OaMakeSharedPtr<OaEmbedding>(27, 16);  // 27 tokens -> 16 dims
            Fc1_ = OaMakeSharedPtr<OaLinear>(128, 64);      // 8 positions x 16 dims = 128 inputs
            Head_ = OaMakeSharedPtr<OaLinear>(64, 27);      // one logit per token
            RegisterModule("embed", Embed_);
            RegisterModule("fc1", Fc1_);
            RegisterModule("head", Head_);
        }

        OaDeviceMatrix Forward(const OaDeviceMatrix& x) override {
            auto e = Embed_->Forward(x);                  // [batch*8, 16]
            // Fc1_ consumes e as [batch, 128]; this excerpt does not show
            // where that reshape happens.
            auto h = OaFnMatrix::Tanh(Fc1_->Forward(e));  // [batch, 64]
            return Head_->Forward(h);                     // [batch, 27]
        }

    private:
        OaSharedPtr<OaEmbedding> Embed_;
        OaSharedPtr<OaLinear> Fc1_, Head_;
    };
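The shape comments in the model go from [batch*8, 16] after the embedding to a linear layer with 128 inputs. A CPU sketch of that bookkeeping, assuming row-major storage: an embedding lookup gathers one table row per token id, and the result can be viewed as [batch, 128] without moving data. Mat, EmbeddingLookup, and FlattenContext are hypothetical helpers, not OaML functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major CPU matrix.
struct Mat {
    std::size_t rows, cols;
    std::vector<float> data;
};

// Gather one table row per token id: ids holds batch*context entries,
// so the result is [batch*context, dim].
inline Mat EmbeddingLookup(const Mat& table, const std::vector<int>& ids) {
    Mat out{ids.size(), table.cols, std::vector<float>(ids.size() * table.cols)};
    for (std::size_t i = 0; i < ids.size(); ++i) {
        assert(ids[i] >= 0 && static_cast<std::size_t>(ids[i]) < table.rows);
        const float* src = table.data.data() + static_cast<std::size_t>(ids[i]) * table.cols;
        std::copy(src, src + table.cols, out.data.begin() + i * table.cols);
    }
    return out;
}

// View [batch*context, dim] as [batch, context*dim]. The row-major bytes are
// already in the right order, so only the shape changes.
inline Mat FlattenContext(Mat x, std::size_t context) {
    assert(context > 0 && x.rows % context == 0);
    x.rows /= context;
    x.cols *= context;
    return x;
}
```

With a 27 x 16 table and an 8-token context, FlattenContext yields rows of 8 * 16 = 128 values, matching the input width of the first linear layer.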