Tutorial
Image Classification — Fashion-MNIST
End-to-end image classification on real IDX binary data. Covers dataset loading, model definition, training with hardware GPU timestamps, evaluation, predictions, and cross-device portability.
| Property | Value |
|---|---|
| API level | Level 1: OaModule + autograd |
| Dataset | Fashion-MNIST IDX binary, 60k train / 10k test |
| Architecture | 784 → 128 (ReLU) → 10 · 101,770 parameters |
| Optimizer | AdamW (lr=0.001, weight_decay=0.01) |
| Training | 2,000 steps × batch 64 |
| Test accuracy | 83.2% (verified, held-out test set) |
| GPU throughput | 244,473 samples/s (RTX 5090 Laptop, hardware timestamps) |
| Source | Tutorial/Ml/TutorialMnistClassifier.cpp |
1. The Dataset
Fashion-MNIST contains 60,000 training images and 10,000 test images across 10 clothing categories. Each image is a 28×28 grayscale bitmap whose pixels are stored as raw uint8 values (0–255) in IDX binary format. Images stay as raw bytes on the host and are normalized to [0, 1] inside Forward(), which eliminates a separate preprocessing step.
| Index | Label | Index | Label |
|---|---|---|---|
| 0 | T-shirt/top | 5 | Sandal |
| 1 | Trouser | 6 | Shirt |
| 2 | Pullover | 7 | Sneaker |
| 3 | Dress | 8 | Bag |
| 4 | Coat | 9 | Ankle boot |
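
To make the IDX layout concrete, here is a minimal sketch of a header parser. It relies only on the published IDX layout: two zero bytes, a type code (0x08 for unsigned byte), a dimension count, then one big-endian 32-bit size per dimension. `IdxHeader`, `ReadBigEndianU32`, and `ParseIdxHeader` are hypothetical names for illustration; they are not the tutorial's `LoadMnistIDX()`, whose implementation is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper type for illustration; not part of the OA API.
struct IdxHeader {
    std::uint8_t dataType = 0;        // 0x08 means unsigned byte
    std::vector<std::uint32_t> dims;  // e.g. {60000, 28, 28} for training images
    std::size_t payloadOffset = 0;    // byte offset of the first data element
};

// Reads a big-endian 32-bit integer from four consecutive bytes.
inline std::uint32_t ReadBigEndianU32(const std::uint8_t* p) {
    return (std::uint32_t(p[0]) << 24) | (std::uint32_t(p[1]) << 16) |
           (std::uint32_t(p[2]) << 8) | std::uint32_t(p[3]);
}

// Parses the IDX header from a buffer that holds the start of the file.
inline IdxHeader ParseIdxHeader(const std::vector<std::uint8_t>& bytes) {
    if (bytes.size() < 4 || bytes[0] != 0 || bytes[1] != 0)
        throw std::runtime_error("not an IDX file");
    IdxHeader header;
    header.dataType = bytes[2];
    const std::size_t numDims = bytes[3];
    header.payloadOffset = 4 + 4 * numDims;
    if (bytes.size() < header.payloadOffset)
        throw std::runtime_error("truncated IDX header");
    for (std::size_t d = 0; d < numDims; ++d)
        header.dims.push_back(ReadBigEndianU32(bytes.data() + 4 + 4 * d));
    return header;
}
```

For an image file the header is 16 bytes (three dimensions), and for a label file it is 8 bytes (one dimension); the raw uint8 payload follows immediately.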
2. The Model
A two-layer MLP: a hidden dense layer (784→128) with ReLU activation, followed by a logit head (128→10). OaFnMatrix::Scale inside Forward normalizes pixels to [0, 1] on the GPU.
TutorialMnistClassifier.cpp
```cpp
class OaMnistClassifier : public OaModule {
public:
    OaMnistClassifier() {
        Fc1_ = OaMakeSharedPtr<OaLinear>(784, 128);
        Fc2_ = OaMakeSharedPtr<OaLinear>(128, 10);
        RegisterModule("fc1", Fc1_);
        RegisterModule("fc2", Fc2_);
    }

    OaDeviceMatrix Forward(const OaDeviceMatrix& InX) override {
        auto x = OaFnMatrix::Scale(InX, 1.0f / 255.0f);  // normalize [0,255] → [0,1]
        auto h = OaFnMatrix::Relu(Fc1_->Forward(x));
        return Fc2_->Forward(h);
    }

private:
    OaSharedPtr<OaLinear> Fc1_, Fc2_;
};
```
| Layer | Output Shape | Params |
|---|---|---|
| input (Flatten 28×28) | [batch, 784] | 0 |
| dense (784→128 + ReLU) | [batch, 128] | 100,480 |
| dense_1 (128→10) | [batch, 10] | 1,290 |
| Total trainable parameters | | 101,770 |
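
The parameter counts can be checked with a few lines of arithmetic. This sketch assumes each dense layer holds one weight per input/output pair plus one bias per output, which is what the listed counts imply; `DenseParams` is a hypothetical helper, not an OA function.

```cpp
#include <cassert>
#include <cstdint>

// Parameters of a dense layer: in*out weights plus out biases.
constexpr std::int64_t DenseParams(std::int64_t in, std::int64_t out) {
    return in * out + out;
}

constexpr std::int64_t kHiddenParams = DenseParams(784, 128);  // 784*128 + 128
constexpr std::int64_t kLogitParams = DenseParams(128, 10);    // 128*10 + 10

// Compile-time checks against the figures in the layer summary.
static_assert(kHiddenParams == 100480, "hidden layer parameter count");
static_assert(kLogitParams == 1290, "logit head parameter count");
static_assert(kHiddenParams + kLogitParams == 101770, "total trainable parameters");
```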
3. Training Loop
OaFnTrain::StepAsync submits the current step to the GPU while the CPU records the next one. OaTrainTimer wraps vkCmdWriteTimestamp2 for sub-microsecond precision and is the source of all throughput numbers.
TutorialMnistClassifier.cpp
```cpp
OaFnGrad::SetMode(OaGradMode::Dynamic);
auto optimizer = OaMakeUniquePtr<OaAdamW>(model->AllParameterPtrs(), 0.001f);

auto& rt = *OaVkComputeEngine::GetGlobal();
OaTrainTimer trainTimer;
trainTimer.Init(rt, "mnist_training_step");

OaDeviceMatrix batchX, batchY;
for (OaI32 step = 0; step < 2000; ++step) {
    sampler.NextBatch(batchX, batchY);  // 64 random samples

    auto result = OaFnTrain::StepAsync(rt, [&] {
        OaFnTrain::RetainInActiveScope(batchX);
        OaFnTrain::RetainInActiveScope(batchY);
        trainTimer.Begin(rt);
        optimizer->ZeroGrad();
        auto logits = model->Forward(batchX);
        auto loss = OaFnMatrix::CrossEntropyLoss(logits, batchY);
        OaFnGrad::Backward(loss);
        optimizer->Step();
        trainTimer.End(rt);
        return loss;
    });
    (void)OaFnTrain::FlushLastAsync(rt);

    float lossVal = result.Loss.Item();
    trainTimer.Commit(rt.Device, kBatch);
}
trainTimer.PrintSummary();
```
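
For reference, the loss in the loop above can be sketched on the host. This assumes OaFnMatrix::CrossEntropyLoss computes standard softmax cross-entropy from logits, which the tutorial does not state explicitly; `SoftmaxCrossEntropy` is a hypothetical name and handles a single sample.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Softmax cross-entropy for one sample: -log(softmax(logits)[label]).
// Subtracting the largest logit first keeps exp() from overflowing.
inline double SoftmaxCrossEntropy(const std::vector<double>& logits, std::size_t label) {
    assert(!logits.empty() && label < logits.size());
    const double maxLogit = *std::max_element(logits.begin(), logits.end());
    double sumExp = 0.0;
    for (double v : logits) sumExp += std::exp(v - maxLogit);
    return std::log(sumExp) - (logits[label] - maxLogit);
}
```

With ten equal logits the result is ln 10 ≈ 2.3026, the loss of a model that assigns equal probability to every class. The step-1 training loss of 2.3311 reported for this tutorial is close to that value, as expected for a freshly initialized network.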
Training Curve — RTX 5090 Laptop, Vulkan 1.4
| Step | Train Loss | Train Acc | GPU ms | GPU sps | Elapsed |
|---|---|---|---|---|---|
| 1 | 2.3311 | 14.1% | warmup | — | 0.00s |
| 200 | 0.7967 | 71.9% | 0.272ms | 235,571 | 0.09s |
| 400 | 0.7715 | 67.2% | 0.262ms | 244,499 | 0.16s |
| 800 | 0.5351 | 81.2% | 0.262ms | 243,971 | 0.31s |
| 1200 | 0.4249 | 84.4% | 0.261ms | 244,841 | 0.46s |
| 2000 | 0.3811 | 84.4% | 0.262ms | 244,473 | 0.75s |
GPU time/step: 0.262 ± 0.002 ms (p50 = 0.262 ms · p95 = 0.264 ms · p99 = 0.270 ms)
GPU throughput: 244,473 samples/s — hardware timestamps, ±0.5% run-to-run
Wall throughput: 171,974 samples/s — includes CPU submit and Vulkan sync overhead
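
The relationship between GPU timestamps, step time, and throughput can be sketched as follows. In Vulkan, timestamp queries return tick counts, and VkPhysicalDeviceLimits::timestampPeriod gives the number of nanoseconds per tick. The function names below are hypothetical, and how OaTrainTimer performs this conversion internally is an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Converts a pair of GPU timestamp ticks to milliseconds.
// nsPerTick plays the role of VkPhysicalDeviceLimits::timestampPeriod.
inline double TicksToMs(std::uint64_t beginTicks, std::uint64_t endTicks, double nsPerTick) {
    assert(endTicks >= beginTicks);
    return static_cast<double>(endTicks - beginTicks) * nsPerTick * 1e-6;
}

// Samples per second for one step of batchSize samples that took stepMs milliseconds.
inline double SamplesPerSecond(int batchSize, double stepMs) {
    assert(stepMs > 0.0);
    return batchSize / (stepMs * 1e-3);
}
```

Plugging in the figures above, 64 samples in 0.262 ms gives about 244,275 samples/s. That is within 0.1% of the reported 244,473, and the gap is consistent with the step time being rounded to three decimals.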
4. Evaluate
TutorialMnistClassifier.cpp
```cpp
// Evaluate on full 10,000-image test set
OaU32 correct = 0;
for (OaU32 i = 0; i < numTest; i += kEvalBatch) {
    auto x = OaFnMatrix::FromBytes(
        OaSpan<const OaU8>(testImages.data() + i * 784, kEvalBatch * 784),
        OaShape2D(kEvalBatch, 784));
    auto preds = Predict(*model, x);
    for (OaI32 j = 0; j < kEvalBatch; ++j) {
        if (preds[j].ClassIdx == testLabels[i + j]) ++correct;
    }
}
float testAcc = 100.0f * correct / numTest;
printf("Test accuracy: %.2f%%\n", testAcc);
// → Test accuracy: 83.23%
```
Test accuracy: 83.23% on 10,000 held-out images never seen during training.
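
The evaluation loop reads a full kEvalBatch of samples on every iteration, which is only safe when numTest is a multiple of kEvalBatch. The excerpt does not show the value of kEvalBatch, so whether this matters for the tutorial is unknown. As a hedged sketch, a hypothetical helper like `BatchRanges` below clamps the final batch so that any batch size stays in bounds.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Splits [0, numSamples) into consecutive [begin, end) ranges of at most batchSize.
// The last range is shortened when numSamples is not a multiple of batchSize.
inline std::vector<std::pair<std::size_t, std::size_t>>
BatchRanges(std::size_t numSamples, std::size_t batchSize) {
    assert(batchSize > 0);
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    for (std::size_t begin = 0; begin < numSamples; begin += batchSize)
        ranges.emplace_back(begin, std::min(begin + batchSize, numSamples));
    return ranges;
}
```

For 10,000 test images and a batch of 256, this yields 40 ranges, the last one covering samples 9,984 to 9,999.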
5. Predictions
TutorialMnistClassifier.cpp
```cpp
auto logits = model->Forward(testX);
auto probs = OaFnMatrix::Softmax(logits, -1);
// host argmax + confidence per sample
```
| # | Actual | Predicted | Confidence | Correct |
|---|---|---|---|---|
| 0 | Ankle boot | Ankle boot | 60.4% | ✓ |
| 1 | Pullover | Pullover | 96.8% | ✓ |
| 2 | Trouser | Trouser | 100.0% | ✓ |
| 4 | Shirt | Shirt | 50.3% | ✓ |
| 6 | Coat | Coat | 84.0% | ✓ |
6. Gradient Mode Comparison
| Mode | Test Acc | Wall time | Wall sps | GPU sps | GPU Speedup |
|---|---|---|---|---|---|
| Dynamic | 82.75% | 0.74s | 172,267 | 245,343 | 1.00× |
| Compiled | 82.98% | 0.78s | 163,433 | 271,672 | 1.11× |
| Auto | 82.96% | 0.73s | 174,182 | 248,193 | 1.01× |
Compiled mode runs 1.11× faster on the GPU but about 5% slower in wall-clock terms: each minibatch allocates new activation buffer handles, which causes graph recompiles instead of replays. Hardware timestamps expose the difference; judged by wall-clock time alone, Compiled would appear to be the slowest mode.
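
The speedup figures follow directly from the throughput columns. A minimal sketch, with `ModeResult` and the two ratio functions as hypothetical names for illustration:

```cpp
#include <cassert>

// One row of the mode comparison table.
struct ModeResult {
    double wallSps;  // samples per second measured by wall clock
    double gpuSps;   // samples per second measured by GPU timestamps
};

// Ratio of GPU throughput to a baseline; above 1 means faster on the GPU.
inline double GpuSpeedup(const ModeResult& mode, const ModeResult& baseline) {
    assert(baseline.gpuSps > 0.0);
    return mode.gpuSps / baseline.gpuSps;
}

// Ratio of wall-clock throughput to a baseline; below 1 means slower end to end.
inline double WallSpeedup(const ModeResult& mode, const ModeResult& baseline) {
    assert(baseline.wallSps > 0.0);
    return mode.wallSps / baseline.wallSps;
}
```

With Dynamic as the baseline, Compiled gives 271,672 / 245,343 ≈ 1.11 on the GPU and 163,433 / 172,267 ≈ 0.95 by wall clock.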
7. Cross-Device Portability
Same binary, with no source changes and no recompilation. Test accuracy agrees to within about 0.1 percentage points across devices:
| Metric | RTX 5090 Laptop | Intel Arc (ARL) iGPU |
|---|---|---|
| Test accuracy | 83.23% | 83.12% |
| GPU time/step | 0.262 ms | 0.848 ms |
| GPU throughput | 244,473 sps | 75,499 sps |
| Wall time (2k steps) | 0.75s | 2.72s |
Run.sh
```sh
# NVIDIA RTX 5090 (default discrete)
./Bin/Release/Tutorial/tutorial_mnist_classifier

# Intel iGPU
OA_DEVICE=integrated ./Bin/Release/Tutorial/tutorial_mnist_classifier

# AMD iGPU
OA_DEVICE=integrated ./Bin/Release/Tutorial/tutorial_mnist_classifier

# CPU software Vulkan (lavapipe, for CI without a GPU)
OA_DEVICE=cpu ./Bin/Release/Tutorial/tutorial_mnist_classifier
```
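
Environment-based device selection of this kind can be sketched in a few lines. The enum and function names below are hypothetical, and the fallback to a discrete GPU for unset or unrecognized values is an assumption based on the "default discrete" comment; how OA actually parses OA_DEVICE is not shown in this tutorial.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical preference type for illustration; not part of the OA API.
enum class DevicePreference { Discrete, Integrated, Cpu };

// Maps the textual setting to a preference.
// Assumption: empty or unrecognized values fall back to the discrete GPU.
inline DevicePreference ParseDevicePreference(const std::string& value) {
    if (value == "integrated") return DevicePreference::Integrated;
    if (value == "cpu") return DevicePreference::Cpu;
    return DevicePreference::Discrete;
}

// Reads OA_DEVICE from the environment; an unset variable is treated as empty.
inline DevicePreference DevicePreferenceFromEnv() {
    const char* raw = std::getenv("OA_DEVICE");
    return ParseDevicePreference(raw ? raw : "");
}
```

Keeping the parsing separate from the environment lookup makes the mapping easy to unit-test without touching process state.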
Build & Run
Build.sh
```sh
cmake --preset release
ninja -C Build/Release tutorial_mnist_classifier

# Run
./Bin/Release/Tutorial/tutorial_mnist_classifier

# Custom dataset path
OA_MNIST_DATA=/path/to/fashion_mnist ./Bin/Release/Tutorial/tutorial_mnist_classifier
```
Full Source
The complete tutorial, including the dataset loader, batch sampler, evaluation, predictions, and mode comparison, is in Tutorial/Ml/TutorialMnistClassifier.cpp. An excerpt follows:
TutorialMnistClassifier.cpp
```cpp
// ═══════════════════════════════════════════════════════════════════════════
// OA Tutorial: Fashion-MNIST Image Classification - Real Dataset
// Level 1 API - OaModule + OaLinear + OaAdamW + OaFnGrad::Backward()
// ═══════════════════════════════════════════════════════════════════════════
//
// Parallel structure to the TensorFlow Keras classification tutorial:
// https://www.tensorflow.org/tutorials/keras/classification
//
// TF Keras                        OA C++
// ─────────────────────────────   ─────────────────────────────────────
// fashion_mnist.load_data()       LoadMnistIDX() - inline IDX parser
// train_images / 255.0            OaFnMatrix::Scale(x, 1/255) in Forward
// tf.keras.Sequential([...])      class OaMnistClassifier : OaModule
// layers.Dense(128, 'relu')       OaLinear(784, 128) + Relu()
// layers.Dense(10)                OaLinear(128, 10)
// model.compile('adam', ...)      OaAdamW(params, lr=0.001)
// model.fit(...)                  training loop + OaFnGrad::Backward()
// model.predict(x_test)           OaFnMatrix::Softmax() + host argmax
//
// Source: oa/Tutorial/Ml/TutorialMnistClassifier.cpp
// ═══════════════════════════════════════════════════════════════════════════
#include <Oa/Ml.h>
#include <Oa/Ml/Metrics.h>

class OaMnistClassifier : public OaModule {
public:
    OaMnistClassifier() {
        Fc1_ = OaMakeSharedPtr<OaLinear>(784, 128);
        Fc2_ = OaMakeSharedPtr<OaLinear>(128, 10);
        RegisterModule("fc1", Fc1_);
        RegisterModule("fc2", Fc2_);
    }

    OaDeviceMatrix Forward(const OaDeviceMatrix& InX) override {
        auto x = OaFnMatrix::Scale(InX, 1.0f / 255.0f);
        auto h = OaFnMatrix::Relu(Fc1_->Forward(x));
        return Fc2_->Forward(h);
    }

private:
    OaSharedPtr<OaLinear> Fc1_, Fc2_;
};

// See full source at oa/Tutorial/Ml/TutorialMnistClassifier.cpp
```