Tutorial

Image Classification — Fashion-MNIST

End-to-end image classification on real IDX binary data. Covers dataset loading, model definition, training with hardware GPU timestamps, evaluation, predictions, and cross-device portability.

API level       Level 1: OaModule + autograd
Dataset         Fashion-MNIST IDX binary, 60k train / 10k test
Architecture    784 → 128 (ReLU) → 10 · 101,770 parameters
Optimizer       AdamW (lr=0.001, weight_decay=0.01)
Training        2,000 steps × batch 64
Test accuracy   83.2% (verified, held-out test set)
GPU throughput  244,473 samples/s (RTX 5090 Laptop, hardware timestamps)
Source          Tutorial/Ml/TutorialMnistClassifier.cpp

1. The Dataset

Fashion-MNIST contains 60,000 training images and 10,000 test images across 10 clothing categories. Each image is 28×28 grayscale pixels stored as raw uint8 values (0–255) in IDX binary format. Images stay as raw bytes on the host and are normalized to [0, 1] inside Forward(), which eliminates a separate preprocessing step.

Index  Label         Index  Label
─────  ───────────   ─────  ──────────
0      T-shirt/top   5      Sandal
1      Trouser       6      Shirt
2      Pullover      7      Sneaker
3      Dress         8      Bag
4      Coat          9      Ankle boot
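
The IDX container behind these files is simple enough to parse by hand. As a minimal sketch (IdxHeader and ParseIdxHeader are hypothetical names for illustration, not the tutorial's LoadMnistIDX()), the following reads the big-endian header of an in-memory IDX buffer. In the published IDX layout, the first two magic bytes are zero, the third is the element type (0x08 for unsigned byte), and the fourth is the number of dimensions, followed by one 32-bit big-endian size per dimension.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical helper type for illustration; not part of the OA API.
struct IdxHeader {
    std::uint8_t elementType = 0;     // 0x08 means unsigned byte
    std::vector<std::uint32_t> dims;  // e.g. {60000, 28, 28} for training images
    std::size_t dataOffset = 0;       // byte offset of the first element
};

// Reads a 32-bit big-endian integer starting at p.
inline std::uint32_t ReadBigEndianU32(const std::uint8_t* p) {
    return (std::uint32_t{p[0]} << 24) | (std::uint32_t{p[1]} << 16) |
           (std::uint32_t{p[2]} << 8) | std::uint32_t{p[3]};
}

// Parses the IDX header; returns std::nullopt if the buffer is malformed.
inline std::optional<IdxHeader> ParseIdxHeader(const std::vector<std::uint8_t>& buf) {
    if (buf.size() < 4) return std::nullopt;
    if (buf[0] != 0 || buf[1] != 0) return std::nullopt;  // first two magic bytes must be zero
    IdxHeader h;
    h.elementType = buf[2];
    const std::size_t rank = buf[3];
    h.dataOffset = 4 + 4 * rank;
    if (buf.size() < h.dataOffset) return std::nullopt;   // header truncated
    for (std::size_t d = 0; d < rank; ++d) {
        h.dims.push_back(ReadBigEndianU32(buf.data() + 4 + 4 * d));
    }
    return h;
}
```

For the training image file, the header is 16 bytes and the pixel data that follows can be handed to the GPU as-is, since normalization happens in Forward().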

2. The Model

Two-layer MLP: a hidden dense layer (784→128) with ReLU activation, followed by a logit head (128→10). OaFnMatrix::Scale inside Forward normalizes pixels to [0, 1] on the GPU.

Tutorialmnistclassifier.cpp

class OaMnistClassifier : public OaModule {
public:
    OaMnistClassifier() {
        Fc1_ = OaMakeSharedPtr<OaLinear>(784, 128);
        Fc2_ = OaMakeSharedPtr<OaLinear>(128, 10);
        RegisterModule("fc1", Fc1_);
        RegisterModule("fc2", Fc2_);
    }
    OaDeviceMatrix Forward(const OaDeviceMatrix& InX) override {
        auto x = OaFnMatrix::Scale(InX, 1.0f / 255.0f); // normalize [0,255] → [0,1]
        auto h = OaFnMatrix::Relu(Fc1_->Forward(x));
        return Fc2_->Forward(h);
    }
private:
    OaSharedPtr<OaLinear> Fc1_, Fc2_;
};

Layer                      Output Shape    Params
─────────────────────────  ──────────────  ───────
input  (Flatten 28×28)     [batch, 784]          0
dense  (784→128 + ReLU)    [batch, 128]    100,480
dense_1 (128→10)           [batch, 10]       1,290
─────────────────────────  ──────────────  ───────
Total trainable parameters                 101,770
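
The parameter totals in the table follow from the usual dense-layer formula: one weight per input/output pair plus one bias per output. The short check below is a sketch with a hypothetical helper name (DenseParamCount is not an OA function). It assumes each OaLinear carries a bias vector, which is consistent with the per-layer counts shown above.

```cpp
#include <cassert>
#include <cstdint>

// Trainable parameters of a fully connected layer: one weight per
// (input, output) pair plus one bias per output.
constexpr std::int64_t DenseParamCount(std::int64_t inFeatures, std::int64_t outFeatures) {
    return inFeatures * outFeatures + outFeatures;
}

constexpr std::int64_t kHiddenParams = DenseParamCount(784, 128);  // 100,480
constexpr std::int64_t kLogitParams  = DenseParamCount(128, 10);   //   1,290
constexpr std::int64_t kTotalParams  = kHiddenParams + kLogitParams;

// Compile-time check against the total in the layer summary.
static_assert(kTotalParams == 101770, "total should match the layer summary");
```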

3. Training Loop

OaFnTrain::StepAsync submits the current step to the GPU while the CPU records the next one. OaTrainTimer wraps vkCmdWriteTimestamp2 for sub-microsecond precision and is the source of all throughput numbers.

Tutorialmnistclassifier.cpp

OaFnGrad::SetMode(OaGradMode::Dynamic);
auto optimizer = OaMakeUniquePtr<OaAdamW>(model->AllParameterPtrs(), 0.001f);
auto& rt = *OaVkComputeEngine::GetGlobal();
OaTrainTimer trainTimer;
trainTimer.Init(rt, "mnist_training_step");
OaDeviceMatrix batchX, batchY;
for (OaI32 step = 0; step < 2000; ++step) {
    sampler.NextBatch(batchX, batchY); // 64 random samples
    auto result = OaFnTrain::StepAsync(rt, [&] {
        OaFnTrain::RetainInActiveScope(batchX);
        OaFnTrain::RetainInActiveScope(batchY);
        trainTimer.Begin(rt);
        optimizer->ZeroGrad();
        auto logits = model->Forward(batchX);
        auto loss = OaFnMatrix::CrossEntropyLoss(logits, batchY);
        OaFnGrad::Backward(loss);
        optimizer->Step();
        trainTimer.End(rt);
        return loss;
    });
    (void)OaFnTrain::FlushLastAsync(rt);
    float lossVal = result.Loss.Item();
    trainTimer.Commit(rt.Device, kBatch);
}
trainTimer.PrintSummary();
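
To make the loss value concrete, here is a CPU reference for mean softmax cross-entropy. It is a sketch, not the OA implementation: it assumes OaFnMatrix::CrossEntropyLoss takes raw logits plus integer class labels and averages over the batch, which the tutorial does not state explicitly. MeanCrossEntropy is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// CPU reference for mean softmax cross-entropy. logits is row-major
// [batch, numClasses]; labels holds one class index per row.
inline double MeanCrossEntropy(const std::vector<float>& logits,
                               const std::vector<int>& labels,
                               std::size_t numClasses) {
    assert(numClasses > 0 && !labels.empty());
    assert(logits.size() == labels.size() * numClasses);
    double total = 0.0;
    for (std::size_t i = 0; i < labels.size(); ++i) {
        assert(labels[i] >= 0 && static_cast<std::size_t>(labels[i]) < numClasses);
        const float* row = logits.data() + i * numClasses;
        // Subtract the row maximum before exponentiating for numerical stability.
        const float maxLogit = *std::max_element(row, row + numClasses);
        double sumExp = 0.0;
        for (std::size_t c = 0; c < numClasses; ++c) {
            sumExp += std::exp(static_cast<double>(row[c] - maxLogit));
        }
        // log softmax of the true class.
        const double logProb = (row[labels[i]] - maxLogit) - std::log(sumExp);
        total -= logProb;
    }
    return total / static_cast<double>(labels.size());
}
```

With 10 classes and uniform logits this returns ln(10) ≈ 2.3026, which is why an untrained classifier starts near the step 1 loss of 2.3311 reported in the training curve.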

Training Curve — RTX 5090 Laptop, Vulkan 1.4

Step   Train Loss   Train Acc   GPU ms     GPU sps    Elapsed
─────  ──────────   ─────────   ────────   ────────   ───────
1      2.3311       14.1%       warmup                0.00s
200    0.7967       71.9%       0.272ms    235,571    0.09s
400    0.7715       67.2%       0.262ms    244,499    0.16s
800    0.5351       81.2%       0.262ms    243,971    0.31s
1200   0.4249       84.4%       0.261ms    244,841    0.46s
2000   0.3811       84.4%       0.262ms    244,473    0.75s

GPU time/step: 0.262 ± 0.002 ms (p50 = 0.262 ms · p95 = 0.264 ms · p99 = 0.270 ms)
GPU throughput: 244,473 samples/s — hardware timestamps, ±0.5% run-to-run
Wall throughput: 171,974 samples/s — includes CPU submit and Vulkan sync overhead
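
The throughput figures can be reproduced from the timing numbers. vkCmdWriteTimestamp2 records raw tick counts, and the Vulkan limit VkPhysicalDeviceLimits::timestampPeriod gives the number of nanoseconds per tick. The sketch below shows that conversion and the samples-per-second arithmetic; the function names are hypothetical and this is not how OaTrainTimer is necessarily implemented.

```cpp
#include <cassert>
#include <cstdint>

// Converts two raw GPU timestamp values to elapsed milliseconds.
// timestampPeriodNs is nanoseconds per tick (VkPhysicalDeviceLimits::timestampPeriod).
inline double TicksToMilliseconds(std::uint64_t beginTicks, std::uint64_t endTicks,
                                  double timestampPeriodNs) {
    assert(endTicks >= beginTicks);
    return static_cast<double>(endTicks - beginTicks) * timestampPeriodNs * 1e-6;
}

// Samples processed per second for one step of batchSize samples.
inline double SamplesPerSecond(int batchSize, double stepMilliseconds) {
    assert(stepMilliseconds > 0.0);
    return static_cast<double>(batchSize) / (stepMilliseconds * 1e-3);
}
```

Plugging in batch 64 and the rounded 0.262 ms step time gives roughly 244,275 samples/s, within 0.1% of the reported 244,473; the small gap is presumably because the reported figure uses the unrounded step time.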

4. Evaluate

Tutorialmnistclassifier.cpp

// Evaluate on the full 10,000-image test set.
// Note: the loop reads whole batches, so numTest must be a multiple of kEvalBatch.
OaU32 correct = 0;
for (OaU32 i = 0; i < numTest; i += kEvalBatch) {
    auto x = OaFnMatrix::FromBytes(
        OaSpan<const OaU8>(testImages.data() + i * 784, kEvalBatch * 784),
        OaShape2D(kEvalBatch, 784));
    auto preds = Predict(*model, x);
    for (OaI32 j = 0; j < kEvalBatch; ++j) {
        if (preds[j].ClassIdx == testLabels[i + j]) ++correct;
    }
}
float testAcc = 100.0f * correct / numTest;
printf("Test accuracy: %.2f%%\n", testAcc);
// → Test accuracy: 83.23%

Test accuracy: 83.23% on 10,000 held-out images never seen during training.
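
The evaluation loop above always reads a full batch, which is safe only when the test-set size is a multiple of the batch size (10,000 divides evenly by 100 or 1,000, but not by 64 or 256). A minimal host-side sketch of the same accuracy computation with the final batch clamped is shown below; AccuracyPercent is a hypothetical helper that works on plain label vectors rather than OA types.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Counts matches between predicted and true labels, walking the data in
// chunks of batchSize and clamping the final chunk to the remaining samples.
inline double AccuracyPercent(const std::vector<std::uint8_t>& predicted,
                              const std::vector<std::uint8_t>& actual,
                              std::size_t batchSize) {
    assert(predicted.size() == actual.size());
    assert(batchSize > 0 && !actual.empty());
    std::size_t correct = 0;
    for (std::size_t start = 0; start < actual.size(); start += batchSize) {
        // The last chunk may hold fewer than batchSize samples.
        const std::size_t count = std::min(batchSize, actual.size() - start);
        for (std::size_t j = 0; j < count; ++j) {
            if (predicted[start + j] == actual[start + j]) ++correct;
        }
    }
    return 100.0 * static_cast<double>(correct) / static_cast<double>(actual.size());
}
```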

5. Predictions

Tutorialmnistclassifier.cpp

auto logits = model->Forward(testX);
auto probs = OaFnMatrix::Softmax(logits, -1);
// host argmax + confidence per sample

#   Actual       Predicted    Confidence
──  ──────────   ──────────   ──────────
0   Ankle boot   Ankle boot   60.4%
1   Pullover     Pullover     96.8%
2   Trouser      Trouser      100.0%
4   Shirt        Shirt        50.3%
6   Coat         Coat         84.0%
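
The "host argmax + confidence" step is not shown in the snippet. One way it could look, as a self-contained sketch on a single row of logits (Prediction and ArgmaxWithConfidence are hypothetical names, and the tutorial's own Predict() may differ), is:

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string_view>
#include <vector>

// Class names in label order, taken from the dataset table.
constexpr std::array<std::string_view, 10> kClassNames = {
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"};

// Hypothetical result type for illustration.
struct Prediction {
    int classIdx;
    float confidence;  // softmax probability of the winning class
};

// Returns the index of the largest logit and its softmax probability.
inline Prediction ArgmaxWithConfidence(const std::vector<float>& logits) {
    assert(!logits.empty());
    const auto maxIt = std::max_element(logits.begin(), logits.end());
    double sumExp = 0.0;
    for (float z : logits) sumExp += std::exp(static_cast<double>(z - *maxIt));
    // exp(max - max) == 1, so the winning probability is 1 / sumExp.
    return {static_cast<int>(maxIt - logits.begin()), static_cast<float>(1.0 / sumExp)};
}
```

Because softmax preserves ordering, the argmax can be taken on logits or on probabilities with the same result; only the confidence value needs the normalization.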

6. Gradient Mode Comparison

Mode       Test Acc   Wall time   Wall sps   GPU sps    GPU Speedup
────────   ────────   ─────────   ────────   ────────   ───────────
Dynamic    82.75%     0.74s       172,267    245,343    1.00×
Compiled   82.98%     0.78s       163,433    271,672    1.11×
Auto       82.96%     0.73s       174,182    248,193    1.01×

Compiled mode runs 1.11× faster on the GPU but about 5% slower wall-clock: each minibatch allocates new activation buffer handles, which causes graph recompiles instead of replays. Hardware timestamps expose this split; judged by wall-clock time alone, Compiled would look like the slowest mode.
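
The two ratios quoted here come straight from the throughput columns. As a small worked check (the constants are copied from the mode comparison table; Speedup is a hypothetical helper):

```cpp
#include <cassert>

// Ratio of a candidate throughput to a baseline throughput (same units).
constexpr double Speedup(double candidate, double baseline) {
    return candidate / baseline;
}

// Samples per second, from the Dynamic and Compiled rows of the table.
constexpr double kDynamicGpuSps   = 245343.0;
constexpr double kCompiledGpuSps  = 271672.0;
constexpr double kDynamicWallSps  = 172267.0;
constexpr double kCompiledWallSps = 163433.0;

constexpr double kGpuSpeedup  = Speedup(kCompiledGpuSps, kDynamicGpuSps);    // about 1.107
constexpr double kWallSpeedup = Speedup(kCompiledWallSps, kDynamicWallSps);  // about 0.949
```

A GPU ratio above 1 combined with a wall ratio below 1 is the signature of host-side overhead: the kernels themselves got faster, but the time between submissions grew.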

7. Cross-Device Portability

Same binary, same results — no source changes, no recompilation:

Metric                 RTX 5090 Laptop   Intel Arc (ARL) iGPU
────────────────────   ───────────────   ────────────────────
Test accuracy          83.23%            83.12%
GPU time/step          0.262 ms          0.848 ms
GPU throughput         244,473 sps       75,499 sps
Wall time (2k steps)   0.75s             2.72s

Run.sh

# NVIDIA RTX 5090 (default discrete)
./Bin/Release/Tutorial/tutorial_mnist_classifier
# Intel or AMD iGPU
OA_DEVICE=integrated ./Bin/Release/Tutorial/tutorial_mnist_classifier
# CPU software Vulkan (lavapipe — CI without GPU)
OA_DEVICE=cpu ./Bin/Release/Tutorial/tutorial_mnist_classifier
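
For readers curious how such a switch is typically wired up, the sketch below maps the OA_DEVICE values used above to a device preference. Everything here is hypothetical: the enum, the function names, and the fallback behavior for unset or unrecognized values are assumptions for illustration, since the engine's actual selection logic is not shown in this tutorial.

```cpp
#include <cassert>
#include <cstdlib>
#include <string_view>

// Hypothetical preference enum; not an OA type.
enum class DevicePreference { Discrete, Integrated, Cpu };

// Maps the value of OA_DEVICE (nullptr when unset) to a preference.
// Assumption: unset or unrecognized values fall back to the discrete GPU.
inline DevicePreference ParseDevicePreference(const char* value) {
    if (value == nullptr) return DevicePreference::Discrete;
    const std::string_view v(value);
    if (v == "integrated") return DevicePreference::Integrated;
    if (v == "cpu") return DevicePreference::Cpu;
    return DevicePreference::Discrete;
}

// Reads OA_DEVICE from the process environment.
inline DevicePreference DevicePreferenceFromEnv() {
    return ParseDevicePreference(std::getenv("OA_DEVICE"));
}
```

Keeping the parsing separate from the std::getenv call makes the mapping easy to test without touching the environment.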

Build & Run

Build.sh

cmake --preset release
ninja -C Build/Release tutorial_mnist_classifier
# Run
./Bin/Release/Tutorial/tutorial_mnist_classifier
# Custom dataset path
OA_MNIST_DATA=/path/to/fashion_mnist ./Bin/Release/Tutorial/tutorial_mnist_classifier

Full Source

The complete tutorial, including the dataset loader, batch sampler, evaluation, predictions, and mode comparison, is in Tutorial/Ml/TutorialMnistClassifier.cpp. An excerpt follows:

Tutorialmnistclassifier.cpp

// ═══════════════════════════════════════════════════════════════════════════
// OA Tutorial: Fashion-MNIST Image Classification — Real Dataset
// Level 1 API — OaModule + OaLinear + OaAdamW + OaFnGrad::Backward()
// ═══════════════════════════════════════════════════════════════════════════
//
// Parallel structure to the TensorFlow Keras classification tutorial:
// https://www.tensorflow.org/tutorials/keras/classification
//
// TF Keras                       OA C++
// ─────────────────────────────  ─────────────────────────────────────
// fashion_mnist.load_data()      LoadMnistIDX() (inline IDX parser)
// train_images / 255.0           OaFnMatrix::Scale(x, 1/255) in Forward
// tf.keras.Sequential([...])     class OaMnistClassifier : OaModule
// layers.Dense(128, 'relu')      OaLinear(784, 128) + Relu()
// layers.Dense(10)               OaLinear(128, 10)
// model.compile('adam', ...)     OaAdamW(params, lr=0.001)
// model.fit(...)                 training loop + OaFnGrad::Backward()
// model.predict(x_test)          OaFnMatrix::Softmax() + host argmax
//
// Source: oa/Tutorial/Ml/TutorialMnistClassifier.cpp
// ═══════════════════════════════════════════════════════════════════════════
#include <Oa/Ml.h>
#include <Oa/Ml/Metrics.h>
class OaMnistClassifier : public OaModule {
public:
    OaMnistClassifier() {
        Fc1_ = OaMakeSharedPtr<OaLinear>(784, 128);
        Fc2_ = OaMakeSharedPtr<OaLinear>(128, 10);
        RegisterModule("fc1", Fc1_);
        RegisterModule("fc2", Fc2_);
    }
    OaDeviceMatrix Forward(const OaDeviceMatrix& InX) override {
        auto x = OaFnMatrix::Scale(InX, 1.0f / 255.0f);
        auto h = OaFnMatrix::Relu(Fc1_->Forward(x));
        return Fc2_->Forward(h);
    }
private:
    OaSharedPtr<OaLinear> Fc1_, Fc2_;
};
// See full source at oa/Tutorial/Ml/TutorialMnistClassifier.cpp