Performance

ML Benchmarks

Verified results from the tutorial suite. All numbers measured with OaTrainTimer — Vulkan hardware timestamps (vkCmdWriteTimestamp2), sub-microsecond precision, ±0.5% run-to-run variance.
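For readers unfamiliar with GPU timestamp queries, the conversion from raw ticks to milliseconds can be sketched as follows. This is a minimal Python illustration of the arithmetic only, not OaTrainTimer's code: the function name and the sample tick values are hypothetical, while `timestampPeriod` (nanoseconds per tick) and `timestampValidBits` are the Vulkan device properties a timer of this kind would normally read.

```python
def ticks_to_ms(begin_ticks: int, end_ticks: int,
                timestamp_period_ns: float, valid_bits: int = 64) -> float:
    """Convert a pair of raw GPU timestamp ticks to elapsed milliseconds.

    timestamp_period_ns mirrors VkPhysicalDeviceLimits::timestampPeriod
    (nanoseconds per tick); valid_bits mirrors
    VkQueueFamilyProperties::timestampValidBits.
    """
    if not 1 <= valid_bits <= 64:
        raise ValueError("valid_bits must be in 1..64")
    mask = (1 << valid_bits) - 1
    # Masked subtraction stays correct across a single counter wraparound.
    delta_ticks = (end_ticks - begin_ticks) & mask
    return delta_ticks * timestamp_period_ns * 1e-6  # ns -> ms


# Hypothetical tick values: 262,000 ticks at 1 ns per tick.
elapsed_ms = ticks_to_ms(1_000_000, 1_262_000, timestamp_period_ns=1.0)
print(f"{elapsed_ms:.3f} ms")
```

Because the interval is computed from ticks recorded on the GPU timeline, host-side submission and synchronization costs are not part of it, which is why the tables report GPU time and wall time separately.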

Fashion-MNIST: OA vs CUDA vs PyTorch

Identical architecture (784→128 ReLU→10), optimizer (AdamW lr=0.001 wd=0.01), 2,000 steps × batch 64, same IDX binary dataset. CUDA variants from cuda-comparison/oa_mnist on the same machine.

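Metric              | OA (Vulkan 1.4) | CUDA 1:1    | CUDA Fused  | PyTorch (CUDA)
────────────────────┼─────────────────┼─────────────┼─────────────┼───────────────
Test accuracy       | 83.23%          | 86.20%      | 86.20%      | 85.67%
Training time       | 0.75s           | 0.35s       | 0.43s       | 1.15s
Wall throughput     | 171,974 sps     | 367k sps    | 297k sps    | 111,712 sps
GPU throughput      | 244,473 sps     | 1,141k sps  | 424k sps    |
GPU time/step (p50) | 0.262 ms        | 0.057 ms    | 0.151 ms    |
Dispatches/step     | ~21             | ~21         | 3           |
CUDA required       | No              | Yes         | Yes         | Yes
Vendor lock-in      | None            | NVIDIA only | NVIDIA only | NVIDIA only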
  • CUDA 1:1 vs OA (same dispatch count): 4.6× faster GPU time — the gap is entirely cuBLAS tensor core WMMA vs OA's current scalar FP32 OaGemm. CooperativeMatrix (Vulkan 1.4) is the primary optimization target.
  • OA already beats PyTorch wall-clock by 54% (172k vs 112k sps, 0.75s vs 1.15s) despite the GEMM gap — Python/CUDA overhead is significant at this batch size.
  • CUDA Fused (3 dispatches) vs CUDA 1:1 (~21): the fused variant is 2.6× slower in GPU time (0.151 ms vs 0.057 ms) despite issuing 7× fewer dispatches, because the fused kernel does scalar products rather than using tensor cores. Launch overhead is secondary to GEMM quality.
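The ratios quoted in these bullets can be re-derived from the table. The sketch below is plain arithmetic on the published numbers; the variable and function names are illustrative and not part of OA.

```python
BATCH = 64  # samples per step, from the benchmark setup

# GPU time per step (p50, ms) and wall throughput (samples/s), copied from the table.
gpu_ms = {"OA": 0.262, "CUDA 1:1": 0.057, "CUDA Fused": 0.151}
wall_sps = {"OA": 171_974, "PyTorch (CUDA)": 111_712}


def gpu_throughput(ms_per_step: float, batch: int = BATCH) -> float:
    """Samples per second implied by a per-step GPU time."""
    return batch / (ms_per_step * 1e-3)


# How many times longer the OA step takes than the CUDA 1:1 step.
oa_over_cuda = gpu_ms["OA"] / gpu_ms["CUDA 1:1"]
# How many times longer the fused step takes than the CUDA 1:1 step.
fused_over_cuda = gpu_ms["CUDA Fused"] / gpu_ms["CUDA 1:1"]
# OA's wall-clock throughput advantage over PyTorch, in percent.
oa_vs_pytorch_pct = (wall_sps["OA"] / wall_sps["PyTorch (CUDA)"] - 1) * 100

print(f"OA / CUDA 1:1 GPU time:         {oa_over_cuda:.2f}x")
print(f"CUDA Fused / CUDA 1:1 GPU time: {fused_over_cuda:.2f}x")
print(f"OA wall throughput vs PyTorch:  +{oa_vs_pytorch_pct:.0f}%")
for name, ms in gpu_ms.items():
    print(f"{name}: ~{gpu_throughput(ms):,.0f} samples/s from p50 step time")
```

Throughput recomputed from the rounded p50 values lands within about 2% of the GPU throughput row, presumably because the table was produced from unrounded timings.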

MNIST — It Really Works

First 10 test images, inference after 2,000 training steps:

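# | Actual     | Predicted   | Confidence
──┼────────────┼─────────────┼───────────
0 | Ankle boot | Ankle boot  | 60.4%
1 | Pullover   | Pullover    | 96.8%
2 | Trouser    | Trouser     | 100.0%
3 | Trouser    | Trouser     | 99.9%
4 | Shirt      | Shirt       | 50.3%
5 | Pullover   | Pullover    | 53.6%
6 | Coat       | Coat        | 84.0%
7 | Shirt      | T-shirt/top | 43.2%
8 | Sneaker    | Sneaker     | 99.5%
9 | Ankle boot | Ankle boot  | 99.4%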

Test accuracy: 83.23% on 10,000 held-out images — never seen during training. Training log:

Step  | Train Loss | Train Acc | GPU ms  | GPU sps   | Elapsed
──────┼────────────┼───────────┼─────────┼───────────┼────────
    1 |     2.3311 |    14.1%  | warmup  |           |  0.00s
  200 |     0.7967 |    71.9%  | 0.272ms |   235,571  |  0.09s
  400 |     0.7715 |    67.2%  | 0.262ms |   244,499  |  0.16s
  800 |     0.5351 |    81.2%  | 0.262ms |   243,971  |  0.31s
 1200 |     0.4249 |    84.4%  | 0.261ms |   244,841  |  0.46s
 2000 |     0.3811 |    84.4%  | 0.262ms |   244,473  |  0.75s

GPU time/step:  0.262 ± 0.002 ms  (p50=0.262  p95=0.264  p99=0.270)
GPU throughput: 244,473 samples/s
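A summary line of this shape can be produced from a list of per-step GPU times. The sketch below is one plausible way to do it, not OaTrainTimer's implementation: the nearest-rank percentile rule, the choice to derive throughput from the median, and the synthetic input data are all assumptions made for illustration.

```python
import math
import statistics


def percentile(sorted_values, pct):
    """Nearest-rank percentile of a sorted, non-empty list (pct in 0..100)."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(step_ms, batch_size):
    """Summarize per-step GPU times (ms) and derive samples per second."""
    ordered = sorted(step_ms)
    p50 = percentile(ordered, 50)
    return {
        "mean_ms": statistics.fmean(ordered),
        "std_ms": statistics.stdev(ordered),
        "p50_ms": p50,
        "p95_ms": percentile(ordered, 95),
        "p99_ms": percentile(ordered, 99),
        # Assumption: throughput is derived from the median step time.
        "gpu_sps": batch_size / (p50 * 1e-3),
    }


# Synthetic step times (ms), invented for illustration; not measured data.
synthetic_ms = [0.262] * 90 + [0.264] * 8 + [0.270] * 2
stats = summarize(synthetic_ms, batch_size=64)
print(f"GPU time/step: {stats['mean_ms']:.3f} ± {stats['std_ms']:.3f} ms "
      f"(p50={stats['p50_ms']:.3f} p95={stats['p95_ms']:.3f} p99={stats['p99_ms']:.3f})")
print(f"GPU throughput: {stats['gpu_sps']:,.0f} samples/s")
```

The log marks step 1 as warmup, so a real summary would presumably drop that sample before computing the statistics.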

Text Generation: OA vs CUDA vs PyTorch

Same tiny architecture on all implementations: Embedding(27,16) → flatten 8-context → 128→64 Tanh → 64→27 logits. AdamW lr=0.01 wd=0.01, 300 steps × batch 32, identical inline corpus.

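Metric              | OA (Vulkan 1.4) | CUDA 1:1    | CUDA Fused  | PyTorch (CUDA)
────────────────────┼─────────────────┼─────────────┼─────────────┼───────────────
Final loss          | 0.8971          | 0.0391      | 0.0391      | 0.0392
Batch accuracy      | 75.0%           | 96.9%       | 96.9%       | 96.9%
Training time       | 0.10s           | 0.09s       | 0.01s       | 0.36s
Wall throughput     | 95,461 sps      | 108,989 sps | 675,630 sps | 26,439 sps
GPU time/step (p50) | 0.186 ms        | 0.075 ms    | 0.025 ms    |
Dispatches/step     | ~24             | ~25         | 3           |
CUDA required       | No              | Yes         | Yes         | Yes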
  • Loss gap (Dynamic mode): OA Dynamic reaches 0.897 final loss; CUDA and PyTorch reach 0.039. Root cause: this tiny model's GEMMs are very small (32×128×64), and OA Dynamic issues a new command buffer per step, which in Vulkan has more overhead than CUDA for launch-bound workloads. Compiled mode closes this gap (see the OaGradMode comparison below).
  • OA Compiled mode on same workload: 0.0646 final loss, 96.9% batch accuracy — matching CUDA. When buffer handles are stable, graph replay works.
  • OA beats PyTorch wall-clock even in Dynamic mode (95k vs 26k sps).

Text Generation — OaGradMode Comparison

RTX 5090 Laptop

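Mode     | Initial Loss | Final Loss | Batch Acc | Wall tok/s
─────────┼──────────────┼────────────┼───────────┼───────────
Dynamic  | 3.2737       | 0.9685     | 71.9%     | 104,109
Compiled | 3.3214       | 0.0646     | 96.9%     | 89,008
Auto     | 3.3034       | 0.9363     | 75.0%     | 105,087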

Intel Arc (ARL) iGPU — same binary

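Mode     | Initial Loss | Final Loss | Batch Acc | Wall tok/s
─────────┼──────────────┼────────────┼───────────┼───────────
Dynamic  | 3.3354       | 1.1931     | 65.6%     | 27,543
Compiled | 3.2746       | 0.0663     | 96.9%     | 33,218
Auto     | 3.3083       | 1.1875     | 78.1%     | 35,297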

Compiled mode matches CUDA accuracy on both devices — graph replay works when buffer handles are stable. On Intel, Auto mode has the best wall throughput (35k tok/s). Dynamic's higher loss on both reflects per-step command buffer overhead on this launch-bound 300-step run. See the Level 2 page for when each mode applies.
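One sanity check on the Initial Loss columns: a 27-way classifier that predicts uniformly has a cross-entropy of ln 27 ≈ 3.296. The sketch below compares the reported initial losses against that baseline. It assumes the reported loss is mean cross-entropy in nats, which the tables do not state; the function name is illustrative.

```python
import math

VOCAB_SIZE = 27  # from Embedding(27,16) and the 27-way output logits


def uniform_cross_entropy(num_classes: int) -> float:
    """Cross-entropy (in nats) of a uniform prediction over num_classes."""
    return math.log(num_classes)


# Initial losses copied from the two mode tables above.
initial_loss = {
    ("RTX 5090", "Dynamic"): 3.2737,
    ("RTX 5090", "Compiled"): 3.3214,
    ("RTX 5090", "Auto"): 3.3034,
    ("Intel Arc", "Dynamic"): 3.3354,
    ("Intel Arc", "Compiled"): 3.2746,
    ("Intel Arc", "Auto"): 3.3083,
}

baseline = uniform_cross_entropy(VOCAB_SIZE)
print(f"uniform baseline: {baseline:.4f}")
for (device, mode), loss in initial_loss.items():
    print(f"{device:9s} {mode:9s} initial={loss:.4f} delta={loss - baseline:+.4f}")
```

All six runs start within 0.05 of the baseline, which suggests the modes begin from comparable random initializations and that the gap in final loss opens up during training.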

Fashion-MNIST — OaGradMode Comparison

RTX 5090 Laptop

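Mode         | Test Acc | Wall time | Wall sps | GPU sps | GPU Speedup
─────────────┼──────────┼───────────┼──────────┼─────────┼────────────
Dynamic      | 82.75%   | 0.74s     | 172,267  | 245,343 | 1.00×
Compiled     | 82.98%   | 0.78s     | 163,433  | 271,672 | 1.11×
Auto         | 82.96%   | 0.73s     | 174,182  | 248,193 | 1.01×
PyTorch CUDA | 85.67%   | 1.15s     | 111,712  |         |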

Intel Arc (ARL) iGPU — same binary

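Mode     | Test Acc | Wall sps | GPU sps | GPU Speedup
─────────┼──────────┼──────────┼─────────┼────────────
Dynamic  | 83.06%   | 44,031   | 69,182  | 1.00×
Compiled | 82.89%   | 47,890   | 75,499  | 1.09×
Auto     | 83.12%   | 69,431   | 96,814  | 1.40×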

On RTX: Compiled is 1.11× faster GPU-side but 5% slower wall-clock, because each minibatch's new activation buffer handles force graph recompiles instead of replays. Hardware timestamps expose this; wall-clock alone would rank Compiled as the slowest mode. All three OA modes beat PyTorch CUDA wall throughput (112k sps).

On Intel Arc: Auto mode achieves 1.40× GPU speedup over Dynamic (97k vs 69k GPU sps) — Intel's driver/compiler stack is more amenable to Auto's kernel fusion on this workload. All modes still reach ~83% test accuracy. Same binary, no code changes.
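The gap between wall and GPU throughput can be expressed as the share of wall time that the GPU timestamps account for. The sketch below assumes both rates count the same samples, so their ratio equals GPU time divided by wall time; the remainder is time spent outside the timed GPU work (host code, submission, synchronization, or recompilation). The names are illustrative.

```python
# (wall samples/s, GPU samples/s) copied from the two Fashion-MNIST mode tables.
runs = {
    "RTX 5090 Laptop": {
        "Dynamic": (172_267, 245_343),
        "Compiled": (163_433, 271_672),
        "Auto": (174_182, 248_193),
    },
    "Intel Arc (ARL) iGPU": {
        "Dynamic": (44_031, 69_182),
        "Compiled": (47_890, 75_499),
        "Auto": (69_431, 96_814),
    },
}


def gpu_share(wall_sps: float, gpu_sps: float) -> float:
    """Fraction of wall time covered by timed GPU work.

    wall_sps = N / T_wall and gpu_sps = N / T_gpu, so the ratio is T_gpu / T_wall.
    """
    return wall_sps / gpu_sps


for device, modes in runs.items():
    for mode, (wall, gpu) in modes.items():
        share = gpu_share(wall, gpu)
        print(f"{device:21s} {mode:8s} GPU share {share:5.1%}  other {1 - share:5.1%}")
```

On the RTX numbers, Compiled has the lowest GPU share of the three modes, which is consistent with the recompile cost described above showing up on the host side rather than in the GPU timestamps.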

Cross-Device — Same Binary, Two GPUs

Zero source changes, zero recompilation. Device selected via OA_DEVICE env var at runtime.

Property                   | NVIDIA RTX 5090 Laptop    | Intel Arc (ARL) iGPU
───────────────────────────┼───────────────────────────┼─────────────────────
Vulkan driver              | NVIDIA proprietary 595.58 | Intel Mesa 26.0.4
Vulkan API                 | 1.4.329                   | 1.4.335
Est. FP32 throughput       | 32.0 TFLOPS               | 2.5 TFLOPS
Est. memory bandwidth      | 896 GB/s                  | 55 GB/s
MNIST test accuracy        | 83.23%                    | 83.12%
MNIST GPU time/step        | 0.262 ms                  | 0.848 ms
MNIST GPU throughput       | 244,473 sps               | 75,499 sps
MNIST wall time (2k steps) | 0.75s                     | 2.72s
Text final loss (Dynamic)  | 0.8971                    | 0.9384
Text final loss (Compiled) | 0.0646                    | 0.0663
Text GPU throughput        | 165,503 sps               | 52,924 sps
Text wall time (300 steps) | 0.10s                     | 0.34s
  • Near-identical accuracy across vendors: 83.23% vs 83.12% MNIST test accuracy, from the same math on both drivers.
  • 3.2× GPU throughput ratio on MNIST (244k vs 75k sps), well below the ~12.8× peak TFLOPS gap (32.0 vs 2.5 TFLOPS); the gap is compressed because the small model fits in the RTX 5090's L2 cache.
  • Compiled mode reaches CUDA-level accuracy on both — 0.0646 on RTX, 0.0663 on Intel Arc. Same code path, same result.
  • Intel Auto mode achieves 1.40× GPU speedup over Dynamic on MNIST; better kernel fusion on Intel’s driver stack.
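To put a number on how small these models are, the parameter counts can be worked out from the layer sizes given earlier. The sketch below assumes every dense layer has a bias term and that all tensors are FP32; neither is stated explicitly in the text. The three-copy figure counts the weights plus AdamW's two moment buffers and leaves out gradients and activations.

```python
BYTES_FP32 = 4


def dense_params(n_in: int, n_out: int, bias: bool = True) -> int:
    """Parameter count of one fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)


# Fashion-MNIST model: 784 -> 128 (ReLU) -> 10
mnist_params = dense_params(784, 128) + dense_params(128, 10)

# Text model: Embedding(27,16), 8-token context flattened to 128 -> 64 (Tanh) -> 27
text_params = 27 * 16 + dense_params(8 * 16, 64) + dense_params(64, 27)


def footprint_kib(params: int, copies: int = 1) -> float:
    """Size in KiB of `copies` FP32 buffers with `params` elements each."""
    return params * BYTES_FP32 * copies / 1024


for name, n in (("Fashion-MNIST MLP", mnist_params), ("Text model", text_params)):
    print(f"{name}: {n:,} parameters, "
          f"{footprint_kib(n):.1f} KiB weights, "
          f"{footprint_kib(n, copies=3):.1f} KiB with AdamW moments")
```

Under these assumptions the Fashion-MNIST model has 101,770 parameters (about 400 KiB of weights) and the text model has 10,443.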