Performance

Benchmarks

Cross-vendor, cross-stack profiling benchmarks. Measure what matters on the hardware you have.

Matrix Multiplication Benchmark

Tutorial: Matrix Multiplication, calling OaFnMatrix::MatMul through OaContext. All numbers are wall-clock times that include the full public API path (submit + sync overhead). Methodology: 50 timed iterations after 5 warmup runs on an RTX 5090 Laptop GPU.
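The methodology above (warmup runs, then timed iterations measured on the wall clock) can be sketched as a small harness. This is an illustrative sketch only: `run_once` is a hypothetical callable standing in for one full submit + sync round trip, and the source does not say which statistic the tables report, so the sketch returns mean, median, and minimum.

```python
import statistics
import time


def benchmark(run_once, warmup=5, iterations=50):
    """Time run_once() and return per-call wall-clock samples in milliseconds.

    run_once is a hypothetical callable that performs one full
    submit + sync round trip, so each sample includes API overhead.
    """
    # Warmup runs are executed but not recorded.
    for _ in range(warmup):
        run_once()

    samples_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_once()
        samples_ms.append((time.perf_counter() - start) * 1e3)
    return samples_ms


def summarize(samples_ms):
    """Return (mean, median, min) of the samples, all in milliseconds."""
    return (
        statistics.mean(samples_ms),
        statistics.median(samples_ms),
        min(samples_ms),
    )


if __name__ == "__main__":
    # Stand-in workload; a real run would launch the matmul and wait for it.
    def fake_matmul():
        sum(i * i for i in range(10_000))

    samples = benchmark(fake_matmul)
    mean_ms, median_ms, best_ms = summarize(samples)
    print(f"{len(samples)} samples: mean {mean_ms:.3f} ms, "
          f"median {median_ms:.3f} ms, min {best_ms:.3f} ms")
```

Timing around the whole call, rather than around the GPU kernel alone, is what makes the small-shape rows sensitive to submission latency.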

Single-Dispatch Throughput

Shape	M	N	K	Wall ms	OA BF16 (GFLOP/s)	OA FP32 (GFLOP/s)	Ref TF32 (GFLOP/s)	OA BF16 / Ref TF32
square-512	512	512	512	0.082	3,280	2,582	32,027	10.2%
square-1024	1024	1024	1024	0.138	15,548	6,215	42,142	36.9%
square-2048	2048	2048	2048	0.897	19,159	10,470	38,650	49.6%
tall-skinny	4096	128	1024	0.153	7,004	4,786	37,377	18.7%
short-wide	128	4096	1024	0.135	7,958	5,016	39,721	20.0%
gemv-decode	1	4096	4096	0.903	37	49	420	8.8%

Device theoretical peak: 64 TFLOPS (BF16 tensor cores). Throughput climbs with problem size as fixed per-dispatch overhead amortizes, peaking near 19.2 TFLOP/s at 2048³.
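As a worked check, throughput can be derived from the wall time using the standard 2·M·N·K FLOP count for a matrix multiply. The sketch below assumes the Wall ms column belongs to the BF16 path; under that assumption the results agree with the OA BF16 column to within about 1%, which is consistent with the wall time being rounded to three decimals. The function names are illustrative, not part of the library.

```python
def matmul_flops(m, n, k):
    """FLOPs for C[m,n] = A[m,k] @ B[k,n]: one multiply and one add per term."""
    return 2 * m * n * k


def gflops(m, n, k, wall_ms):
    """Achieved throughput in GFLOP/s for one matmul that took wall_ms."""
    seconds = wall_ms * 1e-3
    return matmul_flops(m, n, k) / seconds / 1e9


if __name__ == "__main__":
    # (shape, M, N, K, wall ms) copied from the single-dispatch table.
    rows = [
        ("square-512", 512, 512, 512, 0.082),
        ("square-1024", 1024, 1024, 1024, 0.138),
        ("square-2048", 2048, 2048, 2048, 0.897),
        ("tall-skinny", 4096, 128, 1024, 0.153),
        ("short-wide", 128, 4096, 1024, 0.135),
        ("gemv-decode", 1, 4096, 4096, 0.903),
    ]
    for name, m, n, k, wall_ms in rows:
        print(f"{name:12s} {gflops(m, n, k, wall_ms):10.1f} GFLOP/s")
```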

Batch-Dispatch Throughput (4 ops per submit)

Shape	M	N	K	OA BF16 (GFLOP/s)	OA FP32 (GFLOP/s)	Ref TF32, single dispatch (GFLOP/s)	OA BF16 Batch / Ref Single
square-512	512	512	512	34,031	26,888	35,225	97%
square-1024	1024	1024	1024	77,603	43,308	81,399	95%
tall-skinny	2048	128	1024	34,978	28,165	74,599	47%

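Batching amortizes CPU→GPU submission overhead across the ops in one submit. For small ops where GPU time ≈ submit latency, batching provides a 4–8× speedup over single dispatch, and sometimes more: in the tables above, OA BF16 throughput rises from 15,548 to 77,603 GFLOP/s at 1024³ (about 5×) and from 3,280 to 34,031 GFLOP/s at 512³ (about 10×).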
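The speedup can be read directly off the two tables by dividing batched throughput by single-dispatch throughput. The sketch below does that for the shapes whose M, N, K match in both tables; tall-skinny is left out because it is listed with M = 4096 in the single-dispatch table and M = 2048 in the batch table. The 1024³ ratios (about 5× for BF16, about 7× for FP32) fall inside the quoted range, while 512³ comes out near 10×.

```python
# Throughput in GFLOP/s copied from the single-dispatch and batch tables.
# Only shapes whose M, N, K match in both tables are included.
SINGLE = {
    "square-512": {"bf16": 3280, "fp32": 2582},
    "square-1024": {"bf16": 15548, "fp32": 6215},
}
BATCH = {
    "square-512": {"bf16": 34031, "fp32": 26888},
    "square-1024": {"bf16": 77603, "fp32": 43308},
}


def batch_speedup(shape, precision):
    """Ratio of batched to single-dispatch throughput for one table cell."""
    return BATCH[shape][precision] / SINGLE[shape][precision]


if __name__ == "__main__":
    for shape in SINGLE:
        for precision in ("bf16", "fp32"):
            ratio = batch_speedup(shape, precision)
            print(f"{shape:12s} {precision}: {ratio:.1f}x")
```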

Cross-Device Portability

	RTX 5090 Laptop	Intel ARL iGPU
Precision path	BF16 CoopMat (tensor cores)	FP32 tiled/naive
norm_err (2048³)	~5e-3	0.00 (bit-exact)
Throughput (2048³)	19.2 TFLOP/s	259 GFLOP/s
Theoretical peak	64 TFLOPS BF16	2.5 TFLOPS FP32
% of peak (2048³)	30.0%	10.4%

Same binary and same source on both devices; the device is selected at runtime via OA_DEVICE=integrated. The Intel iGPU runs the FP32 fallback path at 259 GFLOP/s (10.4% of its 2.5 TFLOPS theoretical peak).
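The "% of peak" row is achieved throughput divided by theoretical peak, with both in the same unit. A minimal sketch of that arithmetic, using only the figures from the table (the function name is illustrative):

```python
def percent_of_peak(achieved_gflops, peak_tflops):
    """Achieved throughput as a percentage of the device's theoretical peak."""
    peak_gflops = peak_tflops * 1e3
    return 100.0 * achieved_gflops / peak_gflops


if __name__ == "__main__":
    # (device, achieved GFLOP/s at 2048^3, theoretical peak in TFLOPS)
    devices = [
        ("RTX 5090 Laptop (BF16)", 19_200, 64.0),
        ("Intel ARL iGPU (FP32)", 259, 2.5),
    ]
    for name, achieved, peak in devices:
        print(f"{name}: {percent_of_peak(achieved, peak):.1f}% of peak")
```

Each device is compared against the peak of the precision path it actually runs, so the two percentages are not measured against the same baseline.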

Run.sh

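# Build the matrix multiplication tutorial target in Release mode.
cmake --build Build/Release --target TutorialCoreMatMulIntro -j
# Run the benchmark on the default device.
./Bin/Release/Tutorial/Core/TutorialCoreMatMulIntro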