Performance

Benchmarks

Real measurements on real hardware. GPU compute unless a row is marked otherwise. No CUDA. No vendor SDK.

Test Hardware

| Component | Spec |
| --- | --- |
| GPU | Intel HD Graphics 620 (KBL GT2, 24 EUs) |
| Machine | ThinkPad X1 Carbon 5th Gen (2017) |
| Type | Integrated GPU, shared system memory |
| Backend | Native GPU compute + Slang + bindless descriptors |

Running on a 2017 laptop iGPU. No discrete GPU. No CUDA cores. No tensor cores. Just 24 execution units on shared system memory.

Memory — OaMemcpy vs std::memcpy

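| Size | std::memcpy | OaMemcpy | Speedup |
| --- | --- | --- | --- |
| 1 KB | 7.53 GB/s | 8.46 GB/s | 1.12x |
| 64 KB | 10.49 GB/s | 13.91 GB/s | 1.33x |
| 1 MB | 11.70 GB/s | 11.70 GB/s | 1.00x |
| 16 MB | 7.98 GB/s | 10.63 GB/s | 1.33x |
| 64 MB | 2.12 GB/s | 10.82 GB/s | 5.10x |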
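For readers who want a comparable std::memcpy baseline on their own machine, the following is a minimal sketch of one way to measure copy bandwidth in GB/s. It is not the code in bench_hpc.cpp, and it does not call OaMemcpy, because this page does not give its signature. The function name `memcpy_bandwidth_gbps`, the repetition count, and the choice of copying roughly 64 MB per timed run are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: best observed std::memcpy bandwidth in GB/s
// (1 GB = 1e9 bytes) for a buffer of `bytes` bytes.
double memcpy_bandwidth_gbps(std::size_t bytes, int reps = 5) {
    assert(bytes > 0 && reps > 0);
    std::vector<unsigned char> src(bytes, 0xAB);
    std::vector<unsigned char> dst(bytes, 0x00);

    // Calling through a volatile function pointer keeps the compiler from
    // merging or removing the repeated copies.
    void* (*volatile copy)(void*, const void*, std::size_t) = std::memcpy;

    // ASSUMPTION: copy about 64 MB per timed run so that small buffers
    // still produce a duration the clock can resolve.
    const std::size_t iters =
        std::max<std::size_t>(1, (std::size_t{64} << 20) / bytes);

    copy(dst.data(), src.data(), bytes);  // warm-up run, touches every page

    double best_seconds = -1.0;
    for (int r = 0; r < reps; ++r) {
        const auto t0 = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < iters; ++i) {
            copy(dst.data(), src.data(), bytes);
        }
        const auto t1 = std::chrono::steady_clock::now();
        const double s = std::chrono::duration<double>(t1 - t0).count();
        if (best_seconds < 0.0 || s < best_seconds) best_seconds = s;
    }
    // Bytes moved in the fastest run, divided by its duration.
    return static_cast<double>(bytes) * static_cast<double>(iters)
           / best_seconds / 1e9;
}
```

Calling `memcpy_bandwidth_gbps(64u << 20)` gives a figure comparable in kind to the 64 MB row. Absolute numbers depend on the CPU, memory configuration, and compiler flags, so they should not be expected to match the table.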

GPU Compute — Dispatch Performance

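| Operation | Elements | Time | Throughput |
| --- | --- | --- | --- |
| Vector Add | 1M | 0.55 ms | 7.28 GB/s |
| Vector Add | 16M | 7.80 ms | 8.21 GB/s |
| Scale | 1M | 0.44 ms | 9.16 GB/s |
| Matmul (256x256) | 65K | 0.66 ms | |
| Matmul (512x512) | 262K | 5.59 ms | |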
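The page does not say how the Throughput column is derived. The Vector Add and Scale figures are consistent with counting each element once at 4 bytes (for example a 32-bit float), with 1M read as 10^6 elements. That is an inference from the numbers, not something stated here. Under that assumption, the arithmetic can be sketched as follows; `throughput_gbps` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>

// Effective throughput in GB/s (1 GB = 1e9 bytes) for a kernel that
// processed `elements` items in `milliseconds`.
// ASSUMPTION: each element is counted once, at 4 bytes by default.
double throughput_gbps(std::size_t elements, double milliseconds,
                       std::size_t bytes_per_element = 4) {
    assert(milliseconds > 0.0);
    const double bytes = static_cast<double>(elements) *
                         static_cast<double>(bytes_per_element);
    return bytes / (milliseconds * 1e-3) / 1e9;
}
```

With the rounded times from the table, `throughput_gbps(1000000, 0.55)` comes to about 7.27 and `throughput_gbps(16000000, 7.80)` to about 8.21, which matches the Vector Add rows to within the rounding of the reported times.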

OaComputeGraph — Replay vs Execute

| Chain Length | Execute() | Replay() | Speedup |
| --- | --- | --- | --- |
| 5 dispatches | 2.55 ms | 0.61 ms | 4.19x |
| 10 dispatches | 4.79 ms | 1.17 ms | 4.10x |
| 25 dispatches | 12.70 ms | 2.79 ms | 4.55x |
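One way to read these rows is as a per-dispatch cost. The sketch below only does arithmetic on the reported times. It does not use or model the OaComputeGraph API, and the struct and function names are hypothetical.

```cpp
#include <cassert>

// One row of the Replay vs Execute table (times in milliseconds).
struct ChainTiming {
    int dispatches;
    double execute_ms;
    double replay_ms;
};

// Average time per dispatch when the chain goes through Execute().
double execute_ms_per_dispatch(const ChainTiming& t) {
    assert(t.dispatches > 0);
    return t.execute_ms / t.dispatches;
}

// Average time per dispatch when the chain goes through Replay().
double replay_ms_per_dispatch(const ChainTiming& t) {
    assert(t.dispatches > 0);
    return t.replay_ms / t.dispatches;
}

// Ratio of Execute() time to Replay() time for the whole chain.
double speedup(const ChainTiming& t) {
    assert(t.replay_ms > 0.0);
    return t.execute_ms / t.replay_ms;
}
```

Applied to the three rows, Execute() works out to roughly 0.48 to 0.51 ms per dispatch and Replay() to roughly 0.11 to 0.12 ms per dispatch, so both paths scale close to linearly with chain length over this range. Speedups recomputed from the rounded times can differ from the table in the last digit.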

Post-Quantum Crypto

| Operation | Throughput |
| --- | --- |
| SHAKE-256 hash | 3.2M hashes/sec |
| Dilithium-3 sign | 45K signs/sec (CPU) |
| Dilithium-3 verify (batch, GPU) | 1.26M verifies/sec |
| Merkle root (1M leaves) | 12 ms |
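For context on what the Merkle root row measures, here is a minimal CPU sketch of the level-by-level reduction. It is not the project's implementation. The `combine` function is a non-cryptographic placeholder (an FNV-1a style mix) standing in for a real hash such as SHAKE-256, and pairing an odd trailing node with itself is an assumption, since this page does not describe the tree layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Digest = std::uint64_t;

// PLACEHOLDER, NOT CRYPTOGRAPHIC: mixes two child digests into a parent.
// A real tree would hash the concatenated children with e.g. SHAKE-256.
Digest combine(Digest left, Digest right) {
    Digest h = 0xcbf29ce484222325ULL;  // FNV-1a 64-bit offset basis
    const Digest parts[2] = {left, right};
    for (Digest part : parts) {
        for (int i = 0; i < 8; ++i) {
            h ^= (part >> (8 * i)) & 0xFF;  // feed one byte at a time
            h *= 0x100000001b3ULL;          // FNV-1a 64-bit prime
        }
    }
    return h;
}

// Reduces leaf digests to a single root, one tree level per loop pass.
// ASSUMPTION: an odd node at the end of a level is paired with itself.
Digest merkle_root(std::vector<Digest> level) {
    assert(!level.empty());
    while (level.size() > 1) {
        std::vector<Digest> parents;
        parents.reserve((level.size() + 1) / 2);
        for (std::size_t i = 0; i < level.size(); i += 2) {
            const Digest right =
                (i + 1 < level.size()) ? level[i + 1] : level[i];
            parents.push_back(combine(level[i], right));
        }
        level = std::move(parents);
    }
    return level[0];
}
```

Every pair within a level can be combined independently of the others, which is what makes this reduction a natural fit for batched GPU dispatch.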

The full benchmark suite is available in oa/tests/bench_hpc.cpp. Run it with: cd build/release && ../../bin/release/tests/bench_hpc