Performance
Benchmarks
Real measurements on real hardware. All GPU compute. No CUDA. No vendor SDK.
Test Hardware
| Component | Spec |
|---|---|
| GPU | Intel HD Graphics 620 (KBL GT2, 24 EUs) |
| Machine | ThinkPad X1 Carbon 5th Gen (2017) |
| Type | Integrated GPU — shared system memory |
| Backend | Native GPU compute + Slang + bindless descriptors |
Running on a 2017 laptop iGPU. No discrete GPU. No CUDA cores. No tensor cores. Just 24 execution units on shared system memory.
Memory — OaMemcpy vs std::memcpy
| Size | std::memcpy | OaMemcpy | Speedup |
|---|---|---|---|
| 1 KB | 7.53 GB/s | 8.46 GB/s | 1.12x |
| 64 KB | 10.49 GB/s | 13.91 GB/s | 1.33x |
| 1 MB | 11.70 GB/s | 11.70 GB/s | 1.00x |
| 16 MB | 7.98 GB/s | 10.63 GB/s | 1.33x |
| 64 MB | 2.12 GB/s | 10.82 GB/s | 5.10x |
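
For readers who want to sanity-check the `std::memcpy` column on their own machine, the following is a minimal sketch of how a copy-bandwidth figure can be obtained. It is not the harness used for the table above: the helper names `bandwidth_gbps` and `measure_memcpy_gbps` are hypothetical, 1 GB is assumed to mean 1e9 bytes, and `OaMemcpy` is deliberately left out because its signature is not shown here.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Converts a byte count and an elapsed time into GB/s.
// Assumption: 1 GB = 1e9 bytes (the table does not state which convention it uses).
double bandwidth_gbps(std::size_t bytes, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(bytes) / seconds / 1e9;
}

// Times `iterations` copies of `bytes` bytes with std::memcpy and returns the
// average bandwidth in GB/s. Hypothetical helper, not part of the OA API.
double measure_memcpy_gbps(std::size_t bytes, int iterations) {
    assert(bytes > 0 && iterations > 0);
    std::vector<unsigned char> src(bytes, 0xAB);
    std::vector<unsigned char> dst(bytes, 0x00);

    // Warm-up copy so page faults are not charged to the timed loop.
    std::memcpy(dst.data(), src.data(), bytes);

    volatile unsigned char sink = 0;  // discourages the compiler from dropping copies
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        src[0] = static_cast<unsigned char>(i);  // make each copy observably different
        std::memcpy(dst.data(), src.data(), bytes);
        sink = dst[0];
    }
    const auto stop = std::chrono::steady_clock::now();
    (void)sink;

    const double seconds = std::chrono::duration<double>(stop - start).count();
    return bandwidth_gbps(bytes * static_cast<std::size_t>(iterations), seconds);
}
```

Calling `measure_memcpy_gbps(64u << 20, 10)` would time ten 64 MB copies. Results depend heavily on cache size, memory configuration, and compiler flags, so they should not be expected to match the table.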
GPU Compute — Dispatch Performance
| Operation | Elements | Time | Throughput |
|---|---|---|---|
| Vector Add | 1M | 0.55 ms | 7.28 GB/s |
| Vector Add | 16M | 7.80 ms | 8.21 GB/s |
| Scale | 1M | 0.44 ms | 9.16 GB/s |
| Matmul (256x256) | 65K | 0.66 ms | — |
| Matmul (512x512) | 262K | 5.59 ms | — |
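
The matmul rows list no throughput. One conventional way to derive a rate is to count 2·N³ floating-point operations for a dense N×N multiply. The sketch below applies that count to the timings above. Whether it matches the actual kernel (algorithm, data type, tiling) is an assumption, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Floating-point operations in a dense N x N matrix multiply, counting one
// multiply and one add per inner-loop step (the conventional 2*N^3 figure).
// Assumption: the benchmarked kernel performs a plain dense multiply.
double matmul_flops(std::uint64_t n) {
    const double d = static_cast<double>(n);
    return 2.0 * d * d * d;
}

// GFLOP/s for an N x N multiply that took `milliseconds`.
double matmul_gflops(std::uint64_t n, double milliseconds) {
    assert(milliseconds > 0.0);
    const double seconds = milliseconds * 1e-3;
    return matmul_flops(n) / seconds / 1e9;
}
```

With the table's timings, `matmul_gflops(256, 0.66)` comes to about 50.8 and `matmul_gflops(512, 5.59)` to about 48.0, under the 2·N³ assumption.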
OaComputeGraph — Replay vs Execute
| Chain Length | Execute() | Replay() | Speedup |
|---|---|---|---|
| 5 dispatches | 2.55 ms | 0.61 ms | 4.19x |
| 10 dispatches | 4.79 ms | 1.17 ms | 4.10x |
| 25 dispatches | 12.70 ms | 2.79 ms | 4.55x |
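
Dividing each total by the chain length gives a rough per-dispatch cost, which makes the three rows easier to compare. The sketch below is plain arithmetic on the published numbers. The `ChainTiming` struct and helper names are hypothetical and not part of `OaComputeGraph`.

```cpp
#include <cassert>

// One row of the Replay vs Execute table. Hypothetical type for illustration.
struct ChainTiming {
    int dispatches;
    double execute_ms;
    double replay_ms;
};

// Average cost of a single dispatch within a chain, in milliseconds.
double per_dispatch_ms(double total_ms, int dispatches) {
    assert(dispatches > 0);
    return total_ms / dispatches;
}

// Ratio of Execute() time to Replay() time for one row.
double speedup(const ChainTiming& t) {
    assert(t.replay_ms > 0.0);
    return t.execute_ms / t.replay_ms;
}
```

Applied to the rows above, `Execute()` works out to roughly 0.48 to 0.51 ms per dispatch and `Replay()` to roughly 0.11 to 0.12 ms per dispatch at every chain length. Speedups recomputed from the rounded table values can differ from the printed column in the last digit.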
Post-Quantum Crypto
| Operation | Result |
|---|---|
| SHAKE-256 hash | 3.2M hashes/sec |
| Dilithium-3 sign (CPU) | 45K signatures/sec |
| Dilithium-3 verify (batch, GPU) | 1.26M verifications/sec |
| Merkle root (1M leaves) | 12 ms |
The full benchmark suite is in `oa/tests/bench_hpc.cpp`. Run it with `cd build/release && ../../bin/release/tests/bench_hpc`.