Architectures
OaGptOss — Transformer
A GPT-2-style transformer running on Vulkan GPU compute, intended as a direct nanoGPT comparison target. It uses a byte-level vocabulary (256 tokens) and Level 1 autograd dispatch.
Architecture
Standard GPT-2 transformer with pre-norm (LayerNorm before attention and FFN):
TokenEmbedding(256, D) + PositionalEmbedding(SeqLen, D)
→ N × [ LayerNorm → MultiHeadAttention → +res
        LayerNorm → Linear(D→4D) → GELU → Linear(4D→D) → +res ]
→ LayerNorm → Linear(D, 256) → CrossEntropy
Model Sizes
| Size | D | Heads | Layers | Params | VRAM (FP32) |
|---|---|---|---|---|---|
| atom | 192 | 6 | 6 | ~9.5M | ~150 MB |
| small | 768 | 12 | 12 | ~85M | ~1.3 GB |
| medium | 1024 | 16 | 24 | ~303M | ~4.6 GB |
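
The table's figures can be roughly cross-checked from the layer shapes in the architecture. The sketch below is a back-of-the-envelope estimate, not OaGptOss code. It assumes only the dominant weight matrices count (4·D² for attention, 8·D² for the FFN, plus the token embedding) and that training keeps four FP32 buffers per parameter (weight, gradient, and two AdamW moments). The function names are hypothetical.

```python
# Rough size estimate for a GPT-2-style stack; a sketch, not OaGptOss code.
VOCAB = 256  # byte-level vocabulary, as stated in the text


def approx_params(d: int, layers: int) -> int:
    """Dominant terms only (assumption): per layer, 4*D^2 for attention
    (Q, K, V, output projection) plus 8*D^2 for the FFN (D->4D, 4D->D),
    plus the token embedding. Biases, LayerNorm weights, positional
    embeddings and the output head are left out."""
    return 12 * layers * d * d + VOCAB * d


def approx_train_bytes(params: int) -> int:
    """Assumes four FP32 buffers per parameter: weight, gradient and the
    two AdamW moment estimates. Activations are not counted."""
    return params * 4 * 4


if __name__ == "__main__":
    for name, d, layers in [("small", 768, 12), ("medium", 1024, 24)]:
        p = approx_params(d, layers)
        gib = approx_train_bytes(p) / 2**30
        print(f"{name}: ~{p / 1e6:.1f}M params, ~{gib:.2f} GiB")
```

Under these assumptions the small and medium rows land close to the tabulated Params and VRAM values. The same formula gives a noticeably smaller count for atom (D = 192, 6 layers) than the ~9.5M listed, so that row may reflect a configuration detail not stated here.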
nanoGPT Comparison
OaGptOss is a direct nanoGPT equivalent on Vulkan GPU compute: the same architecture (GPT-2) and the same training procedure (AdamW with a cosine learning-rate schedule), with a byte-level vocabulary. All parameters are OaDeviceMatrix instances; there is no separate tensor type.
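
As an illustration of the cosine learning-rate schedule mentioned above, here is a minimal sketch in the style commonly used for GPT-2 training: linear warmup followed by cosine decay to a floor. The function name, the warmup phase and every default value are assumptions for illustration; the text does not state the settings OaGptOss actually uses.

```python
import math


def cosine_lr(step, max_lr=6e-4, min_lr=6e-5, warmup_steps=100, decay_steps=5000):
    """Learning rate at a given optimizer step (all defaults are illustrative)."""
    # Linear warmup from max_lr / warmup_steps up to max_lr.
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # After the decay horizon, hold the floor value.
    if step >= decay_steps:
        return min_lr
    # Cosine decay: coeff goes from 1 at the end of warmup to 0 at decay_steps.
    progress = (step - warmup_steps) / (decay_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + coeff * (max_lr - min_lr)


if __name__ == "__main__":
    for s in (0, 99, 100, 2550, 5000):
        print(s, cosine_lr(s))
```

The schedule is a pure function of the step index, so it can be evaluated on the host and passed to the optimizer as a scalar each step.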
Training
Training uses Level 1 autograd dispatch. Parameters are OaDeviceMatrix instances with SetRequiresGrad(true). OaFnGrad::Backward propagates gradients from the scalar cross-entropy loss, and OaAdamW applies the parameter updates via GPU compute shaders.
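
For reference, the per-element arithmetic of a standard AdamW update can be sketched on the CPU as follows. This is the textbook decoupled-weight-decay form, written for a single scalar parameter. It is not the OaAdamW shader code, and the function name and default hyperparameters are assumptions.

```python
import math


def adamw_step(p, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar parameter p with gradient g.

    m and v are the running first and second moment estimates and
    t is the 1-based step count. Returns the updated (p, m, v).
    """
    # Decoupled weight decay: shrink the parameter directly,
    # independent of the gradient statistics.
    p = p * (1.0 - lr * weight_decay)
    # Exponential moving averages of the gradient and its square.
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    # Bias correction for the zero-initialised moments.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v


if __name__ == "__main__":
    # Toy usage: minimise (p - 3)^2, whose gradient is 2 * (p - 3).
    p, m, v = 0.0, 0.0, 0.0
    for t in range(1, 2001):
        p, m, v = adamw_step(p, 2.0 * (p - 3.0), m, v, t, lr=0.01, weight_decay=0.0)
    print(p)
```

Because every element is updated independently, this kind of step maps naturally onto one GPU thread per parameter element.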
Inference
Autoregressive byte-by-byte generation with temperature sampling. Context window of SeqLen tokens with automatic truncation.
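
The generation loop described above can be sketched as follows. The model is represented by a hypothetical `logits_fn` callback that maps a list of byte values to 256 logits; it stands in for the forward pass and is not an OaGptOss API. The sketch also assumes that truncation keeps the most recent SeqLen tokens, which the text does not specify.

```python
import math
import random


def sample_byte(logits, temperature, rng):
    """Draw one byte index from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    # random.choices normalises the weights internally.
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(logits_fn, prompt, n_new, seq_len, temperature=1.0, seed=0):
    """Autoregressively append n_new bytes to prompt.

    logits_fn is a hypothetical stand-in for the model: it takes a list of
    byte values (at most seq_len long) and returns 256 next-byte logits.
    """
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(n_new):
        # Assumed truncation policy: keep only the most recent seq_len tokens.
        context = tokens[-seq_len:]
        tokens.append(sample_byte(logits_fn(context), temperature, rng))
    return bytes(tokens)
```

Lower temperatures sharpen the distribution toward the highest-logit byte, while higher temperatures flatten it; the temperature must be strictly positive in this sketch.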