Architectures

OaGptOss — Transformer

GPT-2-style transformer running on Vulkan GPU compute, intended as a direct comparison target for nanoGPT. It uses a byte-level vocabulary of 256 tokens and Level 1 autograd dispatch.
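A byte-level vocabulary of 256 needs no trained tokenizer, because every token id is simply one byte value. The following is a minimal sketch, assuming the text is encoded as UTF-8; the function names `encode` and `decode` are hypothetical and not part of OaGptOss.

```python
def encode(text: str) -> list[int]:
    """Map text to token ids: one id per UTF-8 byte, each in the range 0..255."""
    return list(text.encode("utf-8"))


def decode(ids: list[int]) -> str:
    """Map token ids back to text; invalid byte sequences become U+FFFD."""
    return bytes(ids).decode("utf-8", errors="replace")


ids = encode("héllo")
print(ids)          # non-ASCII characters expand to several byte tokens
print(decode(ids))  # round-trips back to the original string
```

Because non-ASCII characters occupy more than one byte, the token count of a string can exceed its character count.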

Architecture

Standard GPT-2 transformer with pre-norm (LayerNorm before attention and FFN):

TokenEmbedding(256, D) + PositionalEmbedding(SeqLen, D)
→ N × [ LayerNorm → MultiHeadAttention → +res
         LayerNorm → Linear(D→4D) → GELU → Linear(4D→D) → +res ]
→ LayerNorm → Linear(D, 256) → CrossEntropy
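The pre-norm wiring in the diagram above can be sketched in plain Python. This is an illustration only, not the OaGptOss implementation: `attention` and `ffn` are hypothetical callables that map a sequence of vectors to a sequence of the same shape, and `layer_norm` omits the learned scale and shift.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add(a, b):
    """Element-wise sum of two sequences of vectors (the residual connection)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def pre_norm_block(seq, attention, ffn):
    """One block: each sublayer reads a normalized copy of the residual stream,
    and its output is added back to the unnormalized stream."""
    seq = add(seq, attention([layer_norm(tok) for tok in seq]))
    seq = add(seq, ffn([layer_norm(tok) for tok in seq]))
    return seq
```

With pre-norm, the residual stream itself is never normalized inside a block, which is why the diagram applies one final LayerNorm before the output projection.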

Model Sizes

Size	D	Heads	Layers	Params	VRAM (FP32)
atom	192	6	6	~9.5M	~150 MB
small	768	12	12	~85M	~1.3 GB
medium	1024	16	24	~303M	~4.6 GB

nanoGPT Comparison

OaGptOss is a direct nanoGPT equivalent on Vulkan GPU compute: it uses the same architecture (GPT-2) and the same training procedure (AdamW with a cosine learning-rate schedule), with a byte-level vocabulary. All parameters are OaDeviceMatrix instances; there is no separate tensor type.
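A cosine learning-rate schedule of the kind mentioned above can be sketched as follows. The linear warmup phase and every default value here are assumptions chosen for illustration; the source does not state the schedule parameters OaGptOss uses.

```python
import math


def cosine_lr(step, max_lr=6e-4, min_lr=6e-5, warmup_steps=100, total_steps=5000):
    """Learning rate at a given step: linear warmup, then cosine decay to min_lr.

    All default values are hypothetical.
    """
    if step < warmup_steps:
        # Linear ramp from max_lr / warmup_steps up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    # progress runs from 0 at the end of warmup to 1 at total_steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + coeff * (max_lr - min_lr)
```

The schedule is evaluated once per optimizer step, and the result is passed to the optimizer as that step's learning rate.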

Training

Training uses Level 1 autograd dispatch. Parameters are OaDeviceMatrix instances with SetRequiresGrad(true). OaFnGrad::Backward propagates gradients from the scalar cross-entropy loss, and OaAdamW applies the updates via GPU compute shaders.
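For reference, the standard AdamW update rule (decoupled weight decay followed by a bias-corrected Adam step) looks like this on the CPU. This is a sketch of the textbook algorithm, not of the OaAdamW compute shaders; the function name and default hyperparameters are assumptions.

```python
import math


def adamw_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """Apply one AdamW update in place. t is the 1-based step count;
    m and v hold the running first and second moment estimates."""
    for i, g in enumerate(grads):
        # Decoupled weight decay: shrink the parameter directly,
        # independent of the gradient statistics.
        params[i] -= lr * weight_decay * params[i]
        # Exponential moving averages of the gradient and its square.
        m[i] = beta1 * m[i] + (1 - beta1) * g
        v[i] = beta2 * v[i] + (1 - beta2) * g * g
        # Bias correction compensates for the zero initialization of m and v.
        m_hat = m[i] / (1 - beta1 ** t)
        v_hat = v[i] / (1 - beta2 ** t)
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
```

The optimizer keeps two extra values per parameter (`m` and `v`), so the optimizer state alone is twice the size of the model weights.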

Inference

Generation is autoregressive, one byte at a time, with temperature sampling. The context window holds SeqLen tokens, and longer contexts are truncated automatically.
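The generation loop can be sketched as below. Here `model` is a hypothetical callable that takes a list of byte tokens and returns 256 logits for the next byte; it stands in for the real forward pass. Keeping the most recent tokens when truncating is also an assumption, since the source does not say which end of the context is dropped.

```python
import math
import random


def sample_with_temperature(logits, temperature, rng):
    """Sample an index from softmax(logits / temperature); temperature must be > 0."""
    scaled = [l / temperature for l in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(model, prompt, max_new_tokens, seq_len, temperature=1.0, seed=0):
    """Extend prompt one byte at a time and return the decoded text."""
    rng = random.Random(seed)
    tokens = list(prompt.encode("utf-8"))
    for _ in range(max_new_tokens):
        # Assumption: truncation keeps the most recent seq_len tokens.
        context = tokens[-seq_len:]
        logits = model(context)
        tokens.append(sample_with_temperature(logits, temperature, rng))
    return bytes(tokens).decode("utf-8", errors="replace")
```

Temperatures below 1 sharpen the distribution toward the most likely byte, while temperatures above 1 flatten it and produce more varied output.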