Text Generation — Character Model
A tiny character-level language model that predicts the next character from the previous 8. Validates the same OaEmbedding + OaLinear + OaAdamW path used by larger LLM-shaped models.
| Property | Value |
|---|---|
| API level | Level 1 — OaModule + autograd |
| Dataset | Inline lowercase character corpus, 27-token vocabulary |
| Architecture | Embedding(27,16) → flatten 8-context → Tanh MLP → vocab logits |
| Optimizer | AdamW (lr=0.01, weight_decay=0.01) |
| Training | 300 steps × batch 32 |
| Final loss | 0.8971 from 3.2760 initial |
| Batch accuracy | 75.0% on final minibatch |
| GPU throughput | 165,503 samples/s — RTX 5090 Laptop, hardware timestamps |
| Source | Tutorial/Ml/TutorialTextGenerationRnn.cpp |
1. The Dataset
An inline character corpus. Each training example is an 8-character context window paired with the next character label.
    Corpus: "hello world hello oa hello vulkan hello model ..."
    Vocab:  a–z (tokens 0–25) + space (token 26)
    Input:  previous 8 characters  [batch, 8]  UInt8
    Label:  next character         [batch]     UInt8
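The windowing described above can be sketched in standalone C++. This is a minimal illustration, not the tutorial's actual batch sampler: `EncodeChar`, `Example`, and `MakeExamples` are hypothetical names, and skipping out-of-vocabulary characters is an assumption of the sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: map 'a'..'z' to 0..25 and ' ' to 26, as in the vocab above.
// Returns -1 for any character outside the 27-token vocabulary.
inline int EncodeChar(char c) {
    if (c >= 'a' && c <= 'z') return c - 'a';
    if (c == ' ') return 26;
    return -1;
}

// One training example: the context tokens and the token that follows them.
struct Example {
    std::vector<std::uint8_t> context;
    std::uint8_t label;
};

// Slide a window over the corpus and emit every (context, next-char) pair.
// Characters outside the vocabulary are skipped (an assumption of this sketch).
inline std::vector<Example> MakeExamples(const std::string& corpus,
                                         std::size_t contextLen = 8) {
    std::vector<std::uint8_t> tokens;
    for (char c : corpus) {
        const int id = EncodeChar(c);
        if (id >= 0) tokens.push_back(static_cast<std::uint8_t>(id));
    }
    std::vector<Example> out;
    for (std::size_t i = 0; i + contextLen < tokens.size(); ++i) {
        const auto first = tokens.begin() + static_cast<std::ptrdiff_t>(i);
        Example ex;
        ex.context.assign(first, first + static_cast<std::ptrdiff_t>(contextLen));
        ex.label = tokens[i + contextLen];
        out.push_back(ex);
    }
    return out;
}
```

For the 11-character string "hello world" this yields three examples; the first pairs the context "hello wo" with the label 'r'.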
2. The Model
Token IDs flow through an embedding lookup, a flatten reshape, a Tanh-activated dense layer, and a logit head. There is no dedicated RNN cell; the model is built from the same primitives used by LLM-shaped models.
TutorialTextGenerationRnn.cpp

    class OaTextGenerationRnn : public OaModule {
    public:
        OaTextGenerationRnn() {
            Embed_  = OaMakeSharedPtr<OaEmbedding>(kVocabSize, kEmbedDim);             // 27, 16
            Hidden_ = OaMakeSharedPtr<OaLinear>(kContextLen * kEmbedDim, kHiddenDim);  // 128, 64
            Head_   = OaMakeSharedPtr<OaLinear>(kHiddenDim, kVocabSize);               // 64, 27
            RegisterModule("embed", Embed_);
            RegisterModule("hidden", Hidden_);
            RegisterModule("head", Head_);
        }

        OaDeviceMatrix Forward(const OaDeviceMatrix& InTokens) override {
            auto emb  = Embed_->Forward(InTokens);
            auto flat = OaFnMatrix::Reshape(emb, OaShape2D(InTokens.Size(0), kContextLen * kEmbedDim));
            auto h    = OaFnMatrix::Tanh(Hidden_->Forward(flat));
            return Head_->Forward(h);
        }

    private:
        OaSharedPtr<OaEmbedding> Embed_;
        OaSharedPtr<OaLinear> Hidden_, Head_;
    };
| Layer | Output Shape | Params |
|---|---|---|
| embedding (27→16) | [batch, 8, 16] | 432 |
| flatten context | [batch, 128] | 0 |
| dense+tanh (128→64) | [batch, 64] | 8,256 |
| dense head (64→27) | [batch, 27] | 1,755 |
| Total trainable parameters | | 10,443 |
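The parameter counts can be checked with a few lines of compile-time arithmetic. This is a standalone sketch; the helper names are illustrative, and it assumes each dense layer carries one bias per output, which is what the listed totals imply.

```cpp
#include <cstdint>

// Layer sizes as stated in the tutorial.
constexpr std::int64_t kVocab = 27;
constexpr std::int64_t kEmbed = 16;
constexpr std::int64_t kContext = 8;
constexpr std::int64_t kHidden = 64;

// Embedding table: one row of `dim` weights per token, no bias.
constexpr std::int64_t EmbeddingParams(std::int64_t vocab, std::int64_t dim) {
    return vocab * dim;
}

// Dense layer: weight matrix plus one bias per output (bias assumed present).
constexpr std::int64_t LinearParams(std::int64_t in, std::int64_t out) {
    return in * out + out;
}

constexpr std::int64_t kTotalParams =
    EmbeddingParams(kVocab, kEmbed)               // 27 * 16        = 432
    + LinearParams(kContext * kEmbed, kHidden)    // 128 * 64 + 64  = 8,256
    + LinearParams(kHidden, kVocab);              // 64 * 27 + 27   = 1,755

static_assert(kTotalParams == 10443, "matches the total in the layer table");
```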
3. Training Loop
TutorialTextGenerationRnn.cpp

    // Excerpt: model, rt, and trainTimer are set up elsewhere in the full source file.
    static const char* kCorpus =
        "hello world hello oa hello vulkan hello model "
        "tiny text generation tutorial trains next token prediction "
        "hello world hello oa hello vulkan hello model ";

    OaFnGrad::SetMode(OaGradMode::Dynamic);
    auto optimizer = OaMakeUniquePtr<OaAdamW>(model->AllParameterPtrs(), 0.01f);  // lr = 0.01
    TextBatchSampler sampler(kCorpus, /*batch=*/32);
    OaDeviceMatrix batchX, batchY;

    for (OaI32 step = 0; step < 300; ++step) {
        sampler.NextBatch(batchX, batchY);
        auto result = OaFnTrain::Step(rt, [&] {
            trainTimer.Begin(rt);  // GPU timing covers the whole optimization step
            optimizer->ZeroGrad();
            auto logits = model->Forward(batchX);
            auto loss = OaFnMatrix::CrossEntropyLoss(logits, batchY);
            OaFnGrad::Backward(loss);
            optimizer->Step();
            trainTimer.End(rt);
            return loss;
        });
    }
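For readers unfamiliar with AdamW, the update that `optimizer->Step()` stands for can be sketched for a single scalar parameter. This is the textbook AdamW rule with decoupled weight decay, not the internals of `OaAdamW`, which the excerpt does not show. `lr` and `weightDecay` come from the summary table; `beta1`, `beta2`, and `eps` are common defaults and are assumptions here, as are all names in the block.

```cpp
#include <cmath>

// Per-parameter optimizer state for this sketch.
struct AdamWState {
    double m = 0.0;  // first-moment (mean) estimate of the gradient
    double v = 0.0;  // second-moment (uncentered variance) estimate
    long t = 0;      // number of steps taken so far
};

// lr and weightDecay match the tutorial's table; the rest are assumed defaults.
struct AdamWConfig {
    double lr = 0.01;
    double weightDecay = 0.01;
    double beta1 = 0.9;
    double beta2 = 0.999;
    double eps = 1e-8;
};

// One AdamW update for a single scalar parameter; returns the new value.
inline double AdamWStep(double param, double grad, AdamWState& s,
                        const AdamWConfig& c = AdamWConfig()) {
    s.t += 1;
    // Decoupled weight decay: shrink the parameter directly, not via the gradient.
    param *= (1.0 - c.lr * c.weightDecay);
    // Exponential moving averages of the gradient and its square.
    s.m = c.beta1 * s.m + (1.0 - c.beta1) * grad;
    s.v = c.beta2 * s.v + (1.0 - c.beta2) * grad * grad;
    // Bias correction compensates for the zero-initialized averages.
    const double mHat = s.m / (1.0 - std::pow(c.beta1, static_cast<double>(s.t)));
    const double vHat = s.v / (1.0 - std::pow(c.beta2, static_cast<double>(s.t)));
    return param - c.lr * mHat / (std::sqrt(vHat) + c.eps);
}
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so a parameter of 1.0 with gradient 1.0 moves by roughly `lr` plus the small decay term.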
Training Curve — RTX 5090 Laptop, Vulkan 1.4
| Step | Train Loss | Batch Acc | GPU ms |
|---|---|---|---|
| 1 | 3.2760 | 21.9% | warmup |
| 75 | 1.5282 | 68.8% | 0.193 |
| 150 | 1.1378 | 78.1% | 0.192 |
| 225 | 1.0065 | 81.2% | 0.193 |
| 300 | 0.8971 | 75.0% | 0.193 |
GPU time/step: 0.193 ± 0.013 ms (p50 = 0.186 ms · p95 = 0.215 ms)
GPU throughput: 165,503 samples/s
Wall throughput: 95,461 samples/s
4. Generate Text
After training, greedy generation encodes an 8-character context, picks the highest-probability next token, and shifts the window. This validates the inference path used by larger autoregressive models.
TutorialTextGenerationRnn.cpp

    // Greedy generation: encode prompt, shift context window, argmax each step
    OaString generated = GenerateGreedy(*model, "hello", 32);
    // Prompt: hello
    // Generated: hello o releodoe ello oeelo iol oeoe
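The body of `GenerateGreedy` is not shown in this excerpt. The loop it describes can be sketched in standalone C++ as below. `LogitFn` is a hypothetical stand-in for the trained model, `GreedySketch` and the other names are illustrative, and left-padding short prompts with spaces is an assumption of this sketch rather than the tutorial's confirmed behavior.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

constexpr std::size_t kCtx = 8;
using Context = std::array<int, kCtx>;

// Hypothetical stand-in for the trained model: context tokens in, 27 logits out.
using LogitFn = std::function<std::vector<float>(const Context&)>;

// Vocabulary mapping from the tutorial; input is assumed to be a-z or space.
inline int Encode(char c) { return c == ' ' ? 26 : c - 'a'; }
inline char Decode(int id) { return id == 26 ? ' ' : static_cast<char>('a' + id); }

inline std::string GreedySketch(const LogitFn& model, const std::string& prompt,
                                std::size_t steps) {
    // Left-pad short prompts with spaces (a padding choice made for this sketch).
    Context ctx;
    ctx.fill(26);
    const std::size_t n = std::min(prompt.size(), kCtx);
    for (std::size_t i = 0; i < n; ++i)
        ctx[kCtx - n + i] = Encode(prompt[prompt.size() - n + i]);

    std::string out = prompt;
    for (std::size_t s = 0; s < steps; ++s) {
        const std::vector<float> logits = model(ctx);
        // Greedy choice: index of the largest logit.
        const int next = static_cast<int>(
            std::max_element(logits.begin(), logits.end()) - logits.begin());
        out.push_back(Decode(next));
        // Shift the window left by one and append the new token.
        std::rotate(ctx.begin(), ctx.begin() + 1, ctx.end());
        ctx.back() = next;
    }
    return out;
}
```

Because the choice is always the argmax, the output is deterministic for a given model and prompt. Sampling from the softmax distribution instead would give varied text.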
5. Gradient Mode Comparison
RTX 5090 Laptop
| Mode | Initial Loss | Final Loss | Batch Acc | Wall tok/s |
|---|---|---|---|---|
| Dynamic | 3.2737 | 0.9685 | 71.9% | 104,109 |
| Compiled | 3.3214 | 0.0646 | 96.9% | 89,008 |
| Auto | 3.3034 | 0.9363 | 75.0% | 105,087 |
Compiled mode reaches a much lower final loss (0.0646 vs 0.9685 for Dynamic) because this tiny model's activation buffers are stable across steps, enabling full graph replay. This is the scenario where OaGradMode::Compiled delivers its full benefit. On this GPU its wall throughput is lower than Dynamic (89,008 vs 104,109 tok/s).
Intel Arc (ARL) iGPU — same binary
| Mode | Final Loss | Batch Acc | Wall tok/s |
|---|---|---|---|
| Dynamic | 1.1931 | 65.6% | 27,543 |
| Compiled | 0.0663 | 96.9% | 33,218 |
| Auto | 1.1875 | 78.1% | 35,297 |
6. Cross-Device Portability
| Metric | RTX 5090 Laptop | Intel Arc (ARL) iGPU |
|---|---|---|
| Final loss | 0.8971 | 0.9384 |
| GPU time/step | 0.193 ms | 0.605 ms |
| GPU throughput | 165,503 samples/s | 52,924 samples/s |
| Wall throughput | 95,461 samples/s | 29,090 samples/s |
Both devices reduce the loss from the random-guess baseline of ln(27) ≈ 3.296 to below 1.0 in 300 steps, with no source changes between devices.
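The ln(27) baseline is the cross-entropy of a model that assigns equal probability to all 27 tokens. A minimal standalone sketch of softmax cross-entropy for one example shows this. `CrossEntropy` is an illustrative name, and this is not the implementation of `OaFnMatrix::CrossEntropyLoss`.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Softmax cross-entropy for a single example, in nats.
// Subtracting the max logit before exponentiating avoids overflow.
inline double CrossEntropy(const std::vector<double>& logits, std::size_t label) {
    const double maxLogit = *std::max_element(logits.begin(), logits.end());
    double sumExp = 0.0;
    for (double z : logits) sumExp += std::exp(z - maxLogit);
    const double logSumExp = maxLogit + std::log(sumExp);
    // -log softmax(logits)[label]
    return logSumExp - logits[label];
}
```

With 27 equal logits the result is ln(27) regardless of the label, which is why the initial losses in the tables (3.27 to 3.32) sit close to 3.296.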
Build & Run
Build.sh
    cmake --preset release
    ninja -C Build/Release tutorial_text_generation_rnn
    ./Bin/Release/Tutorial/tutorial_text_generation_rnn

    # Run through CTest
    ctest --test-dir Build/Release -R tutorial_text_generation_rnn --output-on-failure
Full Source
The complete tutorial including vocab encoding, batch sampler, greedy generation, and mode comparison is in Tutorial/Ml/TutorialTextGenerationRnn.cpp:
TutorialTextGenerationRnn.cpp

    // ═══════════════════════════════════════════════════════════════════════════
    // OA Tutorial: Character Text Generation - Tiny Next-Token Model
    // Level 1 API - OaModule + OaEmbedding + OaLinear + OaAdamW + OaFnGrad::Backward()
    // ═══════════════════════════════════════════════════════════════════════════
    //
    // Architecture:
    //   Token IDs [batch, 8] → OaEmbedding(27, 16) → flatten →
    //   OaLinear(128, 64) + Tanh → OaLinear(64, 27) logits
    //
    // Source: oa/Tutorial/Ml/TutorialTextGenerationRnn.cpp
    // ═══════════════════════════════════════════════════════════════════════════

    #include <Oa/Ml.h>
    #include <Oa/Ml/Metrics.h>

    static constexpr OaI32 kVocabSize = 27;
    static constexpr OaI32 kContextLen = 8;
    static constexpr OaI32 kEmbedDim = 16;
    static constexpr OaI32 kHiddenDim = 64;

    class OaTextGenerationRnn : public OaModule {
    public:
        OaTextGenerationRnn() {
            Embed_  = OaMakeSharedPtr<OaEmbedding>(kVocabSize, kEmbedDim);
            Hidden_ = OaMakeSharedPtr<OaLinear>(kContextLen * kEmbedDim, kHiddenDim);
            Head_   = OaMakeSharedPtr<OaLinear>(kHiddenDim, kVocabSize);
            RegisterModule("embed", Embed_);
            RegisterModule("hidden", Hidden_);
            RegisterModule("head", Head_);
        }

        OaDeviceMatrix Forward(const OaDeviceMatrix& InTokens) override {
            auto emb  = Embed_->Forward(InTokens);
            auto flat = OaFnMatrix::Reshape(emb, OaShape2D(InTokens.Size(0), kContextLen * kEmbedDim));
            auto h    = OaFnMatrix::Tanh(Hidden_->Forward(flat));
            return Head_->Forward(h);
        }

    private:
        OaSharedPtr<OaEmbedding> Embed_;
        OaSharedPtr<OaLinear> Hidden_, Head_;
    };

    // See full source at oa/Tutorial/Ml/TutorialTextGenerationRnn.cpp