Tutorial

Text Generation — Character Model

A tiny character-level language model that predicts the next character from the previous 8. It exercises the same OaEmbedding + OaLinear + OaAdamW path used by larger LLM-shaped models.

API level:        Level 1 (OaModule + autograd)
Dataset:          Inline lowercase character corpus, 27-token vocabulary
Architecture:     Embedding(27,16) → flatten 8-context → Tanh MLP → vocab logits
Optimizer:        AdamW (lr=0.01, weight_decay=0.01)
Training:         300 steps × batch 32
Final loss:       0.8971 (from 3.2760 initial)
Batch accuracy:   75.0% on final minibatch
GPU throughput:   165,503 samples/s (RTX 5090 Laptop, hardware timestamps)
Source:           Tutorial/Ml/TutorialTextGenerationRnn.cpp

1. The Dataset

An inline character corpus. Each training example is an 8-character context window paired with the next character label.

Corpus:  "hello world hello oa hello vulkan hello model ..."
Vocab:   a–z (tokens 0–25) + space (token 26)
Input:   previous 8 characters  [batch, 8]  UInt8
Label:   next character          [batch]     UInt8
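
The encoding and windowing described above can be sketched in plain C++17. This is an illustration only, not the tutorial's batch sampler: `EncodeChar`, `Example`, and `MakeExamples` are hypothetical names, and skipping out-of-vocabulary characters is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>
#include <utility>
#include <vector>

constexpr std::size_t kContextLen = 8;  // context window length, from the tutorial

// Map 'a'..'z' to 0..25 and ' ' to 26; anything else is outside the vocabulary.
std::optional<std::uint8_t> EncodeChar(char c) {
    if (c >= 'a' && c <= 'z') return static_cast<std::uint8_t>(c - 'a');
    if (c == ' ') return std::uint8_t{26};
    return std::nullopt;
}

struct Example {
    std::vector<std::uint8_t> context;  // kContextLen token IDs
    std::uint8_t label = 0;             // token ID of the character that follows
};

// Slide a window over the corpus: every position that has kContextLen
// predecessors and one successor yields a (context, label) pair.
std::vector<Example> MakeExamples(std::string_view corpus) {
    std::vector<std::uint8_t> tokens;
    for (char c : corpus) {
        if (auto t = EncodeChar(c)) tokens.push_back(*t);  // assumption: drop unknown chars
    }
    std::vector<Example> out;
    for (std::size_t i = 0; i + kContextLen < tokens.size(); ++i) {
        Example ex;
        ex.context.assign(tokens.begin() + i, tokens.begin() + i + kContextLen);
        ex.label = tokens[i + kContextLen];
        out.push_back(std::move(ex));
    }
    return out;
}
```

With this scheme, `MakeExamples("hello world")` yields three examples; the first pairs the context "hello wo" with the label 'r' (token 17).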

2. The Model

Token IDs flow through an embedding lookup, a flatten reshape, a Tanh-activated dense layer, and a logit head. Despite the file name, there is no dedicated RNN cell; the model uses the same primitives as LLM-shaped models.

TutorialTextGenerationRnn.cpp

class OaTextGenerationRnn : public OaModule {
public:
    OaTextGenerationRnn() {
        Embed_  = OaMakeSharedPtr<OaEmbedding>(kVocabSize, kEmbedDim);              // 27, 16
        Hidden_ = OaMakeSharedPtr<OaLinear>(kContextLen * kEmbedDim, kHiddenDim);   // 128, 64
        Head_   = OaMakeSharedPtr<OaLinear>(kHiddenDim, kVocabSize);                // 64, 27
        RegisterModule("embed", Embed_);
        RegisterModule("hidden", Hidden_);
        RegisterModule("head", Head_);
    }

    OaDeviceMatrix Forward(const OaDeviceMatrix& InTokens) override {
        auto emb  = Embed_->Forward(InTokens);
        auto flat = OaFnMatrix::Reshape(emb, OaShape2D(InTokens.Size(0), kContextLen * kEmbedDim));
        auto h    = OaFnMatrix::Tanh(Hidden_->Forward(flat));
        return Head_->Forward(h);
    }

private:
    OaSharedPtr<OaEmbedding> Embed_;
    OaSharedPtr<OaLinear> Hidden_, Head_;
};

Layer                      Output Shape     Params
──────────────────────────── ──────────────   ───────
embedding (27→16)           [batch, 8, 16]      432
flatten context             [batch, 128]           0
dense+tanh (128→64)         [batch, 64]        8,256
dense head (64→27)          [batch, 27]        1,755
──────────────────────────── ──────────────   ───────
Total trainable parameters                    10,443
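
The parameter counts in the table can be reproduced with a few lines of compile-time arithmetic. The sketch below assumes each dense layer carries a bias vector, which is what the table's numbers imply (128×64 + 64 = 8,256); `LinearParams` is a hypothetical helper, not part of the OA API.

```cpp
#include <cstdint>

// Sizes taken from the tutorial.
constexpr std::int64_t kVocabSize  = 27;
constexpr std::int64_t kContextLen = 8;
constexpr std::int64_t kEmbedDim   = 16;
constexpr std::int64_t kHiddenDim  = 64;

// A dense layer with bias holds in*out weights plus out biases.
constexpr std::int64_t LinearParams(std::int64_t in, std::int64_t out) {
    return in * out + out;
}

constexpr std::int64_t kEmbedParams  = kVocabSize * kEmbedDim;                             // 432
constexpr std::int64_t kHiddenParams = LinearParams(kContextLen * kEmbedDim, kHiddenDim);  // 8,256
constexpr std::int64_t kHeadParams   = LinearParams(kHiddenDim, kVocabSize);               // 1,755
constexpr std::int64_t kTotalParams  = kEmbedParams + kHiddenParams + kHeadParams;         // 10,443

static_assert(kTotalParams == 10443, "matches the total in the layer table");
```

Almost 80% of the parameters sit in the hidden layer, because flattening the 8-token context multiplies its input width by the context length.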

3. Training Loop

TutorialTextGenerationRnn.cpp

// Excerpt: rt, model, and trainTimer are not defined in this snippet;
// they are set up elsewhere in the tutorial source.
static const char* kCorpus =
    "hello world hello oa hello vulkan hello model "
    "tiny text generation tutorial trains next token prediction "
    "hello world hello oa hello vulkan hello model ";

OaFnGrad::SetMode(OaGradMode::Dynamic);
auto optimizer = OaMakeUniquePtr<OaAdamW>(model->AllParameterPtrs(), 0.01f);  // lr = 0.01
TextBatchSampler sampler(kCorpus, /*batch=*/32);
OaDeviceMatrix batchX, batchY;

for (OaI32 step = 0; step < 300; ++step) {
    sampler.NextBatch(batchX, batchY);  // batchX: [32, 8] contexts, batchY: [32] labels
    auto result = OaFnTrain::Step(rt, [&] {
        trainTimer.Begin(rt);
        optimizer->ZeroGrad();
        auto logits = model->Forward(batchX);
        auto loss = OaFnMatrix::CrossEntropyLoss(logits, batchY);
        OaFnGrad::Backward(loss);
        optimizer->Step();
        trainTimer.End(rt);
        return loss;
    });
}
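
For readers who want to see what an AdamW step does to a parameter, here is a CPU sketch of the textbook update with decoupled weight decay. It is not OaAdamW's implementation, which this tutorial does not show. The learning rate and weight decay come from the summary above; `beta1`, `beta2`, and `eps` are assumed common defaults, and `AdamWState`, `AdamWConfig`, and `AdamWStep` are hypothetical names.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Per-tensor optimizer state for the textbook AdamW update.
struct AdamWState {
    std::vector<float> m, v;  // first and second moment estimates
    int step = 0;
};

struct AdamWConfig {
    float lr = 0.01f;           // from the tutorial
    float weightDecay = 0.01f;  // from the tutorial
    float beta1 = 0.9f;         // assumed: common default
    float beta2 = 0.999f;       // assumed: common default
    float eps = 1e-8f;          // assumed: common default
};

void AdamWStep(std::vector<float>& param, const std::vector<float>& grad,
               AdamWState& st, const AdamWConfig& cfg = {}) {
    if (st.m.empty()) {
        st.m.assign(param.size(), 0.0f);
        st.v.assign(param.size(), 0.0f);
    }
    ++st.step;
    // Bias-correction factors for the zero-initialized moments.
    const float c1 = 1.0f - std::pow(cfg.beta1, static_cast<float>(st.step));
    const float c2 = 1.0f - std::pow(cfg.beta2, static_cast<float>(st.step));
    for (std::size_t i = 0; i < param.size(); ++i) {
        // Decoupled weight decay: shrink the weight directly, not via the gradient.
        param[i] -= cfg.lr * cfg.weightDecay * param[i];
        st.m[i] = cfg.beta1 * st.m[i] + (1.0f - cfg.beta1) * grad[i];
        st.v[i] = cfg.beta2 * st.v[i] + (1.0f - cfg.beta2) * grad[i] * grad[i];
        const float mHat = st.m[i] / c1;
        const float vHat = st.v[i] / c2;
        param[i] -= cfg.lr * mHat / (std::sqrt(vHat) + cfg.eps);
    }
}
```

On the very first step the bias-corrected ratio is close to ±1, so each weight moves by roughly the learning rate regardless of gradient scale. That is one reason lr=0.01 makes fast early progress on a model this small.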

Training Curve — RTX 5090 Laptop, Vulkan 1.4

Step   Train Loss   Batch Acc   GPU ms
────   ──────────   ─────────   ──────
1      3.2760       21.9%       warmup
75     1.5282       68.8%       0.193
150    1.1378       78.1%       0.192
225    1.0065       81.2%       0.193
300    0.8971       75.0%       0.193

GPU time/step: 0.193 ± 0.013 ms (p50 = 0.186 ms · p95 = 0.215 ms)
GPU throughput: 165,503 samples/s
Wall throughput: 95,461 samples/s

4. Generate Text

After training, greedy generation encodes an 8-character context, picks the highest-probability next token, and shifts the window. This validates the inference path used by larger autoregressive models.

TutorialTextGenerationRnn.cpp

// Greedy generation — encode prompt, shift context window, argmax each step
OaString generated = GenerateGreedy(*model, "hello", 32);
// Prompt: hello
// Generated: hello o releodoe ello oeelo iol oeoe
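
The body of GenerateGreedy is not shown in this excerpt. The loop it describes can be sketched in plain C++ as below, with the trained model replaced by a hypothetical callback (`LogitsFn`) that maps an 8-token context to 27 logits. Left-padding short prompts with spaces and mapping unknown characters to space are assumptions, not confirmed behavior.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <iterator>
#include <string>
#include <vector>

constexpr std::size_t kContextLen = 8;
constexpr std::uint8_t kSpace = 26;

using Context = std::array<std::uint8_t, kContextLen>;
// Hypothetical stand-in for the trained model: context in, 27 logits out.
using LogitsFn = std::function<std::vector<float>(const Context&)>;

std::uint8_t Encode(char c) {
    // Assumption: anything outside a-z is treated as space.
    return (c >= 'a' && c <= 'z') ? static_cast<std::uint8_t>(c - 'a') : kSpace;
}
char Decode(std::uint8_t t) { return t < 26 ? static_cast<char>('a' + t) : ' '; }

// Drop the oldest token and append a new one at the end of the window.
void Shift(Context& ctx, std::uint8_t tok) {
    std::rotate(ctx.begin(), ctx.begin() + 1, ctx.end());
    ctx.back() = tok;
}

std::string GenerateGreedySketch(const LogitsFn& model, const std::string& prompt, int numNew) {
    Context ctx;
    ctx.fill(kSpace);  // assumption: short prompts are left-padded with spaces
    for (char c : prompt) Shift(ctx, Encode(c));

    std::string out = prompt;
    for (int i = 0; i < numNew; ++i) {
        const std::vector<float> logits = model(ctx);
        const auto best = std::max_element(logits.begin(), logits.end());  // argmax
        const auto tok = static_cast<std::uint8_t>(std::distance(logits.begin(), best));
        out.push_back(Decode(tok));
        Shift(ctx, tok);
    }
    return out;
}
```

Because argmax is deterministic, a greedy decoder like this always returns the same continuation for a given prompt and model. On a model trained this briefly, that tends to produce repetitive fragments like the sample output above.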

5. Gradient Mode Comparison

RTX 5090 Laptop

Mode       Initial Loss   Final Loss   Batch Acc   Wall tok/s
────────   ────────────   ──────────   ─────────   ──────────
Dynamic    3.2737         0.9685       71.9%       104,109
Compiled   3.3214         0.0646       96.9%       89,008
Auto       3.3034         0.9363       75.0%       105,087

Compiled mode reaches a far lower final loss than Dynamic (0.0646 vs 0.9685), at somewhat lower wall throughput on this GPU (89,008 vs 104,109 tok/s), because this tiny model's activation buffers are stable across steps, which enables full graph replay. This is the scenario where OaGradMode::Compiled delivers its full benefit.

Intel Arc (ARL) iGPU — same binary

Mode       Final Loss   Batch Acc   Wall tok/s
────────   ──────────   ─────────   ──────────
Dynamic    1.1931       65.6%       27,543
Compiled   0.0663       96.9%       33,218
Auto       1.1875       78.1%       35,297

6. Cross-Device Portability

                  RTX 5090 Laptop     Intel Arc (ARL) iGPU
───────────────   ─────────────────   ────────────────────
Final loss        0.8971              0.9384
GPU time/step     0.193 ms            0.605 ms
GPU throughput    165,503 samples/s   52,924 samples/s
Wall throughput   95,461 samples/s    29,090 samples/s

Both devices reduce loss from the random baseline near ln(27) ≈ 3.296 to below 1.0 in 300 steps, with zero source changes between devices.
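
The ln(27) baseline is the cross-entropy of a model that assigns equal probability to all 27 tokens. A minimal single-example sketch of softmax cross-entropy makes this concrete. This is the standard formula, not OaFnMatrix::CrossEntropyLoss itself, and the function name is hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Softmax cross-entropy for one example: -log(softmax(logits)[label]).
// Precondition: logits is non-empty and label < logits.size().
double CrossEntropy(const std::vector<double>& logits, std::size_t label) {
    const double maxLogit = *std::max_element(logits.begin(), logits.end());
    double sumExp = 0.0;
    for (double z : logits) sumExp += std::exp(z - maxLogit);  // shift for numerical stability
    return -(logits[label] - maxLogit - std::log(sumExp));
}
```

With 27 equal logits, the result is ln(27) ≈ 3.2958 for any label, which matches the initial losses reported above (3.27 to 3.32). Read the other way, a final loss of 0.8971 means the model assigns a geometric-mean probability of about exp(-0.8971) ≈ 0.41 to the correct next character.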

Build & Run

Build.sh

cmake --preset release
ninja -C Build/Release tutorial_text_generation_rnn
./Bin/Release/Tutorial/tutorial_text_generation_rnn
# Run through CTest
ctest --test-dir Build/Release -R tutorial_text_generation_rnn --output-on-failure

Full Source

The complete tutorial including vocab encoding, batch sampler, greedy generation, and mode comparison is in Tutorial/Ml/TutorialTextGenerationRnn.cpp:

TutorialTextGenerationRnn.cpp

// ═══════════════════════════════════════════════════════════════════════════
// OA Tutorial: Character Text Generation — Tiny Next-Token Model
// Level 1 API — OaModule + OaEmbedding + OaLinear + OaAdamW + OaFnGrad::Backward()
// ═══════════════════════════════════════════════════════════════════════════
//
// Architecture:
//   Token IDs [batch, 8] → OaEmbedding(27, 16) → flatten →
//   OaLinear(128, 64) + Tanh → OaLinear(64, 27) logits
//
// Source: oa/Tutorial/Ml/TutorialTextGenerationRnn.cpp
// ═══════════════════════════════════════════════════════════════════════════
#include <Oa/Ml.h>
#include <Oa/Ml/Metrics.h>

static constexpr OaI32 kVocabSize  = 27;
static constexpr OaI32 kContextLen = 8;
static constexpr OaI32 kEmbedDim   = 16;
static constexpr OaI32 kHiddenDim  = 64;

class OaTextGenerationRnn : public OaModule {
public:
    OaTextGenerationRnn() {
        Embed_  = OaMakeSharedPtr<OaEmbedding>(kVocabSize, kEmbedDim);
        Hidden_ = OaMakeSharedPtr<OaLinear>(kContextLen * kEmbedDim, kHiddenDim);
        Head_   = OaMakeSharedPtr<OaLinear>(kHiddenDim, kVocabSize);
        RegisterModule("embed", Embed_);
        RegisterModule("hidden", Hidden_);
        RegisterModule("head", Head_);
    }

    OaDeviceMatrix Forward(const OaDeviceMatrix& InTokens) override {
        auto emb  = Embed_->Forward(InTokens);
        auto flat = OaFnMatrix::Reshape(emb, OaShape2D(InTokens.Size(0), kContextLen * kEmbedDim));
        auto h    = OaFnMatrix::Tanh(Hidden_->Forward(flat));
        return Head_->Forward(h);
    }

private:
    OaSharedPtr<OaEmbedding> Embed_;
    OaSharedPtr<OaLinear> Hidden_, Head_;
};

// See full source at oa/Tutorial/Ml/TutorialTextGenerationRnn.cpp