Training a Small Digit Adder Transformer on My Laptop

February 28, 2026

I explore how very small ML models can, in some cases, train faster on a CPU than on a GPU. I use the recent AdderBoard as a source of models and inspiration. For one of these models I achieved an ~8x per-iteration speedup on my laptop (CPU, custom C++, about 2.2 ms per iteration) compared to a PyTorch training run on an A100 (about 17.9 to 20.1 ms per iteration, depending on configuration).

These are very small transformer models designed to reliably compute integer addition. For the training version of this competition, I suspected these models were small enough that framework and launch overheads might dominate on GPU. So I picked the current-best trained adder, followed the instructions to reproduce training, and built an optimized version of the same training loop.

What I found: the speedup is primarily from removing a lot of framework/dispatch/allocation overhead on the CPU side and avoiding GPU kernel-launch overhead for tiny kernels. Cache residency and parallelism are also contributors. You can sometimes see apparent superlinear speedup when the per-thread working set moves into faster caches (for CPU: private L1/L2, shared L3), but that is not the only mechanism here.

The code for my C++ implementation of the training lives in this repository.

Methodology: What Is an Iteration?

Per-iteration timing in this post is intended to cover:

Excluded from per-iteration timing:

Warmup / startup handling:

Quick Look at GPU Performance (original)

I spun up a GCP A100 instance and followed the README to run training from scratch. I also traced execution to understand overhead and utilization.

Because the model is so small, kernel-launch overhead is significant. It shows up in the trace as gaps between stretches of meaningful compute.

Excluding the traced runs, per-iteration latency is about 20 ms.

| Configuration | Total Time (s) | Total Iters | Excl. Startup Time (s) | Excl. Startup Iters | Per Iteration (ms) |
| --- | --- | --- | --- | --- | --- |
| 1. GPU Baseline | 20.7 | 1,000 | 20.1 | 999 | ~20.1 |
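
The Per Iteration column is consistent with dividing the startup-excluded time by the startup-excluded iteration count. Here is a minimal Python sketch of that arithmetic, assuming that is how the column was derived. The function name is mine and is not part of the original training script.

```python
def per_iteration_ms(excl_startup_seconds: float, excl_startup_iters: int) -> float:
    """Steady-state latency per iteration, in milliseconds."""
    if excl_startup_iters <= 0:
        raise ValueError("need at least one timed iteration")
    return 1000.0 * excl_startup_seconds / excl_startup_iters


# GPU baseline row: 20.1 s over the 999 iterations that follow startup.
baseline_ms = per_iteration_ms(20.1, 999)
print(f"{baseline_ms:.1f} ms")  # prints "20.1 ms"
```

Dropping the first iteration keeps one-time costs (process start, first-call initialization) out of the steady-state number.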

Optimize with torch.compile and CUDA Graphs

Given the launch overhead, I looked for a quick win with CUDA Graphs. CUDA Graphs record a sequence of launches and replay them, which often reduces launch/dispatch overhead for fixed-shape workloads. It helps here, but not dramatically.

| Configuration | Total Time (s) | Total Iters | Excl. Startup Time (s) | Excl. Startup Iters | Per Iteration (ms) |
| --- | --- | --- | --- | --- | --- |
| 1. GPU Baseline | 20.7 | 1,000 | 20.1 | 999 | ~20.1 |
| 2. GPU torch.compile | 26.0 | 1,000 | 19.2 | 999 | ~19.2 |
| 3. GPU compile + CUDA Graphs | 24.8 | 1,000 | 17.9 | 999 | ~17.9 |

Interpretation: total wall time can get worse while steady-state per-iteration improves, because torch.compile adds upfront compilation overhead.

If you care about time-to-first-1000-steps, baseline can still win. If you care about long steady-state runs, compile/graphs can win.
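
To put a rough number on that trade-off, here is a back-of-the-envelope break-even estimate in Python. It assumes total time is simply startup plus steps times the steady-state per-iteration latency, and it estimates startup as total time minus startup-excluded time from the table. Both are simplifications of mine, and the inputs are the rounded table values, so treat the result as an order-of-magnitude guide.

```python
def break_even_steps(startup_a_s, per_iter_a_ms, startup_b_s, per_iter_b_ms):
    """Steps after which config B's total time drops below config A's.

    Models total time as startup + steps * per_iter.
    Returns None if B is not faster per iteration and so never catches up.
    """
    extra_startup_ms = (startup_b_s - startup_a_s) * 1000.0
    saving_per_step_ms = per_iter_a_ms - per_iter_b_ms
    if saving_per_step_ms <= 0:
        return None
    return max(0.0, extra_startup_ms / saving_per_step_ms)


# Startup estimated as (total time) - (startup-excluded time), from the table.
baseline_startup = 20.7 - 20.1  # about 0.6 s
compile_startup = 26.0 - 19.2   # about 6.8 s
graphs_startup = 24.8 - 17.9    # about 6.9 s

print(round(break_even_steps(baseline_startup, 20.1, compile_startup, 19.2)))
print(round(break_even_steps(baseline_startup, 20.1, graphs_startup, 17.9)))
```

Under these assumptions, torch.compile alone needs roughly 6,900 steps to pay back its compile time, and compile + CUDA Graphs needs roughly 2,900. That is why the baseline still wins a 1,000-step run on total wall time.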

Quick Look at CPU Performance (original)

As alluded to earlier, I suspected this model was tiny enough that GPU overhead (kernel launch and framework plumbing) could offset raw hardware advantage for this specific workload. I also ran PyTorch on CPU as a rough sanity check.

Environment details for CPU sanity check:

Laptop (inlined C++)

The GPU beats untuned PyTorch CPU on that system, but this model is so small that it should fit in cache, so a specialized C++ implementation can push overhead down a lot. I asked Codex to write a single-file C++ training loop with model-size limits hardcoded as constexpr, which gives the compiler room for aggressive optimization (unrolling/vectorization).

This performed surprisingly well on my laptop. I also parallelized with OpenMP. For this problem size, parallelism plus cache effects can produce apparent superlinear behavior when the working set per thread drops into faster cache levels.

Raw text for the exact compile + run commands used:

CXXFLAGS="-Ofast -march=native -mtune=native -flto -ffast-math -funroll-loops -fopenmp-simd -fstrict-aliasing -fno-trapping-math -fno-math-errno -falign-functions=32 -falign-loops=32" make
./train_311p --steps 500000 --lr 0.02 --lr-stages 162000:0.001,262000:0.0003,312000:0.0001,400000:0.00005
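
The `--lr-stages` argument reads like a piecewise-constant learning-rate schedule written as `step:lr` pairs. The Python sketch below shows one plausible interpretation. The parsing rule, the function names, and the "switch once step >= boundary" semantics are my assumptions for illustration and are not taken from the C++ source.

```python
def parse_lr_stages(spec: str) -> list[tuple[int, float]]:
    """Parse 'step:lr,step:lr,...' into a list of (step, lr) sorted by step."""
    stages = []
    for item in spec.split(","):
        step_text, lr_text = item.split(":")
        stages.append((int(step_text), float(lr_text)))
    return sorted(stages)


def lr_at(step: int, base_lr: float, stages: list[tuple[int, float]]) -> float:
    """Learning rate at a given step: base_lr until the first boundary,
    then the rate of the most recent stage whose boundary has been reached."""
    lr = base_lr
    for start_step, stage_lr in stages:
        if step >= start_step:
            lr = stage_lr
    return lr


stages = parse_lr_stages("162000:0.001,262000:0.0003,312000:0.0001,400000:0.00005")
for step in (0, 162_000, 300_000, 499_999):
    print(step, lr_at(step, 0.02, stages))
```

Read this way, the run starts at 0.02 and steps the rate down four times, ending at 0.00005 for the final 100,000 steps.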

| Configuration | Total Time (s) | Total Iters | Excl. Startup Time (s) | Excl. Startup Iters | Per Iteration (ms) |
| --- | --- | --- | --- | --- | --- |
| 0. Inlined C++ Training | | | | | ~2.2 |
| 1. GPU Baseline | 20.7 | 1,000 | 20.1 | 999 | ~20.1 |
| 2. GPU torch.compile | 26.0 | 1,000 | 19.2 | 999 | ~19.2 |
| 3. GPU compile + CUDA Graphs | 24.8 | 1,000 | 17.9 | 999 | ~17.9 |
| 4. PyTorch CPU eager (not tuned, sanity check) | 31.8 | 20 | 0.9 | 19 | ~47.0 |
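
For reference, the speedup figures follow directly from the Per Iteration column. A small Python sketch of that arithmetic is below. The helper name is mine, and the inputs are the rounded table values, so the ratios are approximate.

```python
def speedup(reference_ms: float, candidate_ms: float) -> float:
    """How many times faster the candidate is than the reference, per iteration."""
    return reference_ms / candidate_ms


CPP_MS = 2.2  # row 0: inlined C++ on the laptop

reference_rows = {
    "GPU baseline": 20.1,
    "GPU torch.compile": 19.2,
    "GPU compile + CUDA Graphs": 17.9,
    "PyTorch CPU eager": 47.0,
}

for name, ms in reference_rows.items():
    print(f"{name}: {speedup(ms, CPP_MS):.1f}x")
```

This gives about 9.1x against the GPU baseline and about 8.1x against the best GPU configuration (compile + CUDA Graphs), which is where the ~8x headline number comes from.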

Fairness / Controls / Caveats

What I held constant (to the extent reproduced from the training script):

What I did not fully control (or did not record well enough):

This comparison is not apples-to-apples hardware benchmarking (PyTorch on A100 vs hand-written C++ on a laptop). I still think it is useful as a case study: for tiny models, overheads can dominate enough that a specialized CPU implementation wins on per-iteration latency.

GPU cache note: GPUs do have a cache hierarchy, but it behaves differently from a CPU's (e.g., per-SM L1/shared memory plus a shared L2), so I am not claiming a 1:1 cache analogy. A realistic GPU path here is likely better operator fusion and fewer launches (possibly with graph capture), rather than assuming that one giant shared-memory “megakernel” is always practical.

Validation

During training I withhold 10 sums and validate against them. I observe the same “grokking” behavior that the original author observed.

| # | Operand A | Operand B | Actual Sum | 100k (Rel. Error) | 300k (Rel. Error) | 500k (Rel. Error) |
| --- | --- | --- | --- | --- | --- | --- |
| [0] | 314,159,265 | 271,828,182 | 585,987,447 | 11.30% | 2.73% | 0.00% |
| [1] | 987,654,321 | 123,456,789 | 1,111,111,110 | 1.00% | 1.01% | 0.00% |
| [2] | 555,555,555 | 444,444,445 | 1,000,000,000 | 2.22% | 10.00% | 0.00% |
| [3] | 808,080,808 | 90,909,090 | 898,989,898 | 13.71% | 0.10% | 0.00% |
| [4] | 246,813,579 | 135,792,468 | 382,606,047 | 7.74% | 3.30% | 0.00% |
| [5] | 112,233,445 | 556,677,889 | 668,911,334 | 12.45% | 1.66% | 0.00% |
| [6] | 99,999,999 | 12,345,678 | 112,345,677 | 9.01% | 2.09% | 0.00% |
| [7] | 420,420,420 | 133,713,371 | 554,133,791 | 0.35% | 2.85% | 0.00% |
| AVG | | | | 7.22% | 2.97% | 0.00% |
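
The relative-error columns can be read with the usual definition, |predicted - actual| / actual. The Python sketch below assumes that definition (the post does not spell it out) and uses a made-up prediction for row [2], since the model's raw outputs are not listed. It also recomputes the AVG row from the per-row percentages.

```python
from statistics import mean


def relative_error_pct(predicted: int, actual: int) -> float:
    """Relative error as a percentage: 100 * |predicted - actual| / actual."""
    return 100.0 * abs(predicted - actual) / actual


# Hypothetical prediction for row [2] (actual sum 1,000,000,000):
# being off by 100,000,000 would yield a 10% relative error.
print(relative_error_pct(1_100_000_000, 1_000_000_000))

# Per-row relative errors copied from the table above.
errors_100k = [11.30, 1.00, 2.22, 13.71, 7.74, 12.45, 9.01, 0.35]
errors_300k = [2.73, 1.01, 10.00, 0.10, 3.30, 1.66, 2.09, 2.85]

print(mean(errors_100k), mean(errors_300k))
```

The two means land within rounding of the 7.22% and 2.97% shown in the AVG row.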

Conclusion

When a problem is this small, overhead terms can dominate and move the result in unintuitive ways. In this case, removing framework/dispatch/allocation overhead and avoiding frequent tiny GPU launches produced most of the observed gain, with cache residency and parallelism helping further.

So yes, apparent superlinear effects can show up when the working set moves into faster caches, but that should be treated as one contributor rather than the sole explanation.

I still think there is room to recover more GPU performance here with better fusion and fewer launches; a monolithic training megakernel is one extreme, but not the only path.