Microgpt
Community Article Published February 12, 2026
I tried to optimize the code of the newly released microgpt by @karpathy, because I thought some parts could be compressed further, and it turns out they can.
https://github.com/NJX-njx/microgpt
I made the following optimizations:
- Weight tying — reduces parameters and memory.
- AdamW optimizer — better generalization via decoupled weight decay.
- Cosine LR schedule — smooth, stable learning-rate decay.
- Gradient clipping — prevents exploding grads; stabilizes updates.
- Train/validation split + periodic eval — detects overfitting; monitors progress.
- Fused cross-entropy — cheaper forward/backward compute.
- Top-k sampling + temperature — more coherent and diverse generation.
- Per-step timing — quick performance and throughput insights.
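Weight tying shares a single matrix between the token embedding and the output projection, so those parameters are stored (and updated) once instead of twice. Since the microgpt source isn't reproduced here, this is a minimal pure-Python sketch; the names `wte` and `lm_head` are illustrative, not microgpt's actual identifiers:

```python
import random

vocab_size, n_embd = 8, 4

# Token-embedding matrix: one row of size n_embd per vocabulary token.
wte = [[random.gauss(0, 0.02) for _ in range(n_embd)] for _ in range(vocab_size)]

# Weight tying: the output head is the *same object*, not a copy,
# so an update through either view changes both.
lm_head = wte

def logits(hidden):
    # Project a hidden vector back to vocabulary logits with the tied matrix.
    return [sum(h * w for h, w in zip(hidden, row)) for row in lm_head]

wte[0][0] = 1.0
assert lm_head[0][0] == 1.0  # tied: the change is visible through both names
```

The aliasing is the whole trick: there is no second matrix to allocate or synchronize.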
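The "W" in AdamW is decoupled weight decay: the decay is applied directly to the parameter rather than folded into the gradient, which tends to generalize better than Adam with L2 regularization. A scalar-parameter sketch of one update (hyperparameters are illustrative, not microgpt's):

```python
import math

def adamw_step(p, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.95, eps=1e-8, wd=0.1):
    """One AdamW update for a scalar parameter p with gradient g.
    Returns (new_p, new_m, new_v)."""
    m = beta1 * m + (1 - beta1) * g          # first-moment EMA
    v = beta2 * v + (1 - beta2) * g * g      # second-moment EMA
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    p = p - lr * wd * p                      # decoupled weight decay: acts on p, not on g
    return p, m, v

# Drive a parameter toward 0 on the toy loss L = p^2 (gradient 2p).
p, m, v = 1.0, 0.0, 0.0
for t in range(1, 201):
    p, m, v = adamw_step(p, 2 * p, m, v, t)
```

Note the decay line never touches `m` or `v`; that separation from the adaptive moments is exactly what distinguishes AdamW from plain Adam with L2.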
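A cosine schedule decays the learning rate smoothly from its peak to a floor, usually after a short linear warmup. A self-contained sketch with made-up step counts and rates:

```python
import math

max_lr, min_lr = 3e-4, 3e-5
warmup_steps, max_steps = 100, 1000   # illustrative values

def get_lr(step):
    # Linear warmup from ~0 up to max_lr over warmup_steps...
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # ...then cosine decay from max_lr down to min_lr over the rest.
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    progress = min(progress, 1.0)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes 1 -> 0
    return min_lr + coeff * (max_lr - min_lr)
```

Each training step then sets the optimizer's learning rate to `get_lr(step)` before the update.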
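Gradient clipping rescales all gradients whenever their global L2 norm exceeds a threshold, capping the size of any single update. A dependency-free sketch over a flat gradient list:

```python
import math

def clip_grad_norm(grads, max_norm=1.0):
    # Scale every gradient down in place if the global L2 norm exceeds max_norm.
    total = math.sqrt(sum(g * g for g in grads))
    if total > max_norm:
        scale = max_norm / (total + 1e-6)
        grads[:] = [g * scale for g in grads]
    return total  # pre-clip norm, handy for logging spikes

grads = [3.0, 4.0]                 # global norm 5.0
pre = clip_grad_norm(grads, 1.0)   # grads now have norm ~1.0
```

Logging the returned pre-clip norm is a cheap way to spot the loss spikes this guard is protecting against.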
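The train/validation split, periodic eval, and per-step timing fit together in the training loop. A schematic sketch (the 90/10 split, eval interval, and placeholder comments are illustrative, not microgpt's actual values):

```python
import random
import time

data = list(range(1000))            # stand-in for the tokenized dataset
random.seed(42)
random.shuffle(data)
split = int(0.9 * len(data))
train_data, val_data = data[:split], data[split:]   # 90/10 split

eval_every = 50
for step in range(1, 101):
    t0 = time.time()
    # ... one forward/backward/optimizer step on a batch from train_data ...
    dt = time.time() - t0           # per-step wall-clock time
    if step % eval_every == 0:
        # ... mean loss over val_data, no gradient updates: if it rises while
        # train loss keeps falling, the model is overfitting ...
        print(f"step {step}: {dt * 1000:.2f} ms/step")
```

The key invariant is that `val_data` is never trained on, so its loss is an honest progress signal.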
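"Fusing" the cross-entropy means computing `logsumexp(logits) - logits[target]` directly instead of materializing a full softmax vector and then taking a log of one entry; it is both cheaper and numerically stable. A single-position sketch:

```python
import math

def cross_entropy(logits, target):
    # Fused log-softmax + NLL: loss = logsumexp(logits) - logits[target].
    # Subtracting the max keeps exp() from overflowing on large logits.
    mx = max(logits)
    lse = mx + math.log(sum(math.exp(l - mx) for l in logits))
    return lse - logits[target]
```

The backward pass is equally compact: the gradient w.r.t. the logits is `softmax(logits)` with 1 subtracted at the target index.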
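Top-k sampling masks everything but the k most likely tokens before sampling, and temperature rescales the logits first (below 1 sharpens the distribution, above 1 flattens it). A self-contained sketch of the combined sampler:

```python
import math
import random

def sample_top_k(logits, k=2, temperature=0.8):
    # Temperature first, then mask all but the k largest scaled logits.
    scaled = [l / temperature for l in logits]
    cutoff = sorted(scaled, reverse=True)[k - 1]
    masked = [l if l >= cutoff else float("-inf") for l in scaled]
    # Softmax over the survivors (exp(-inf) == 0.0, so masked tokens get p=0).
    mx = max(masked)
    exps = [math.exp(l - mx) for l in masked]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Inverse-CDF sampling from the truncated distribution.
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1
```

With `k=2`, only the two highest-logit tokens can ever be drawn, which is what trims the low-probability tail that makes greedy-free sampling incoherent.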
