Microgpt

Community Article Published February 12, 2026

I tried to optimize the code of microgpt, newly released by @karpathy, because I thought some parts could be compressed further, and that turned out to be true.

https://github.com/NJX-njx/microgpt

I made the following optimizations:

  • Weight tying — reduces parameters and memory.
  • AdamW optimizer — better generalization via decoupled weight decay.
  • Cosine LR schedule — smooth, stable learning-rate decay.
  • Gradient clipping — prevents exploding grads; stabilizes updates.
  • Train/validation split + periodic eval — detects overfitting; monitors progress.
  • Fused cross-entropy — cheaper forward/backward compute.
  • Top-k sampling + temperature — more coherent and diverse generation.
  • Per-step timing — quick performance and throughput insights.
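To make the first bullet concrete, here is a minimal sketch of weight tying in plain Python (microgpt is dependency-free, so no tensor library is used). The variable names (`wte`, `lm_head`, `logits_for`) are illustrative, not microgpt's actual identifiers: the idea is simply that the output head reuses the token-embedding table instead of owning a separate copy.

```python
vocab_size, n_embd = 4, 3

# token embedding table: one row (list of floats) per token id
wte = [[0.0] * n_embd for _ in range(vocab_size)]

# weight tying: the lm_head matrix IS the embedding table (same object),
# so gradients from the head flow into the embeddings and the model
# stores vocab_size * n_embd fewer parameters
lm_head = wte

def logits_for(hidden):
    # project a hidden vector back to vocabulary logits with the tied weights
    return [sum(w * h for w, h in zip(row, hidden)) for row in lm_head]

# mutating the embedding table is visible through the head, proving the tie
wte[0][0] = 1.0
```

Because both names point at one parameter matrix, any update applied through either path trains the same weights.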
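The cosine schedule from the list can be sketched as a tiny stdlib-only function. The warmup length and learning-rate bounds below are placeholder hyperparameters of mine, not values from the repo:

```python
import math

def cosine_lr(step, max_steps, base_lr=1e-3, min_lr=1e-4, warmup=100):
    # linear warmup up to base_lr, then a smooth cosine decay down to min_lr
    if step < warmup:
        return base_lr * (step + 1) / warmup
    t = (step - warmup) / max(1, max_steps - warmup)  # decay progress in [0, 1]
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```

The cosine shape decays slowly at first, fastest in the middle, and flattens out near `min_lr`, which is what makes the decay feel smooth compared to step drops.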
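Gradient clipping by global norm is a one-liner's worth of math; a plain-Python sketch over a flat list of gradient values (real code would iterate over all parameter gradients):

```python
import math

def clip_grad_norm(grads, max_norm=1.0):
    # global-norm clipping: if the overall gradient norm exceeds max_norm,
    # rescale every component so the norm becomes exactly max_norm
    total = math.sqrt(sum(g * g for g in grads))
    if total > max_norm:
        scale = max_norm / total
        grads = [g * scale for g in grads]
    return grads, total
```

Rescaling the whole vector (rather than clamping each component) preserves the gradient's direction, which is why it stabilizes updates without biasing them.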
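"Fused cross-entropy" here means computing log-softmax and the negative log-likelihood in one numerically stable pass, instead of materializing a full probability vector and then taking its log. A stdlib-only sketch for a single position (the function name is mine):

```python
import math

def fused_cross_entropy(logits, target):
    # subtract the max so exp() cannot overflow, compute log-sum-exp once,
    # and read off the loss without ever forming the softmax probabilities
    m = max(logits)
    logsumexp = m + math.log(sum(math.exp(l - m) for l in logits))
    return logsumexp - logits[target]
```

Skipping the intermediate probability vector saves both memory traffic and a redundant log/exp round trip in the forward and backward passes.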
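Top-k sampling with temperature can likewise be sketched without any dependencies. The defaults (`k=2`, `temperature=0.8`) are illustrative choices, not microgpt's settings:

```python
import math, random

def sample_top_k(logits, k=2, temperature=0.8, seed=0):
    # temperature rescales the logits (<1 sharpens, >1 flattens), then
    # sampling is restricted to the k highest-scoring tokens
    rng = random.Random(seed)
    scaled = [l / temperature for l in logits]
    topk = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:k]
    m = max(scaled[i] for i in topk)  # max-subtraction for numerical stability
    weights = [(i, math.exp(scaled[i] - m)) for i in topk]
    r = rng.random() * sum(w for _, w in weights)
    for i, w in weights:
        r -= w
        if r <= 0:
            return i
    return topk[-1]
```

Cutting off the low-probability tail keeps generation coherent, while the temperature knob trades determinism for diversity among the surviving candidates.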
