Microgpt
Community Article Published February 12, 2026
I tried to optimize the code of the newly released microgpt by @karpathy, because I thought some parts could be compressed further, and it turns out they can.
https://github.com/NJX-njx/microgpt
I made the following optimizations:
- Weight tying — reduces parameters and memory.
- AdamW optimizer — better generalization via decoupled weight decay.
- Cosine LR schedule — smooth, stable learning-rate decay.
- Gradient clipping — prevents exploding grads; stabilizes updates.
- Train/validation split + periodic eval — detects overfitting; monitors progress.
- Fused cross-entropy — cheaper forward/backward compute.
- Top-k sampling + temperature — more coherent and diverse generation.
- Per-step timing — quick performance and throughput insights.
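Weight tying shares a single matrix between the token embedding and the output projection, so those parameters are stored (and updated) once instead of twice. Since the microgpt source isn't reproduced here, this is a minimal pure-Python sketch; the names `wte` and `lm_head` are illustrative, not microgpt's actual identifiers:

```python
import random

vocab_size, n_embd = 8, 4

# Token-embedding matrix: one row of size n_embd per vocabulary token.
wte = [[random.gauss(0, 0.02) for _ in range(n_embd)] for _ in range(vocab_size)]

# Weight tying: the output head is the *same object*, not a copy,
# so an update through either view changes both.
lm_head = wte

def logits(hidden):
    # Project a hidden vector back to vocabulary logits with the tied matrix.
    return [sum(h * w for h, w in zip(hidden, row)) for row in lm_head]

wte[0][0] = 1.0
assert lm_head[0][0] == 1.0  # tied: the change is visible through both names
```

The aliasing is the whole trick: there is no second matrix to allocate or synchronize.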
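The "W" in AdamW is decoupled weight decay: the decay is applied directly to the parameter rather than folded into the gradient, which tends to generalize better than Adam with L2 regularization. A scalar-parameter sketch of one update (hyperparameters are illustrative, not microgpt's):

```python
import math

def adamw_step(p, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.95, eps=1e-8, wd=0.1):
    """One AdamW update for a scalar parameter p with gradient g.
    Returns (new_p, new_m, new_v)."""
    m = beta1 * m + (1 - beta1) * g          # first-moment EMA
    v = beta2 * v + (1 - beta2) * g * g      # second-moment EMA
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    p = p - lr * wd * p                      # decoupled weight decay: acts on p, not on g
    return p, m, v

# Drive a parameter toward 0 on the toy loss L = p^2 (gradient 2p).
p, m, v = 1.0, 0.0, 0.0
for t in range(1, 201):
    p, m, v = adamw_step(p, 2 * p, m, v, t)
```

Note the decay line never touches `m` or `v`; that separation from the adaptive moments is exactly what distinguishes AdamW from plain Adam with L2.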
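A cosine schedule decays the learning rate smoothly from its peak to a floor, usually after a short linear warmup. A self-contained sketch with made-up step counts and rates:

```python
import math

max_lr, min_lr = 3e-4, 3e-5
warmup_steps, max_steps = 100, 1000   # illustrative values

def get_lr(step):
    # Linear warmup from ~0 up to max_lr over warmup_steps...
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # ...then cosine decay from max_lr down to min_lr over the rest.
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    progress = min(progress, 1.0)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes 1 -> 0
    return min_lr + coeff * (max_lr - min_lr)
```

Each training step then sets the optimizer's learning rate to `get_lr(step)` before the update.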
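Gradient clipping rescales all gradients whenever their global L2 norm exceeds a threshold, capping the size of any single update. A dependency-free sketch over a flat gradient list:

```python
import math

def clip_grad_norm(grads, max_norm=1.0):
    # Scale every gradient down in place if the global L2 norm exceeds max_norm.
    total = math.sqrt(sum(g * g for g in grads))
    if total > max_norm:
        scale = max_norm / (total + 1e-6)
        grads[:] = [g * scale for g in grads]
    return total  # pre-clip norm, handy for logging spikes

grads = [3.0, 4.0]                 # global norm 5.0
pre = clip_grad_norm(grads, 1.0)   # grads now have norm ~1.0
```

Logging the returned pre-clip norm is a cheap way to spot the loss spikes this guard is protecting against.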
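The train/validation split, periodic eval, and per-step timing fit together in the training loop. A schematic sketch (the 90/10 split, eval interval, and placeholder comments are illustrative, not microgpt's actual values):

```python
import random
import time

data = list(range(1000))            # stand-in for the tokenized dataset
random.seed(42)
random.shuffle(data)
split = int(0.9 * len(data))
train_data, val_data = data[:split], data[split:]   # 90/10 split

eval_every = 50
for step in range(1, 101):
    t0 = time.time()
    # ... one forward/backward/optimizer step on a batch from train_data ...
    dt = time.time() - t0           # per-step wall-clock time
    if step % eval_every == 0:
        # ... mean loss over val_data, no gradient updates: if it rises while
        # train loss keeps falling, the model is overfitting ...
        print(f"step {step}: {dt * 1000:.2f} ms/step")
```

The key invariant is that `val_data` is never trained on, so its loss is an honest progress signal.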
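"Fusing" the cross-entropy means computing `logsumexp(logits) - logits[target]` directly instead of materializing a full softmax vector and then taking a log of one entry; it is both cheaper and numerically stable. A single-position sketch:

```python
import math

def cross_entropy(logits, target):
    # Fused log-softmax + NLL: loss = logsumexp(logits) - logits[target].
    # Subtracting the max keeps exp() from overflowing on large logits.
    mx = max(logits)
    lse = mx + math.log(sum(math.exp(l - mx) for l in logits))
    return lse - logits[target]
```

The backward pass is equally compact: the gradient w.r.t. the logits is `softmax(logits)` with 1 subtracted at the target index.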
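Top-k sampling masks everything but the k most likely tokens before sampling, and temperature rescales the logits first (below 1 sharpens the distribution, above 1 flattens it). A self-contained sketch of the combined sampler:

```python
import math
import random

def sample_top_k(logits, k=2, temperature=0.8):
    # Temperature first, then mask all but the k largest scaled logits.
    scaled = [l / temperature for l in logits]
    cutoff = sorted(scaled, reverse=True)[k - 1]
    masked = [l if l >= cutoff else float("-inf") for l in scaled]
    # Softmax over the survivors (exp(-inf) == 0.0, so masked tokens get p=0).
    mx = max(masked)
    exps = [math.exp(l - mx) for l in masked]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Inverse-CDF sampling from the truncated distribution.
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1
```

With `k=2`, only the two highest-logit tokens can ever be drawn, which is what trims the low-probability tail that makes greedy-free sampling incoherent.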
