Fantastic Pretraining Optimizers and Where to Find Them
Paper: 2509.02046
adamw 1.2b 24B

| Hyperparameter | Value |
|---|---|
| beta1 | 0.9 |
| beta2 | 0.98 |
| epsilon | 1e-10 |
| learning_rate | 0.002 |
| max_grad_norm | 2 |
| min_lr_ratio | 0.0 |
| nesterov | False |
| train_batch_size | 256 |
| warmup | 2000 |
| weight_decay | 0.2 |
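As a minimal sketch, the table's hyperparameters could be wired into a PyTorch `AdamW` optimizer as below. This is an illustration, not the paper's actual training code: the tiny `Linear` model, the constant post-warmup schedule, and the single training step are stand-ins. `nesterov = False` has no counterpart in `torch.optim.AdamW` (it applies only to momentum-SGD-style optimizers), and `train_batch_size = 256` would be handled by the data loader rather than the optimizer.

```python
import torch

# Stand-in model; the paper's architecture is not specified in this table.
model = torch.nn.Linear(16, 16)

# Optimizer hyperparameters taken directly from the table above.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=0.002,            # learning_rate
    betas=(0.9, 0.98),   # beta1, beta2
    eps=1e-10,           # epsilon
    weight_decay=0.2,    # weight_decay
)

def lr_lambda(step):
    # warmup = 2000 steps of linear warmup. The decay shape after warmup
    # is not given in the table (min_lr_ratio = 0.0 only fixes the floor),
    # so a constant schedule afterward is assumed here.
    return min(1.0, (step + 1) / 2000)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# One illustrative step: max_grad_norm = 2 corresponds to clipping the
# global gradient norm before the optimizer update.
loss = model(torch.randn(4, 16)).sum()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=2.0)
optimizer.step()
scheduler.step()
```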