NN Arch Components
• A Unified View of Attention and Residual Sinks: Outlier-Driven Rescaling is Essential for Transformer Training (arXiv:2601.22966)
• STEM: Scaling Transformers with Embedding Modules (arXiv:2601.10639)
• arXiv:2601.00417
• mHC: Manifold-Constrained Hyper-Connections (arXiv:2512.24880)
• VersatileFFN: Achieving Parameter Efficiency in LLMs via Adaptive Wide-and-Deep Reuse (arXiv:2512.14531)
• Stronger Normalization-Free Transformers (arXiv:2512.10938)
• Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free (arXiv:2505.06708)
• Transformers without Normalization (arXiv:2503.10622)
• Forgetting Transformer: Softmax Attention with a Forget Gate (arXiv:2503.02130)
• arXiv:2409.19606
• arXiv:2511.11238