FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language Paper β’ 2506.20920 β’ Published Jun 26, 2025 β’ 75
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model Paper β’ 2502.02737 β’ Published Feb 4, 2025 β’ 252
Towards Best Practices for Open Datasets for LLM Training Paper β’ 2501.08365 β’ Published Jan 14, 2025 β’ 62
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale Paper β’ 2406.17557 β’ Published Jun 25, 2024 β’ 99
A Dataset and Strong Baselines for Classification of Czech News Texts Paper β’ 2307.10666 β’ Published Jul 20, 2023