Fineweb-2
Emerging (4 papers using it)
First seen: 2025
Fineweb-2 is a benchmark used to evaluate heuristic filtering methods for curating multilingual training data for large language models.
Papers using Fineweb-2 (4)
- Enhancing Multilingual LLM Pretraining with Model-Based Data Selection
- Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models
- FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
- Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models