FineWeb
Emerging
5 papers using it
418,904 HF downloads
2,889 HF likes
First seen: 2024
FineWeb: 15 trillion tokens of the finest data the web has to offer

What is it?
The FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the datatrove library.
Hosted on Hugging Face. License: odc-by
Papers using FineWeb (5)
- FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
- FinerWeb-10BT: Refining Web Data with LLM-Based Line-Level Filtering
- The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
- Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale
- The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale