
FineWeb

Emerging
2 papers using it
426,645 HF downloads
2,889 HF likes
2026 first seen

🍷 FineWeb: 15 trillion tokens of the finest data the 🌐 web has to offer. What is it? The 🍷 FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove

Papers using FineWeb (2)