FineWeb

Emerging
5 papers using it
418,904 HF downloads
2,889 HF likes
2024 first seen

🍷 FineWeb: 15 trillion tokens of the finest data the 🌐 web has to offer.

What is it? The 🍷 FineWeb dataset consists of more than 18.5T tokens (originally 15T tokens) of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and ran on the 🏭 datatrove

Papers using FineWeb (5)
