
C4

Canonical
6 papers using it
833,248 HF downloads
598 HF likes
2024 first seen

C4 Dataset Summary: A colossal, cleaned version of Common Crawl's web crawl corpus, based on the Common Crawl dataset (https://commoncrawl.org). This is the processed version of Google's C4 dataset. We prepared five variants of the data: en, en.noclean, en.noblocklist, realnewslike, and multilingual (mC4). For reference, th
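The variant names in the summary can be used as configuration names when streaming the corpus. The sketch below is illustrative only: the Hub repository id `allenai/c4`, the `text` field, and the helper names `c4_load_kwargs` and `peek` are assumptions, not something this page states. The multilingual (mC4) variant is left out because the summary does not say how its configurations are named.

```python
# Assumption: the corpus is published on the Hugging Face Hub as "allenai/c4".
C4_REPO = "allenai/c4"

# The four English variants named in the summary above.
ENGLISH_VARIANTS = ("en", "en.noclean", "en.noblocklist", "realnewslike")


def c4_load_kwargs(variant="en", split="train"):
    """Build keyword arguments for datasets.load_dataset (hypothetical helper)."""
    if variant not in ENGLISH_VARIANTS:
        raise ValueError(
            f"unknown variant {variant!r}; expected one of {ENGLISH_VARIANTS}"
        )
    # streaming=True avoids downloading the whole corpus before reading.
    return {"path": C4_REPO, "name": variant, "split": split, "streaming": True}


def peek(variant="en", n=3):
    """Return the text of the first n records of a variant (needs network access)."""
    # Imported lazily so the helper above works without the library installed.
    from datasets import load_dataset

    stream = load_dataset(**c4_load_kwargs(variant))
    # Assumption: each record exposes its document body under the "text" key.
    return [row["text"] for _, row in zip(range(n), stream)]
```

Calling `peek("realnewslike", n=2)` would fetch two documents from the smallest English variant, which is a cheap way to inspect the data before committing to a full download.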

Papers using C4 (6)
