C4

Name: C4
License: odc-by

Canonical

6 papers using it

833,248 HF downloads

598 HF likes

First seen: 2024

C4 Dataset Summary: A colossal, cleaned version of Common Crawl's web crawl corpus, based on the Common Crawl dataset (https://commoncrawl.org). This is the processed version of Google's C4 dataset. We prepared five variants of the data: en, en.noclean, en.noblocklist, realnewslike, and multilingual (mC4). For reference, th
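As a rough sketch of how one of the five variants listed above might be streamed from the Hugging Face Hub with the `datasets` library: the repository id `allenai/c4` and the assumption that the Hub configuration names match the variant names in the summary are not stated on this page, and the helper names `c4_load_kwargs` and `stream_c4` are hypothetical.

```python
# Variant names as listed in the dataset summary above.
C4_VARIANTS = ("en", "en.noclean", "en.noblocklist", "realnewslike", "multilingual")

# Assumption: the Hub repository id is not given on this page.
C4_REPO_ID = "allenai/c4"


def c4_load_kwargs(variant, split="train"):
    """Build keyword arguments for datasets.load_dataset for one C4 variant."""
    if variant not in C4_VARIANTS:
        raise ValueError(f"unknown C4 variant {variant!r}; expected one of {C4_VARIANTS}")
    return {
        "path": C4_REPO_ID,
        "name": variant,      # assumed to match the Hub configuration name
        "split": split,
        "streaming": True,    # iterate lazily instead of downloading the whole corpus
    }


def stream_c4(variant="realnewslike", split="train"):
    """Return a streaming dataset for the chosen variant (needs network access)."""
    from datasets import load_dataset  # imported lazily; requires `pip install datasets`
    return load_dataset(**c4_load_kwargs(variant, split))
```

Streaming is chosen here because the corpus is very large; calling `next(iter(stream_c4("realnewslike")))` would fetch a single record without downloading a full variant.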

🤗 Hugging Face · ⚖ odc-by

Papers using C4 (6)

ChineseWebText 2.0: Large-scale High-quality Chinese Web Text With Multi-dimensional And Fine-grained Information · 2024 · 2 cites

GradPower: Powering Gradients for Faster Language Model Pre-Training · 2025

Layer-wise dynamic rank for compressing large language models · 2025

Strong Membership Inference Attacks on Massive Datasets and (Moderately) Large Language Models · 2025

SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights · 2025

The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training · 2025