C4
Canonical
6 papers using it
833,248 HF downloads
598 HF likes
2024 first seen
C4 Dataset Summary: A colossal, cleaned version of Common Crawl's web crawl corpus, based on the Common Crawl dataset (https://commoncrawl.org). This is the processed version of Google's C4 dataset. We prepared five variants of the data: en, en.noclean, en.noblocklist, realnewslike, and multilingual (mC4). For reference, th
Hugging Face | License: odc-by
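
The five variants named in the summary can be treated as configuration names when loading the data. The following is a minimal sketch, not the dataset authors' own loader: the repository id `allenai/c4`, the `train` split name, and the helper names `check_variant` and `stream_c4` are assumptions that do not appear in the summary. Whether each listed name (in particular `multilingual`) is accepted verbatim as a config name by the hosting repository is also an assumption. It uses `load_dataset` from the Hugging Face `datasets` library in streaming mode, so that nothing is downloaded in full.

```python
# Variant names copied from the dataset summary above.
C4_VARIANTS = ("en", "en.noclean", "en.noblocklist", "realnewslike", "multilingual")


def check_variant(name: str) -> str:
    """Return `name` unchanged if it is one of the five listed variants."""
    if name not in C4_VARIANTS:
        raise ValueError(
            f"unknown C4 variant {name!r}; expected one of {C4_VARIANTS}"
        )
    return name


def stream_c4(variant: str = "realnewslike", repo_id: str = "allenai/c4"):
    """Open one variant as a streaming dataset.

    Assumptions (not stated in the summary): the repository id, the
    "train" split name, and that `variant` is a valid config name there.
    """
    # Imported lazily so the helpers above work without the library installed.
    from datasets import load_dataset

    return load_dataset(
        repo_id,
        check_variant(variant),
        split="train",
        streaming=True,  # iterate over records without a full download
    )
```

Iterating over the object returned by `stream_c4("realnewslike")` would yield records one at a time. This requires network access and the `datasets` package.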
Papers using C4 (6)
- Chinesewebtext 2.0: Large-scale High-quality Chinese Web Text With Multi-dimensional And Fine-grained Information
- GradPower: Powering Gradients for Faster Language Model Pre-Training
- Layer-wise dynamic rank for compressing large language models
- Strong Membership Inference Attacks on Massive Datasets and (Moderately) Large Language Models
- SINQ: Sinkhorn-Normalized Quantization for Calibration-Free Low-Precision LLM Weights
- The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training