Common Crawl

Emerging

6 papers using it

2024 first seen

Common Crawl is an openly available corpus of crawled web pages. It is used mainly as a source of pre-training data for language models, typically after filtering and deduplication.


Papers using Common Crawl (6)

Understanding Data Temporality Impact on Large Language Models Pre-training · 2026

Training Language Models via Neural Cellular Automata · 2026

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language · 2025

Wasm: A Pipeline for Constructing Structured Arabic Interleaved Multimodal Corpora · 2025

TiC-LM: A Web-Scale Benchmark for Time-Continual LLM Pretraining · 2025

UnifiedCrawl: Aggregated Common Crawl for Affordable Adaptation of LLMs on Low-Resource Languages · 2024 · 1 citation