Common Crawl
Emerging · 6 papers using it
First seen: 2024
Common Crawl is a large, publicly available corpus of crawled web pages. It is widely used as a source of pre-training data for language models, and it also underpins benchmarks that evaluate them on natural language tasks.
Papers using Common Crawl (6)
- Understanding Data Temporality Impact on Large Language Models Pre-training
- Training Language Models via Neural Cellular Automata
- FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
- Wasm: A Pipeline for Constructing Structured Arabic Interleaved Multimodal Corpora
- TiC-LM: A Web-Scale Benchmark for Time-Continual LLM Pretraining
- UnifiedCrawl: Aggregated Common Crawl for Affordable Adaptation of LLMs on Low-Resource Languages