SlimPajama

Emerging

2 papers using it

14,574 HF downloads

62 HF likes

2025 first seen

Sampled version of cerebras/SlimPajama-627B. Since the original data was shuffled before chunking, I downloaded only train/chunk1 (of 10 total) and further sampled 10% of it. This should result in roughly 6B tokens, hence SlimPajama-6B. The dataset takes 24 GB of storage when decompressed (the original dataset is over 2 TB) …
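
The arithmetic behind the "roughly 6B tokens" figure can be sketched as follows. This is a minimal illustration, not the dataset author's script: the 627B total is inferred from the name SlimPajama-627B, and `estimated_tokens` and `sample_records` are hypothetical helpers. The description does not say how the 10% sample was drawn, so the independent per-record draw used here is an assumption.

```python
import random

# Assumptions, not taken from the author's code:
TOTAL_TOKENS = 627e9   # inferred from the name "SlimPajama-627B"
NUM_CHUNKS = 10        # train/chunk1 is one of 10 chunks
SAMPLE_RATE = 0.10     # the further 10% sample


def estimated_tokens(total=TOTAL_TOKENS, num_chunks=NUM_CHUNKS, rate=SAMPLE_RATE):
    """Expected token count after keeping one chunk and sampling `rate` of it."""
    return total / num_chunks * rate


def sample_records(records, rate=SAMPLE_RATE, seed=0):
    """Keep each record independently with probability `rate` (hypothetical scheme)."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [r for r in records if rng.random() < rate]


if __name__ == "__main__":
    print(f"{estimated_tokens() / 1e9:.2f}B tokens")
    kept = sample_records(range(10_000))
    print(f"kept {len(kept)} of 10000 records")
```

Because the original data was shuffled before chunking, a single chunk is treated here as representative of the whole corpus, which is why dividing by the chunk count gives a usable estimate.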

🤗 Hugging Face

Papers using SlimPajama (2)

DataFlex: A Unified Framework for Data-Centric Dynamic Training of Large Language Models (2026)

GRAPE: Optimize Data Mixture for Group Robust Multi-target Adaptive Pretraining (2025)