
SlimPajama

Emerging
2 papers using it
14,574 HF downloads
62 HF likes
2025 first seen

Sampled version of cerebras/SlimPajama-627B. Because the original data was shuffled before chunking, I downloaded only train/chunk1 (of 10 total) and then sampled a further 10% of it. This should result in roughly 6B tokens, hence SlimPajama-6B. The dataset is 24 GB in storage size when decompressed (the original dataset is over 2 TB) an
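The "roughly 6B tokens" figure can be checked with a quick back-of-the-envelope calculation. The sketch below is an illustration only: it assumes the 627B tokens implied by the source dataset's name are spread evenly over the 10 train chunks and ignores the validation and test splits. The function name `estimate_sampled_tokens` is hypothetical and not part of any dataset tooling.

```python
# Rough estimate of the sampled token count, under stated assumptions.
FULL_TOKENS = 627e9     # implied by the name SlimPajama-627B
NUM_TRAIN_CHUNKS = 10   # "train/chunk1 (of 10 total)"
SAMPLE_RATE = 0.10      # "further sampled 10%"


def estimate_sampled_tokens(full_tokens, num_chunks, sample_rate):
    """Estimate tokens kept after taking one chunk and sampling from it.

    Assumes equal-sized chunks (an assumption, not stated in the source).
    """
    tokens_in_one_chunk = full_tokens / num_chunks
    return tokens_in_one_chunk * sample_rate


estimated = estimate_sampled_tokens(FULL_TOKENS, NUM_TRAIN_CHUNKS, SAMPLE_RATE)
print(f"{estimated / 1e9:.2f}B tokens")
```

Under these assumptions the estimate comes out a little above 6B, which is consistent with the SlimPajama-6B name.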

Papers using SlimPajama (2)
