SlimPajama
Emerging, 2 papers using it
14,574 HF downloads
62 HF likes
First seen: 2025
Sampled version of cerebras/SlimPajama-627B. Because the original data was shuffled before chunking, I downloaded only train/chunk1 (of 10 total) and then sampled 10% of it. This should yield roughly 6B tokens, hence the name SlimPajama-6B. The dataset takes 24 GB of storage when decompressed (the original dataset is over 2 TB) an
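
The "roughly 6B tokens" figure can be sanity-checked with a short calculation. This is only a back-of-the-envelope sketch: it assumes the 627B in the source dataset's name is its total token count, that the 10 train chunks are about equal in size, and that 10% sampling keeps about 10% of the tokens. The function name `estimate_sampled_tokens` is hypothetical and used for illustration only.

```python
# Rough estimate of the token count after taking 1 of 10 chunks and sampling 10%.
# Assumptions (not stated in the description): the chunks are equally sized and
# the full dataset holds about 627B tokens, as its name suggests.

FULL_TOKENS = 627e9   # taken from the name SlimPajama-627B
TRAIN_CHUNKS = 10     # "train/chunk1 (of 10 total)"
SAMPLE_RATE = 0.10    # "further sampled 10%"


def estimate_sampled_tokens(full_tokens: float, n_chunks: int, sample_rate: float) -> float:
    """Tokens kept after selecting one chunk and sampling a fraction of it."""
    tokens_per_chunk = full_tokens / n_chunks
    return tokens_per_chunk * sample_rate


estimate = estimate_sampled_tokens(FULL_TOKENS, TRAIN_CHUNKS, SAMPLE_RATE)
print(f"{estimate / 1e9:.2f}B tokens")
```

Under these assumptions the sample is about 1% of the original, which is consistent with the 6B in the dataset's name.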