
The Pile

Canonical
3 papers using it
2024 first seen

An 825 GB diverse open text corpus (22 sources) for training large language models.

Papers using The Pile (3)
