Awesome Large Language Models
The Pile
Canonical
Papers using it: 3
First seen: 2024
An 825 GB diverse open text corpus (22 sources) for training large language models.
Papers using The Pile (3)
ChineseWebText 2.0: Large-Scale High-Quality Chinese Web Text with Multi-Dimensional and Fine-Grained Information
2024 · 2 cites
Late-to-Early Training: LET LLMs Learn Earlier, So Faster and Better
2026
UniPool: A Globally Shared Expert Pool for Mixture-of-Experts
2026