The Stack
Canonical
7 papers using it
20,071 HF downloads
1,020 HF likes
2023 first seen
Dataset Card for The Stack

Changelog (release: description)
v1.0: Initial release of the Stack. Included 30 programming languages and 18 permissive licenses. Note: Three included licenses (MPL/EPL/LGPL) are considered weak copyleft licenses. The resulting near-deduplicated dataset is 3 TB in size.
v1.1: The three copyleft li
Hugging Face · other
Papers using The Stack (7)
- What Makes Code Generation Ethically Sourced?
- StarCoder: may the source be with you!
- SantaCoder: don't reach for the stars!
- Knowledge Transfer from High-Resource to Low-Resource Programming Languages for Code LLMs
- Decoding Data Quality via Synthetic Corruptions: Embedding-guided Pruning of Code Data
- Kotlin ML Pack: Technical Report
- StarCoder: may the source be with you!