FineWeb-Edu
Emerging · 4 papers using it
462,571 HF downloads
1,150 HF likes
2025 first seen
FineWeb-Edu: 1.3 trillion tokens of the finest educational data the web has to offer.
Paper: https://arxiv.org/abs/2406.17557
What is it? The FineWeb-Edu dataset consists of educational web pages filtered from the FineWeb dataset. It comes in two versions: 1.3T tokens, and 5.4T tokens (FineWeb-Edu-score-2). This is the 1.3 trillion version.
Hugging Face · odc-by
Papers using FineWeb-Edu (4)
- Kronecker Embeddings: Byte-Level Structured Token Representations for Parameter-Efficient Language Models
- OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration
- Attention Needs to Focus: A Unified Perspective on Attention Allocation
- FiNERweb: Datasets and Artifacts for Scalable Multilingual Named Entity Recognition