
Wikipedia

Emerging
2 papers using it
120,656 HF downloads
645 HF likes
2025 first seen

Wikipedia dataset containing cleaned articles in all languages. The datasets are built from the Wikipedia dumps (https://dumps.wikimedia.org/), with one split per language. Each example contains the content of one full Wikipedia article, cleaned to strip markup and unwanted sections (references, etc.).
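
To make the "unwanted sections" step concrete, here is a minimal sketch of how such cleaning could work. It is not the dataset's actual pipeline: the function name `strip_unwanted_sections`, the `UNWANTED_SECTIONS` list, and the assumption that articles use wiki-style `== Heading ==` lines are all illustrative choices, not details given in the description.

```python
import re

# Hypothetical list of section titles to drop; the real pipeline's list is not given here.
UNWANTED_SECTIONS = {"references", "external links", "see also"}

# Matches wiki-style headings such as "== References ==" or "=== Notes ===".
HEADING_RE = re.compile(r"^(={2,6})\s*(.+?)\s*\1\s*$")


def strip_unwanted_sections(wikitext: str) -> str:
    """Drop unwanted sections (and their subsections) from wiki-style text."""
    kept = []
    skip_level = None  # heading level of the section currently being skipped
    for line in wikitext.splitlines():
        match = HEADING_RE.match(line)
        if match:
            level = len(match.group(1))
            title = match.group(2).strip().lower()
            if skip_level is not None and level > skip_level:
                continue  # subsection of a section that is being skipped
            if title in UNWANTED_SECTIONS:
                skip_level = level  # start skipping at this heading
                continue
            skip_level = None  # a wanted heading ends any skipped section
        elif skip_level is not None:
            continue  # body line inside a skipped section
        kept.append(line)
    return "\n".join(kept)


article = (
    "Intro text.\n"
    "== History ==\n"
    "Some history.\n"
    "== References ==\n"
    "* ref 1\n"
    "=== Notes ===\n"
    "note\n"
    "== Legacy ==\n"
    "Kept."
)
cleaned = strip_unwanted_sections(article)
print(cleaned)
```

In this sketch a skipped section ends at the next heading of the same or a higher level, so nested subsections under "References" are dropped along with it.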

Papers using Wikipedia (1)

Wikipedia - datasets - learning-to-hash