WebFAQ: A Multilingual Collection Of Natural Q&A Datasets For Dense Retrieval
2025 · Michael Dinzinger, Laura Caspari, Kanishka Ghosh Dastidar, et al.
Abstract
We present WebFAQ, a large-scale collection of open-domain question answering datasets derived from FAQ-style schema.org annotations. In total, the data collection consists of 96 million natural question-answer (QA) pairs across 75 languages, including 47 million (49%) non-English samples. WebFAQ further serves as the foundation for 20 monolingual retrieval benchmarks with a total size of 11.2 million QA pairs (5.9 million non-English). These datasets are carefully curated through refined filtering and near-duplicate detection, yielding high-quality resources for training and evaluating multilingual dense retrieval models. To empirically confirm WebFAQ's efficacy, we use the collected QAs to fine-tune an in-domain pretrained XLM-RoBERTa model. Through this process of dataset-specific fine-tuning, the model achieves significant retrieval performance gains, which generalize beyond WebFAQ to other multilingual retrieval benchmarks evaluated in a zero-shot setting. Last but not least, we
Related papers
- Towards Universal Dense Retrieval For Open-domain Question Answering (2021)
- Visr-bench: An Empirical Study On Visual Retrieval-augmented Generation For Multilingual Long Document Understanding (2025)
- A Systematic Study Of Retrieval Pipeline Design For Retrieval-augmented Medical Question Answering (2026)
- QAEA-DR: A Unified Text Augmentation Framework For Dense Retrieval (2024)
- Nepali Passport Question Answering: A Low-resource Dataset For Public Service Applications (2026)
- An Interactive Multi-modal Query Answering System With Retrieval-augmented Large Language Models (2024)
- Learning To Retrieve Passages Without Supervision (2021)
- Fine-grained Late-interaction Multi-modal Retrieval For Retrieval Augmented Visual Question Answering (2023)