Nepali Passport Question Answering: A Low-resource Dataset For Public Service Applications
2026 · Funghang Limbu Begha, Praveen Acharya, Bal Krishna Bal
Abstract
Building an effective information retrieval (IR) system for Nepali, a low-resource language, is difficult because annotated data and computational linguistic resources are lacking. In this study, we attempt to address this gap by preparing a Nepali dataset of question-answer pairs. We focus on Frequently Asked Questions (FAQs) for passport-related services and build a dataset for training and evaluating IR models. We fine-tune transformer-based embedding models for semantic similarity in question-answer retrieval and compare them with a BM25 baseline. In addition, we implement a hybrid retrieval approach that integrates the fine-tuned models with BM25, and we evaluate its performance. Our results show that the fine-tuned SBERT-based models outperform BM25, and that the multilingual E5 embedding-based models achieve the highest retrieval performance among all evaluated models.
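The abstract does not say how the BM25 and embedding scores are combined in the hybrid setup. As a rough illustration only, the sketch below assumes one common choice: min-max normalize both score lists and take a weighted sum. The function names, the weight `alpha`, the English placeholder tokens, and the two-dimensional toy vectors are all hypothetical; in practice the dense vectors would come from a fine-tuned SBERT or multilingual E5 model applied to Nepali text.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document for a tokenized query."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            denom = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * c for a, c in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(c * c for c in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def min_max(values):
    """Rescale scores to [0, 1] so lexical and dense scores are comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def hybrid_rank(lexical, dense, alpha=0.5):
    """Weighted sum of normalized scores (assumed fusion rule, not the paper's)."""
    fused = [alpha * d + (1 - alpha) * s
             for s, d in zip(min_max(lexical), min_max(dense))]
    order = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
    return order, fused


# Toy FAQ questions (placeholder tokens) and made-up embedding vectors.
docs = [["passport", "renewal", "fee"],
        ["visa", "application", "form"],
        ["passport", "lost", "report"]]
doc_vecs = [[0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
query, query_vec = ["passport", "renewal"], [1.0, 0.0]

lexical = bm25_scores(query, docs)
dense = [cosine(query_vec, v) for v in doc_vecs]
ranking, fused = hybrid_rank(lexical, dense)
print(ranking)
```

With `alpha=0.5` both signals count equally; moving `alpha` toward 1 favors the embedding model, and moving it toward 0 favors BM25.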
Related papers
- WebFAQ: A Multilingual Collection Of Natural Q&A Datasets For Dense Retrieval (2025)
- VisR-Bench: An Empirical Study On Visual Retrieval-augmented Generation For Multilingual Long Document Understanding (2025)
- Optimized Text Embedding Models And Benchmarks For Amharic Passage Retrieval (2025)
- A Systematic Study Of Retrieval Pipeline Design For Retrieval-augmented Medical Question Answering (2026)
- Enhancing Question Answering Precision With Optimized Vector Retrieval And Instructions (2024)
- Text Embeddings For Retrieval From A Large Knowledge Base (2018)
- Pre-training Tasks For Embedding-based Large-scale Retrieval (2020)
- Amharicir+instr: A Two-dataset Resource For Neural Retrieval And Instruction Tuning (2026)