AmharicIR+Instr: A Two-Dataset Resource For Neural Retrieval And Instruction Tuning
2026 · Tilahun Yeshambel, Moncef Garouani, Josiane Mothe
Abstract
Neural retrieval and GPT-style generative models rely on large, high-quality supervised data, which is still scarce for low-resource languages such as Amharic. We release an Amharic data resource consisting of two datasets that supports research on (i) neural retrieval-ranking and (ii) instruction-following text generation. The retrieval-ranking dataset contains 1,091 manually verified query-positive-negative document triplets drawn from diverse Amharic sources and constructed to support contrastive training and benchmarking of neural retrievers (e.g., DPR, ColBERT-style late interaction and SPLADE-style sparse neural retrieval). Triplets are created through a combination of expert-curated queries, web-derived queries, and LLM-assisted generation, with positive/negative documents selected from the web or synthesized by LLMs and then validated by native speakers. The instruction prompt-response dataset comprises 6,285 Amharic prompt-response pairs spanning multiple domains and instructi
Related papers
- Optimized Text Embedding Models And Benchmarks For Amharic Passage Retrieval (2025)
- Don't Retrieve, Generate: Prompting LLMs For Synthetic Training Data In Dense Retrieval (2025)
- Noisy Self-Training With Synthetic Queries For Dense Retrieval (2023)
- Less Is More: Adapting Text Embeddings For Low-Resource Languages With Small Scale Noisy Synthetic Data (2026)
- Mr. TyDi: A Multi-lingual Benchmark For Dense Retrieval (2021)
- mFollowIR: A Multilingual Benchmark For Instruction Following In Retrieval (2025)
- Boosting Data Utilization For Multilingual Dense Retrieval (2025)
- Team IELAB At TREC Clinical Trial Track 2023: Enhancing Clinical Trial Retrieval With Neural Rankers And Large Language Models (2024)