Optimized Text Embedding Models And Benchmarks For Amharic Passage Retrieval
2025 · Kidist Amde Mekonnen, Yosef Worku Alemneh, Maarten de Rijke
Abstract
Neural retrieval methods using transformer-based pre-trained language models have advanced multilingual and cross-lingual retrieval. However, their effectiveness for low-resource, morphologically rich languages such as Amharic remains underexplored due to data scarcity and suboptimal tokenization. We address this gap by introducing Amharic-specific dense retrieval models based on pre-trained Amharic BERT and RoBERTa backbones. Our proposed RoBERTa-Base-Amharic-Embed model (110M parameters) achieves a 17.6% relative improvement in MRR@10 and a 9.86% gain in Recall@10 over the strongest multilingual baseline, Arctic Embed 2.0 (568M parameters). More compact variants, such as RoBERTa-Medium-Amharic-Embed (42M), remain competitive while being over 13x smaller. Additionally, we train a ColBERT-based late interaction retrieval model that achieves the highest MRR@10 score (0.843) among all evaluated models. We benchmark our proposed models against both sparse and dense retrieval baselines to
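The abstract reports results in MRR@10, Recall@10, and relative improvement over a baseline. The sketch below shows how these quantities are commonly computed; it is not the authors' evaluation code. The function names, the dictionary-based input format, and the toy query and passage IDs are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def mrr_at_k(rankings: Dict[str, List[str]],
             relevant: Dict[str, Set[str]],
             k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant passage in the top k."""
    if not rankings:
        return 0.0
    total = 0.0
    for qid, ranked in rankings.items():
        gold = relevant.get(qid, set())
        for rank, pid in enumerate(ranked[:k], start=1):
            if pid in gold:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings)


def recall_at_k(rankings: Dict[str, List[str]],
                relevant: Dict[str, Set[str]],
                k: int = 10) -> float:
    """Fraction of relevant passages found in the top k, averaged over queries."""
    scores = []
    for qid, ranked in rankings.items():
        gold = relevant.get(qid, set())
        if not gold:
            continue  # queries without judgments are skipped (an assumption)
        scores.append(len(gold & set(ranked[:k])) / len(gold))
    return sum(scores) / len(scores) if scores else 0.0


def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Hypothetical toy data: two queries, one of which has no relevant hit.
rankings = {"q1": ["p3", "p1", "p2"], "q2": ["p5", "p4"]}
relevant = {"q1": {"p1"}, "q2": {"p9"}}

mrr = mrr_at_k(rankings, relevant)
recall = recall_at_k(rankings, relevant)
```

A relative improvement is measured against the baseline score, so it differs from the absolute difference between the two scores.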
Related papers
- Turkcolbert: A Benchmark Of Dense And Late-interaction Models For Turkish Information Retrieval (2025)
- Amharicir+instr: A Two-dataset Resource For Neural Retrieval And Instruction Tuning (2026)
- Transfer Learning Approaches For Building Cross-language Dense Retrieval Models (2022)
- LLM-augmented Retrieval: Enhancing Retrieval Models Through Language Models And Doc-level Embedding (2024)
- Arctic-embed 2.0: Multilingual Retrieval Without Compromise (2024)
- Less Is More: Adapting Text Embeddings For Low-resource Languages With Small Scale Noisy Synthetic Data (2026)
- Arctic-embed: Scalable, Efficient, And Accurate Text Embedding Models (2024)
- Aggretriever: A Simple Approach To Aggregate Textual Representations For Robust Dense Passage Retrieval (2022)