Less Is More: Adapting Text Embeddings For Low-resource Languages With Small Scale Noisy Synthetic Data
2026 · Zaruhi Navasardyan, Spartak Bughdaryan, Bagrat Minasyan, et al.
Abstract
Low-resource languages (LRLs) often lack high-quality, large-scale datasets for training effective text embedding models, hindering their application in tasks like retrieval-augmented generation (RAG) and semantic search. In this work, we challenge the prevailing assumption that effective semantic alignment requires massive datasets or pristine, human-verified translations. Focusing on Armenian (an LRL with a unique script), we introduce a cost-effective adaptation strategy using small-scale, noisy synthetic data generated by translating English Reddit title-body pairs with open-weights models. We establish a comprehensive evaluation benchmark comprising existing datasets, translated data, and a manually curated dataset. Our experiments reveal a surprising "Less is More" phenomenon: fine-tuning a multilingual encoder (mE5) on just 10,000 noisy synthetic pairs yields 11-12% average improvements across the benchmark, with a 20%+ relative improvement in retrieval performance, matching the p
Related papers
- Less Is More: Pre-train A Strong Text Encoder For Dense Retrieval Using A Weak Decoder (2021)
- LMAR: Language Model Augmented Retriever For Domain-specific Knowledge Indexing (2025)
- Optimized Text Embedding Models And Benchmarks For Amharic Passage Retrieval (2025)
- LowCLIP: Adapting The CLIP Model Architecture For Low-resource Languages In Multimodal Image Retrieval Task (2024)
- Multi-lingual Malaysian Embedding: Leveraging Large Language Models For Semantic Representations (2024)
- Don't Retrieve, Generate: Prompting LLMs For Synthetic Training Data In Dense Retrieval (2025)
- Rethinking Hybrid Retrieval: When Small Embeddings And LLM Re-ranking Beat Bigger Models (2025)
- Noisy Self-training With Synthetic Queries For Dense Retrieval (2023)