TurkColBERT: A Benchmark of Dense and Late-Interaction Models for Turkish Information Retrieval
2025 · Özay Ezerceli, Mahmoud El Hussieni, Selva Taş, et al.
Abstract
Neural information retrieval systems excel in high-resource languages but remain underexplored for morphologically rich, lower-resource languages such as Turkish. Dense bi-encoders currently dominate Turkish IR, yet late-interaction models -- which retain token-level representations for fine-grained matching -- have not been systematically evaluated. We introduce TurkColBERT, the first comprehensive benchmark comparing dense encoders and late-interaction models for Turkish retrieval. Our two-stage adaptation pipeline fine-tunes English and multilingual encoders on Turkish NLI/STS tasks, then converts them into ColBERT-style retrievers using PyLate trained on MS MARCO-TR. We evaluate 10 models across five Turkish BEIR datasets covering scientific, financial, and argumentative domains. Results show strong parameter efficiency: the 1.0M-parameter colbert-hash-nano-tr is 600\(\times\) smaller than the 600M turkish-e5-large dense encoder while preserving over 71% of its average mAP. Late-in
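The token-level matching that the abstract attributes to late-interaction models can be illustrated with the MaxSim scoring rule used by ColBERT-style retrievers: each query token keeps its best-matching document token, and the per-token maxima are summed. The sketch below is a minimal pure-Python illustration, not the paper's code or PyLate's API; the helper names (`l2_normalize`, `dot`, `maxsim_score`) and the 2-dimensional toy embeddings are assumptions made up for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens, doc_tokens):
    """Late-interaction score: for each query token, take the highest
    similarity to any document token, then sum over query tokens."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy token embeddings (hypothetical values, not taken from any model).
query = [l2_normalize(v) for v in ([1.0, 0.0], [0.0, 1.0])]
doc_a = [l2_normalize(v) for v in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])]
doc_b = [l2_normalize(v) for v in ([1.0, 1.0],)]

score_a = maxsim_score(query, doc_a)  # every query token has an exact match
score_b = maxsim_score(query, doc_b)  # every query token matches only partially

print(score_a, score_b)
```

In this toy case `doc_a` scores higher than `doc_b` because it contains a separate token aligned with each query token. A dense bi-encoder, by contrast, would compare a single pooled vector per query and per document.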