Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever
2024 · Rohan Jha, Bo Wang, Michael Günther, et al.
Abstract
Multi-vector dense models, such as ColBERT, have proven highly effective in information retrieval. ColBERT's late interaction scoring approximates the joint query-document attention seen in cross-encoders while maintaining inference efficiency closer to traditional dense retrieval models, thanks to its bi-encoder architecture and recent optimizations in indexing and search. In this work, we propose a number of incremental improvements to the ColBERT model architecture and training pipeline, using methods shown to work in the more mature single-vector embedding model training paradigm, particularly those that apply to heterogeneous multilingual data or boost efficiency with little tradeoff. Our new model, Jina-ColBERT-v2, demonstrates strong performance across a range of English and multilingual retrieval tasks.
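The late interaction scoring mentioned in the abstract can be illustrated with a minimal sketch of ColBERT-style MaxSim: each query token vector is matched to its most similar document token vector, and those maxima are summed. The function names and toy two-dimensional vectors below are hypothetical and chosen for illustration only; this is not the Jina-ColBERT-v2 implementation, which the abstract does not describe.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def maxsim_score(query_vecs, doc_vecs):
    """Late interaction score: for each query token vector, take the best
    cosine similarity against any document token vector, then sum.
    Assumes both inputs are non-empty lists of equal-length vectors."""
    q = [l2_normalize(v) for v in query_vecs]
    d = [l2_normalize(v) for v in doc_vecs]
    score = 0.0
    for qv in q:
        score += max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
    return score


# Toy example (hypothetical embeddings): two query tokens, two candidate documents.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]  # has a close match for each query token
doc_b = [[1.0, 1.0]]                          # one token, a partial match for both

ranked = sorted(
    [("doc_a", maxsim_score(query, doc_a)), ("doc_b", maxsim_score(query, doc_b))],
    key=lambda pair: pair[1],
    reverse=True,
)
```

Because query and document vectors are encoded independently and only combined in this cheap scoring step, document vectors can be precomputed and indexed, which is where the efficiency relative to cross-encoders comes from.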
Related papers
- ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction (2021)
- ColBERT-Att: Late-Interaction Meets Attention for Enhanced Retrieval (2026)
- Introducing Neural Bag of Whole-Words with ColBERTer: Contextualized Late Interactions Using Enhanced Reduction (2022)
- ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT (2020)
- Transfer Learning Approaches for Building Cross-Language Dense Retrieval Models (2022)
- ColBERT-XM: A Modular Multi-Vector Representation Model for Zero-Shot Multilingual Information Retrieval (2024)
- PyLate: Flexible Training and Retrieval for Late Interaction Models (2025)
- TurkColBERT: A Benchmark of Dense and Late-Interaction Models for Turkish Information Retrieval (2025)