LANGSAE EDITING: Improving Multilingual Information Retrieval Via Post-hoc Language Identity Removal
2026 · Dongjun Kim, Jeongho Yoon, Chanjun Park, et al.
Abstract
Dense retrieval in multilingual settings often searches over mixed-language collections, yet multilingual embeddings encode language identity alongside semantics. This language signal can inflate similarity for same-language pairs and crowd out relevant evidence written in other languages. We propose LANGSAE EDITING, a post-hoc sparse autoencoder trained on pooled embeddings that enables controllable removal of language-identity signal directly in vector space. The method identifies language-associated latent units using cross-language activation statistics, suppresses these units at inference time, and reconstructs embeddings in the original dimensionality, making it compatible with existing vector databases without retraining the base encoder or re-encoding raw text. Experiments across multiple languages show consistent improvements in ranking quality and cross-language coverage, with especially strong gains for script-distinct languages.
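The identify, suppress, and reconstruct steps described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation. The single-layer ReLU autoencoder (`TinySAE`), the variance-of-means score used to pick language-associated units, the `strength` knob, and the final L2 normalization are all assumptions made for the example, and the weights here are random rather than trained.

```python
import numpy as np


class TinySAE:
    """Hypothetical one-layer sparse autoencoder over pooled embeddings.

    Weights are randomly initialized for illustration only; in practice they
    would be trained to reconstruct embeddings under a sparsity penalty.
    """

    def __init__(self, d_model, d_latent, seed=0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.normal(scale=0.1, size=(d_model, d_latent))
        self.b_enc = np.zeros(d_latent)
        self.W_dec = rng.normal(scale=0.1, size=(d_latent, d_model))
        self.b_dec = np.zeros(d_model)

    def encode(self, x):
        # ReLU keeps latent activations non-negative.
        return np.maximum((x - self.b_dec) @ self.W_enc + self.b_enc, 0.0)

    def decode(self, z):
        # Map latents back to the original embedding dimensionality.
        return z @ self.W_dec + self.b_dec


def select_language_units(sae, emb_by_lang, top_k):
    """Rank latents by how unevenly they fire across languages.

    Assumed statistic: variance, across languages, of each latent's mean
    activation. The paper's actual cross-language statistic may differ.
    """
    means = np.stack([sae.encode(e).mean(axis=0) for e in emb_by_lang.values()])
    scores = means.var(axis=0)
    return np.argsort(scores)[::-1][:top_k]


def edit_embedding(sae, x, units, strength=1.0):
    """Scale down the selected latents, then decode and L2-normalize.

    strength=1.0 zeroes the selected units; strength=0.0 leaves them intact.
    """
    z = sae.encode(x)
    z_edit = z.copy()
    z_edit[..., units] *= 1.0 - strength
    x_edit = sae.decode(z_edit)
    norm = np.linalg.norm(x_edit, axis=-1, keepdims=True)
    return x_edit / (norm + 1e-12)
```

Because `edit_embedding` returns vectors with the same dimensionality as its input, the edited embeddings could be written back to an existing vector index without re-encoding the raw text, which is the compatibility property the abstract emphasizes.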
Related papers
- Learning Retrieval Models With Sparse Autoencoders (2026) · 0.00
- Interpret And Control Dense Retrieval With Sparse Latent Features (2024) · 2.26
- Boosting Data Utilization For Multilingual Dense Retrieval (2025) · 0.00
- CLEAR: Cross-lingual Enhancement In Alignment Via Reverse-training (2026) · 0.78
- LLM-augmented Retrieval: Enhancing Retrieval Models Through Language Models And Doc-level Embedding (2024) · 0.00
- CSPLADE: Learned Sparse Retrieval With Causal Language Models (2025) · 0.00
- LUSIFER: Language Universal Space Integration For Enhanced Multilingual Embeddings With Large Language Models (2025) · 0.00
- Blind To Position, Biased In Language: Probing Mid-layer Representational Bias In Vision-language Encoders For Zero-shot Language-grounded Spatial Understanding (2025) · 0.00