Visual Words Meet BM25: Sparse Auto-encoder Visual Word Scoring For Image Retrieval

2026


Abstract

Dense image retrieval is accurate but offers limited interpretability and attribution, and it can be compute-intensive at scale. We present BM25-V, which applies Okapi BM25 scoring to sparse visual-word activations from a Sparse Auto-Encoder (SAE) on Vision Transformer patch features. Across a large gallery, visual-word document frequencies are highly imbalanced and follow a Zipfian-like distribution, making BM25's inverse document frequency (IDF) weighting well suited for suppressing ubiquitous, low-information words and emphasizing rare, discriminative ones. BM25-V retrieves high-recall candidates via sparse inverted-index operations and serves as an efficient first-stage retriever for dense reranking. Across seven benchmarks, BM25-V achieves Recall@200 \(\geq\) 0.993, enabling a two-stage pipeline that reranks only \(K = 200\) candidates per query and recovers near-dense accuracy within 0.2% on average. An SAE trained once on ImageNet-1K transfers zero-shot to seven
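To make the first-stage scoring concrete, the following is a minimal sketch of Okapi BM25 over an inverted index of visual words. It is an illustration under stated assumptions, not the paper's implementation: each image is assumed to be a bag of visual-word ids with integer term frequencies (for example, the number of patches on which an SAE unit fires), the IDF uses the common smoothed form \(\ln\left(\frac{N - \mathrm{df} + 0.5}{\mathrm{df} + 0.5} + 1\right)\), and `k1 = 1.2`, `b = 0.75` are conventional defaults. The names `build_index` and `bm25_search` and the toy gallery are hypothetical.

```python
import math
from collections import defaultdict


def build_index(gallery):
    """Build an inverted index from a list of {visual_word_id: term_frequency} dicts.

    ASSUMPTION: term frequency is a per-image count of a visual word;
    the abstract does not specify how the paper defines it.
    """
    postings = defaultdict(list)  # word id -> [(image id, tf), ...]
    lengths = []                  # total visual-word count per image
    for img_id, words in enumerate(gallery):
        lengths.append(sum(words.values()))
        for word, tf in words.items():
            postings[word].append((img_id, tf))
    avg_len = sum(lengths) / len(lengths)
    return postings, lengths, avg_len


def bm25_search(query, postings, lengths, avg_len, top_k=200, k1=1.2, b=0.75):
    """Score only images that share a visual word with the query; return the top_k."""
    n = len(lengths)
    scores = defaultdict(float)
    for word in query:
        plist = postings.get(word)
        if not plist:
            continue
        df = len(plist)
        # Smoothed, non-negative IDF: near zero for words present in almost every image.
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1.0)
        for img_id, tf in plist:
            # Length normalisation damps images with many active words.
            norm = k1 * (1.0 - b + b * lengths[img_id] / avg_len)
            scores[img_id] += idf * tf * (k1 + 1.0) / (tf + norm)
    # Sort by descending score, breaking ties by image id for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


# Toy gallery (hypothetical data): word 0 is ubiquitous, word 2 is rare.
gallery = [
    {0: 3, 1: 1},
    {0: 2, 2: 2},
    {0: 4, 3: 1},
    {0: 1, 2: 1, 4: 2},
]
postings, lengths, avg_len = build_index(gallery)
candidates = bm25_search({0: 1, 2: 1}, postings, lengths, avg_len, top_k=2)
candidate_ids = [img_id for img_id, _ in candidates]
```

In this toy gallery the two images containing the rare word 2 outrank those that match only on the ubiquitous word 0, which is the IDF effect the abstract describes. In a two-stage pipeline, `candidate_ids` would be the short list handed to a dense reranker.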

