MSTAR: Box-free Multi-query Scene Text Retrieval With Attention Recycling
2025 · Liang Yin, Xudong Xie, Zhang Li, et al.
Abstract
Scene text retrieval has made significant progress with the assistance of accurate text localization. However, existing approaches typically require costly bounding box annotations for training. Moreover, most adopt a customized retrieval strategy and struggle to unify different query types to meet diverse retrieval needs. To address these issues, we introduce Multi-query Scene Text retrieval with Attention Recycling (MSTAR), a box-free approach for scene text retrieval. It incorporates progressive vision embedding to dynamically capture the multi-grained representation of texts and harmonizes free-style text queries with style-aware instructions. Additionally, a multi-instance matching module is integrated to enhance vision-language alignment. Furthermore, we build the Multi-Query Text Retrieval (MQTR) dataset, the first benchmark designed to evaluate the multi-query scene text retrieval capability of models, comprising four query types and 16k images. Extensive experiments de
Related papers
- StacMR: Scene-text Aware Cross-modal Retrieval (2020) 10.48
- Scene Text Retrieval Via Joint Text Detection And Similarity Learning (2021) 16.16
- Multi-query Video Retrieval (2022) 9.59
- SA-Person: Text-based Person Retrieval With Scene-aware Re-ranking (2025) 0.00
- Multi-modal Reasoning Graph For Scene-text Based Fine-grained Image Classification And Retrieval (2020) 11.29
- Monster: A Unified Model For Motion, Scene, Text Retrieval (2025) 0.00
- Focus, Distinguish, And Prompt: Unleashing CLIP For Efficient And Flexible Scene Text Retrieval (2024) 8.80
- MuMUR: Multilingual Multimodal Universal Retrieval (2022) 2.26