StacMR: Scene-text Aware Cross-modal Retrieval
2020 · Andrés Mafla, Rafael Sampaio de Rezende, Lluís Gómez, et al.
Abstract
Recent models for cross-modal retrieval have benefited from an increasingly rich understanding of visual scenes, afforded by scene graphs and object interactions, to mention a few. This has resulted in an improved matching between the visual representation of an image and the textual representation of its caption. Yet, current visual representations overlook a key aspect: the text appearing in images, which may contain crucial information for retrieval. In this paper, we first propose a new dataset that allows exploration of cross-modal retrieval where images contain scene-text instances. Then, armed with this dataset, we describe several approaches which leverage scene text, including a better scene-text aware cross-modal retrieval method which uses specialized representations for text from the captions and text from the visual scene, and reconciles them in a common embedding space. Extensive experiments confirm that cross-modal retrieval approaches benefit from scene text and highlight
Related papers
- Scene-centric Vs. Object-centric Image-text Cross-modal Retrieval: A Reproducibility Study (2023)
- Beyond Visual Semantics: Exploring The Role Of Scene Text In Image Understanding (2019)
- Scene Text Retrieval Via Joint Text Detection And Similarity Learning (2021)
- MSTAR: Box-free Multi-query Scene Text Retrieval With Attention Recycling (2025)
- Multi-modal Reasoning Graph For Scene-text Based Fine-grained Image Classification And Retrieval (2020)
- Scene Graph Based Image Retrieval: A Case Study On The CLEVR Dataset (2019)
- SA-Person: Text-based Person Retrieval With Scene-aware Re-ranking (2025)
- ViSTA: Vision And Scene Text Aggregation For Cross-modal Retrieval (2022)