DualCap: Enhancing Lightweight Image Captioning Via Dual Retrieval With Similar Scenes Visual Prompts
2025 · Binbin Li, Guimiao Yang, Zisen Qi, et al.
Abstract
Recent lightweight retrieval-augmented image captioning models often use retrieved data solely as text prompts, leaving the original visual features unenhanced and thereby creating a semantic gap, particularly for object details or complex scenes. To address this limitation, we propose DualCap, a novel approach that enriches the visual representation by generating a visual prompt from retrieved similar images. Our model employs a dual retrieval mechanism, using standard image-to-text retrieval for text prompts and a novel image-to-image retrieval to source visually analogous scenes. Specifically, salient keywords and phrases are derived from the captions of visually similar scenes to capture key objects and similar details. These textual features are then encoded and integrated with the original image features through a lightweight, trainable feature fusion network. Extensive experiments demonstrate that our method achieves competitive performance while requiring fewer trainable pa
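The dual retrieval step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine-similarity ranking, the frequency-based keyword picker, the stopword list, and all function and parameter names (`cosine`, `top_k`, `dual_retrieve`, `neighbor_captions`) are assumptions made for illustration. Embeddings are assumed to be precomputed vectors in a shared space, and the trainable feature fusion network is not covered.

```python
import math
from collections import Counter

# Hypothetical stopword list, used only to make the keyword step concrete.
STOPWORDS = {"a", "an", "the", "on", "in", "of", "and", "with", "is"}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, candidates, k):
    """Indices of the k candidates most similar to the query."""
    order = sorted(
        range(len(candidates)),
        key=lambda i: cosine(query, candidates[i]),
        reverse=True,
    )
    return order[:k]


def dual_retrieve(image_emb, caption_embs, captions,
                  neighbor_image_embs, neighbor_captions, k=2, n_keywords=3):
    """Return (text_prompt_captions, keywords) for one query image.

    Branch 1 (image-to-text): rank datastore captions against the image.
    Branch 2 (image-to-image): rank datastore images against the image,
    then pick frequent non-stopword terms from their captions.
    """
    text_prompt = [captions[i] for i in top_k(image_emb, caption_embs, k)]

    similar_images = top_k(image_emb, neighbor_image_embs, k)
    counts = Counter(
        word
        for i in similar_images
        for word in neighbor_captions[i].lower().split()
        if word not in STOPWORDS
    )
    keywords = [word for word, _ in counts.most_common(n_keywords)]
    return text_prompt, keywords
```

In this sketch the first return value would feed the text prompt, while the keywords stand in for the "salient keywords and phrases" that the abstract says are encoded and fused with the image features.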
Related papers
- Towards Retrieval-augmented Architectures For Image Captioning (2024) · 9.41
- Dual Prompt Learning For Adapting Vision-language Models To Downstream Image-text Retrieval (2025) · 0.00
- Understanding Retrieval Robustness For Retrieval-augmented Image Captioning (2024) · 6.34
- Caption-matching: A Multimodal Approach For Cross-domain Image Retrieval (2024) · 0.00
- Retrieval-augmented Image Captioning (2023) · 11.29
- Adapting Dual-encoder Vision-language Models For Paraphrased Retrieval (2024) · 0.00
- Deep Image Representations Using Caption Generators (2017) · 0.00
- PC\(^2\): Pseudo-classification Based Pseudo-captioning For Noisy Correspondence Learning In Cross-modal Retrieval (2024) · 9.23