ReCap: Event-Aware Image Captioning with Article Retrieval and Semantic Gaussian Normalization
2025 · Thinh-Phuc Nguyen, Thanh-Hai Nguyen, Gia-Huy Dinh, et al.
Abstract
Image captioning systems often produce generic descriptions that fail to capture event-level semantics, which are crucial for applications like news reporting and digital archiving. We present ReCap, a novel pipeline for event-enriched image retrieval and captioning that incorporates broader contextual information from relevant articles to generate narrative-rich, factually grounded captions. Our approach addresses the limitations of standard vision-language models that typically focus on visible content while missing temporal, social, and historical contexts. ReCap comprises three integrated components: (1) a robust two-stage article retrieval system using DINOv2 embeddings with global feature similarity for initial candidate selection followed by patch-level mutual nearest neighbor similarity re-ranking; (2) a context extraction framework that synthesizes information from article summaries, generic captions, and original source metadata; and (3) a large language model-based caption ge
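The two-stage retrieval described in component (1) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes the DINOv2 global and patch embeddings have already been computed and are passed in as NumPy arrays, and it scores re-ranking by counting mutual nearest-neighbor patch pairs, which is only one plausible reading of the abstract. The function names and the parameter `k` are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def global_top_k(query_global, db_global, k):
    """Stage 1: pick the k database images closest to the query by
    cosine similarity of their global embeddings."""
    q = l2_normalize(query_global)
    db = l2_normalize(db_global)
    sims = db @ q                      # shape: (num_images,)
    k = min(k, len(sims))
    order = np.argsort(-sims)[:k]      # highest similarity first
    return order, sims[order]


def mutual_nn_score(query_patches, cand_patches):
    """Count patch pairs that are each other's nearest neighbor.
    Using the raw count as the score is an assumption of this sketch."""
    q = l2_normalize(query_patches)
    c = l2_normalize(cand_patches)
    sim = q @ c.T                      # (num_query_patches, num_cand_patches)
    nn_q_to_c = sim.argmax(axis=1)     # best candidate patch per query patch
    nn_c_to_q = sim.argmax(axis=0)     # best query patch per candidate patch
    q_idx = np.arange(sim.shape[0])
    mutual = nn_c_to_q[nn_q_to_c] == q_idx
    return int(mutual.sum())


def two_stage_retrieve(query_global, query_patches, db_global, db_patches, k=10):
    """Stage 1 selects candidates by global similarity; stage 2 re-ranks
    them by patch-level mutual nearest-neighbor count."""
    candidates, _ = global_top_k(query_global, db_global, k)
    scores = [mutual_nn_score(query_patches, db_patches[i]) for i in candidates]
    # A stable sort keeps the stage-1 order among candidates with equal scores.
    order = np.argsort(-np.asarray(scores), kind="stable")
    return [int(candidates[i]) for i in order]
```

Keeping the cheap global comparison as a filter means the more expensive patch-to-patch comparison only runs on `k` candidates rather than the whole database.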
Related papers
- Event-retriever: Event-aware Multimodal Image Retrieval For Realistic Captions (2025)
- Zse-cap: A Zero-shot Ensemble For Image Retrieval And Prompt-guided Captioning (2025)
- Towards Retrieval-augmented Architectures For Image Captioning (2024)
- Knowledge Completes The Vision: A Multimodal Entity-aware Retrieval-augmented Generation Framework For News Image Captioning (2025)
- Linear Alignment Of Vision-language Models For Image Captioning (2023)
- Dualcap: Enhancing Lightweight Image Captioning Via Dual Retrieval With Similar Scenes Visual Prompts (2025)
- Large Language Models For Captioning And Retrieving Remote Sensing Images (2024)
- Deep Image Representations Using Caption Generators (2017)