Distinctive Image Captioning: Leveraging Ground Truth Captions In CLIP Guided Reinforcement Learning
2024 · Antoine Chaffin, Ewa Kijak, Vincent Claveau
Abstract
Training image captioning models with teacher forcing results in very generic samples, whereas more distinctive captions can be very useful in retrieval applications or for producing alternative texts that describe images for accessibility. Reinforcement Learning (RL) makes it possible to use the cross-modal retrieval similarity score between the generated caption and the input image as a reward to guide training, leading to more distinctive captions. Recent studies show that pre-trained cross-modal retrieval models can provide this reward, completely eliminating the need for reference captions. However, we argue in this paper that Ground Truth (GT) captions can still be useful in this RL framework. We propose a new training strategy for image captioning models that makes use of GT captions in different ways. Firstly, they can be used to train a simple MLP discriminator that serves as a regularizer, preventing reward hacking and ensuring the fluency of generated captions, resulting in a textual G
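The reward idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the embeddings are plain lists standing in for the outputs of a pre-trained cross-modal model such as CLIP, `fluency_score` stands in for the output of the MLP discriminator, and the additive combination, the `reg_weight` parameter, the `baseline` argument, and all function names are assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def caption_reward(image_emb, caption_emb, fluency_score, reg_weight=0.5):
    """Retrieval similarity plus a weighted discriminator score.

    ASSUMPTION: the additive form and reg_weight are illustrative only.
    fluency_score is a hypothetical discriminator output in [0, 1].
    """
    similarity = cosine_similarity(image_emb, caption_emb)
    return similarity + reg_weight * fluency_score


def reinforce_loss(token_log_probs, reward, baseline=0.0):
    """REINFORCE surrogate loss for one sampled caption.

    Minimizing this loss raises the log-probability of captions whose
    reward exceeds the baseline and lowers it otherwise.
    """
    advantage = reward - baseline
    return -advantage * sum(token_log_probs)
```

In a real setup the token log-probabilities would come from the captioning model and the loss would be averaged over a batch of sampled captions. The baseline is only there to reduce the variance of the gradient estimate.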
Related papers
- Modeling Caption Diversity In Contrastive Vision-language Pretraining (2024)
- CLIPS: An Enhanced CLIP Framework For Learning With Synthetic Captions (2024)
- Linear Alignment Of Vision-language Models For Image Captioning (2023)
- LLM2CLIP: Powerful Language Model Unlocks Richer Cross-modality Representation (2024)
- Robust Cross-modal Representation Learning With Progressive Self-distillation (2022)
- Captured By Captions: On Memorization And Its Mitigation In CLIP Models (2025)
- Deep Image Representations Using Caption Generators (2017)
- DreamLIP: Language-image Pre-training With Long Captions (2024)