Linear Alignment Of Vision-language Models For Image Captioning
2023 · Fabian Paischer, Markus Hofmarcher, Sepp Hochreiter, et al.
Abstract
Recently, vision-language models like CLIP have advanced the state of the art in a variety of multi-modal tasks including image captioning and caption evaluation. Many approaches leverage CLIP for cross-modal retrieval to condition pre-trained language models on visual input. However, CLIP generally suffers from a mis-alignment of image and text modalities in the joint embedding space. We investigate efficient methods to linearly re-align the joint embedding space for the downstream task of image captioning. This leads to an efficient training protocol that merely requires computing a closed-form solution for a linear mapping in the joint CLIP space. Consequently, we propose a lightweight captioning method called ReCap, which can be trained up to 1000 times faster than existing lightweight methods. Moreover, we propose two new learning-based image-captioning metrics built on CLIP score along with our proposed alignment. We evaluate ReCap on MS-COCO, Flickr30k, VizWiz and MSRVTT. On the
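The "closed-form solution for a linear mapping" mentioned in the abstract can be illustrated with a minimal sketch. It assumes the mapping is fit by ridge-regularized least squares from image embeddings to text embeddings, which is one possible closed-form choice and not necessarily the exact objective used for ReCap. The function names (`fit_linear_alignment`, `cosine_sim`), the `ridge` parameter, and the synthetic data standing in for CLIP embeddings are all hypothetical.

```python
import numpy as np


def fit_linear_alignment(image_emb, text_emb, ridge=1e-3):
    """Closed-form W minimising ||image_emb @ W - text_emb||^2 + ridge * ||W||^2."""
    d = image_emb.shape[1]
    # Normal equations with a small ridge term for numerical stability.
    gram = image_emb.T @ image_emb + ridge * np.eye(d)
    return np.linalg.solve(gram, image_emb.T @ text_emb)


def cosine_sim(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


# Synthetic stand-ins for paired CLIP embeddings (assumption: real embeddings
# would come from a CLIP image encoder and text encoder instead).
rng = np.random.default_rng(0)
n, d = 512, 32
text_emb = rng.normal(size=(n, d))
# Simulate a misaligned image space: a random rotation of the text space plus noise.
rotation, _ = np.linalg.qr(rng.normal(size=(d, d)))
image_emb = text_emb @ rotation + 0.01 * rng.normal(size=(n, d))

# Fit the mapping in one linear solve; no gradient-based training is involved.
W = fit_linear_alignment(image_emb, text_emb)
aligned = image_emb @ W

# Cross-modal retrieval: for each mapped image, pick the most similar caption.
retrieved = cosine_sim(aligned, text_emb).argmax(axis=1)
accuracy = float((retrieved == np.arange(n)).mean())

# Mean similarity of true image-caption pairs before and after alignment.
sim_before = float(np.mean(np.diag(cosine_sim(image_emb, text_emb))))
sim_after = float(np.mean(np.diag(cosine_sim(aligned, text_emb))))
```

Because the fit reduces to a single linear solve over a d-by-d system, its cost is small compared with gradient-based fine-tuning, which is consistent with the efficiency argument in the abstract. In a retrieval-based captioning pipeline, the mapped image embedding would then be used to look up captions that condition a language model.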
Related papers
- CLIP-Lite: Information Efficient Visual Representation Learning With Language Supervision (2021)
- Optimizing CLIP Models For Image Retrieval With Maintained Joint-embedding Alignment (2024)
- CLIPS: An Enhanced CLIP Framework For Learning With Synthetic Captions (2024)
- LightCLIP: Learning Multi-level Interaction For Lightweight Vision-language Models (2023)
- LLM2CLIP: Powerful Language Model Unlocks Richer Cross-modality Representation (2024)
- Advancing Myopia To Holism: Fully Contrastive Language-image Pre-training (2024)
- ContextCLIP: Contextual Alignment Of Image-text Pairs On CLIP Visual Representations (2022)
- Robust Cross-modal Representation Learning With Progressive Self-distillation (2022)