Infusing Fine-grained Visual Knowledge To Vision-language Models
2025 · Nikolaos-Antonios Ypsilantis, Kaifeng Chen, André Araujo, et al.
Abstract
Large-scale contrastive pre-training produces powerful Vision-and-Language Models (VLMs) capable of generating representations (embeddings) effective for a wide variety of visual and multimodal tasks. However, these pretrained embeddings remain suboptimal for fine-grained open-set visual retrieval, where state-of-the-art results require fine-tuning the vision encoder using annotated domain-specific samples. Naively performing such fine-tuning typically leads to catastrophic forgetting, severely diminishing the model's general-purpose visual and cross-modal capabilities. In this work, we propose a fine-tuning method explicitly designed to achieve optimal balance between fine-grained domain adaptation and retention of the pretrained VLM's broad multimodal knowledge. Drawing inspiration from continual learning literature, we systematically analyze standard regularization techniques aimed at knowledge retention and propose an efficient and effective combination strategy. Additionally, we
Related papers
- Finevit: Progressively Unlocking Fine-grained Perception With Dense Recaptions (2026) 0.00
- CAVL: Learning Contrastive And Adaptive Representations Of Vision And Language (2023) 0.00
- Understanding Retrieval-augmented Task Adaptation For Vision-language Models (2024) 0.00
- A Little More Like This: Text-to-image Retrieval With Vision-language Models Using Relevance Feedback (2025) 0.00
- Coarse-to-fine Vision-language Pre-training With Fusion In The Backbone (2022) 12.05
- VLMo: Unified Vision-language Pre-training With Mixture-of-modality-experts (2021) 6.34
- Come-vl: Scaling Complementary Multi-encoder Vision-language Learning (2026) 0.00
- Unsupervised Vision-and-language Pre-training Via Retrieval-based Multi-granular Alignment (2022) 10.48