Cross-modal Adapter: Parameter-efficient Transfer Learning Approach For Vision-language Models
2024 · Juncheng Yang, Zuchao Li, Shuai Xie, et al.
Abstract
Adapter-based parameter-efficient transfer learning has achieved exciting results in vision-language models. Traditional adapter methods often require training or fine-tuning, facing challenges such as insufficient samples or resource limitations. While some methods overcome the need for training by leveraging image modality cache and retrieval, they overlook the text modality's importance and cross-modal cues for the efficient adaptation of parameters in visual-language models. This work introduces a cross-modal parameter-efficient approach named XMAdapter. XMAdapter establishes cache models for both text and image modalities. It then leverages retrieval through visual-language bimodal information to gather clues for inference. By dynamically adjusting the affinity ratio, it achieves cross-modal fusion, decoupling different modal similarities to assess their respective contributions. Additionally, it explores hard samples based on differences in cross-modal affinity and enhances model
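The cache-and-retrieve idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the fusion rule, so the code assumes a fixed scalar `ratio` that blends image-cache and text-cache affinities (the paper adjusts this ratio dynamically), and it borrows the exponential cache weighting commonly used in training-free cache adapters. The function names, the `alpha` and `beta` parameters, and the idea that each cached sample has one image embedding and one text embedding are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def bimodal_cache_logits(query, image_keys, text_keys, values, ratio=0.5, beta=5.5):
    """Retrieve class evidence from an image cache and a text cache.

    query:      (B, D) image features of the test samples
    image_keys: (N, D) cached image features of the few-shot samples
    text_keys:  (N, D) cached text features for the same samples (assumed)
    values:     (N, C) one-hot labels of the cached samples
    ratio:      weight of the image affinity; (1 - ratio) goes to the text affinity
    """
    q = l2_normalize(query)
    image_affinity = q @ l2_normalize(image_keys).T  # (B, N)
    text_affinity = q @ l2_normalize(text_keys).T    # (B, N)
    # Assumed fusion rule: a convex combination of the two affinities.
    fused = ratio * image_affinity + (1.0 - ratio) * text_affinity
    # Sharpen the affinities so close cache entries dominate the vote.
    weights = np.exp(-beta * (1.0 - fused))
    return weights @ values                          # (B, C)


def predict(query, class_text_feats, image_keys, text_keys, values,
            ratio=0.5, alpha=1.0, beta=5.5):
    """Add the cache evidence to zero-shot logits from class text features."""
    zero_shot = 100.0 * l2_normalize(query) @ l2_normalize(class_text_feats).T
    cache = bimodal_cache_logits(query, image_keys, text_keys, values, ratio, beta)
    return zero_shot + alpha * cache


# Toy usage with random features: 3 classes, 2 cached samples per class.
rng = np.random.default_rng(0)
dim, num_classes, shots = 8, 3, 2
labels = np.repeat(np.arange(num_classes), shots)
values = np.eye(num_classes)[labels]
image_keys = rng.normal(size=(num_classes * shots, dim))
text_keys = rng.normal(size=(num_classes * shots, dim))
class_text_feats = rng.normal(size=(num_classes, dim))
test_features = rng.normal(size=(4, dim))

logits = predict(test_features, class_text_feats, image_keys, text_keys, values, ratio=0.6)
print(logits.argmax(axis=1))
```

Setting `ratio=1.0` reduces the sketch to an image-only cache, which makes it easy to compare how much the text affinity changes a prediction.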
Related papers
- Uniadapter: Unified Parameter-efficient Transfer Learning For Cross-modal Modeling (2023)
- Mv-adapter: Multimodal Video Transfer Learning For Video Text Retrieval (2023)
- Multiway-adapater: Adapting Large-scale Multi-modal Models For Scalable Image-text Retrieval (2023)
- Ucdr-adapter: Exploring Adaptation Of Pre-trained Vision-language Models For Universal Cross-domain Retrieval (2024)
- Queryadapter: Rapid Adaptation Of Vision-language Models In Response To Natural Language Queries (2025)
- Understanding Retrieval-augmented Task Adaptation For Vision-language Models (2024)
- Dynamic Adapter With Semantics Disentangling For Cross-lingual Cross-modal Retrieval (2024)
- Efficient And Versatile Robust Fine-tuning Of Zero-shot Models (2024)