Efficient And Versatile Robust Fine-tuning Of Zero-shot Models
2024 · Sungyeon Kim, Boseung Jeong, Donghyun Kim, et al.
Abstract
Large-scale image-text pre-trained models enable zero-shot classification and provide consistent accuracy across various data distributions. Nonetheless, optimizing these models in downstream tasks typically requires fine-tuning, which reduces generalization to out-of-distribution (OOD) data and demands extensive computational resources. We introduce Robust Adapter (R-Adapter), a novel method for fine-tuning zero-shot models to downstream tasks while simultaneously addressing both these issues. Our method integrates lightweight modules into the pre-trained model and employs novel self-ensemble techniques to boost OOD robustness and reduce storage expenses substantially. Furthermore, we propose MPM-NCE loss designed for fine-tuning on vision-language downstream tasks. It ensures precise alignment of multiple image-text pairs and discriminative feature learning. By extending the benchmark for robust fine-tuning beyond classification to include diverse tasks such as cross-modal retrieval
Related papers
- Uniadapter: Unified Parameter-efficient Transfer Learning For Cross-modal Modeling (2023) · 3.77
- Multiway-adapater: Adapting Large-scale Multi-modal Models For Scalable Image-text Retrieval (2023) · 0.00
- Cross-modal Adapter: Parameter-efficient Transfer Learning Approach For Vision-language Models (2024) · 6.77
- Ucdr-adapter: Exploring Adaptation Of Pre-trained Vision-language Models For Universal Cross-domain Retrieval (2024) · 4.52
- M2-RAAP: A Multi-modal Recipe For Advancing Adaptation-based Pre-training Towards Effective And Efficient Zero-shot Video-text Retrieval (2024) · 6.76
- Mv-adapter: Multimodal Video Transfer Learning For Video Text Retrieval (2023) · 9.76
- Fitclip: Refining Large-scale Pretrained Image-text Models For Zero-shot Video Understanding Tasks (2022) · 1.91
- Modality And Task Adaptation For Enhanced Zero-shot Composed Image Retrieval (2024) · 0.00