MixGen: A New Multi-modal Data Augmentation
2022 · Xiaoshuai Hao, Yi Zhu, Srikar Appalaraju, et al.
Abstract
Data augmentation is a necessity to enhance data efficiency in deep learning. For vision-language pre-training, previous works augment data only for images or only for text. In this paper, we present MixGen: a joint data augmentation for vision-language representation learning to further improve data efficiency. It generates new image-text pairs with semantic relationships preserved by interpolating images and concatenating text. It is simple and can be plugged into existing pipelines. We evaluate MixGen on four architectures, including CLIP, ViLT, ALBEF and TCL, across five downstream vision-language tasks to show its versatility and effectiveness. For example, adding MixGen in ALBEF pre-training leads to absolute performance improvements on downstream tasks: image-text retrieval (+6.2% on COCO fine-tuned and +5.3% on Flickr30K zero-shot), visual grounding (+0.9% on RefCOCO+), visual reasoning (+0.9% on NLVR2), visual question answering (+0.3% on VQA2.0), and visua
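The abstract describes the core operation as interpolating two images and concatenating their captions. The following is a minimal sketch of that idea in Python with NumPy, not the authors' implementation. The function name `mixgen_pair`, the mixing weight `lam` and its default of 0.5, and the single-space separator between captions are all assumptions made for illustration.

```python
import numpy as np


def mixgen_pair(image_a, text_a, image_b, text_b, lam=0.5):
    """Combine two image-text pairs into one new pair (illustrative sketch).

    image_a, image_b: arrays of identical shape, e.g. (H, W, C).
    lam: assumed mixing weight in [0, 1]; the source does not state a value here.
    """
    if image_a.shape != image_b.shape:
        raise ValueError("images must have the same shape to be interpolated")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")

    # Linear interpolation of pixel values.
    mixed_image = (
        lam * image_a.astype(np.float32)
        + (1.0 - lam) * image_b.astype(np.float32)
    )
    # Caption concatenation (separator is an assumption).
    mixed_text = f"{text_a} {text_b}"
    return mixed_image, mixed_text


# Toy usage with constant images so the result is easy to inspect.
img_a = np.zeros((2, 2, 3), dtype=np.float32)
img_b = np.ones((2, 2, 3), dtype=np.float32)
new_img, new_text = mixgen_pair(img_a, "a dog on grass", img_b, "a red car")
```

With `lam=0.5` every pixel of `new_img` is the average of the two inputs, and `new_text` carries both captions, so the new pair still describes what is visible in the blended image.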
Related papers
- Cross-modal Attribute Insertions For Assessing The Robustness Of Vision-and-language Learning (2023) 2.00
- A Feature-space Multimodal Data Augmentation Technique For Text-video Retrieval (2022) 12.43
- Paired Cross-modal Data Augmentation For Fine-grained Image-to-text Retrieval (2022) 8.09
- MLLMs-augmented Visual-language Representation Learning (2023) 0.00
- FLEX-CLIP: Feature-level Generation Network Enhanced CLIP For X-shot Cross-modal Retrieval (2024) 0.00
- RealSyn: An Effective And Scalable Multimodal Interleaved Document Transformation Paradigm (2025) 3.20
- Data-efficient Multimodal Fusion On A Single GPU (2023) 10.00
- Look, Imagine And Match: Improving Textual-visual Cross-modal Retrieval With Generative Models (2017) 18.52