12-in-1: Multi-task Vision And Language Representation Learning
2019 · Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, et al.
Abstract
Much of vision-and-language research focuses on a small but diverse set of independent tasks and supporting datasets often studied in isolation; however, the visually-grounded language understanding skills required for success at these tasks overlap significantly. In this work, we investigate these relationships between vision-and-language tasks by developing a large-scale, multi-task training regime. Our approach culminates in a single model on 12 datasets from four broad categories of task including visual question answering, caption-based image retrieval, grounding referring expressions, and multi-modal verification. Compared to independently trained single-task models, this represents a reduction from approximately 3 billion parameters to 270 million while simultaneously improving performance by 2.05 points on average across tasks. We use our multi-task framework to perform in-depth analysis of the effect of joint training diverse tasks. Further, we show that finetuning task-specif
Related papers
- ViLBERT: Pretraining Task-agnostic Visiolinguistic Representations For Vision-and-language Tasks (2019) 0.00
- CAVL: Learning Contrastive And Adaptive Representations Of Vision And Language (2023) 0.00
- Is Multimodal Vision Supervision Beneficial To Language? (2023) 0.00
- VLM2Vec: Training Vision-language Models For Massive Multimodal Embedding Tasks (2024) 0.00
- FAME-ViL: Multi-tasking Vision-language Model For Heterogeneous Fashion Tasks (2023) 15.69
- RAVEN: Multitask Retrieval Augmented Vision-language Learning (2024) 0.00
- Multilingual Diversity Improves Vision-language Representations (2024) 2.26
- UC2: Universal Cross-lingual Cross-modal Vision-and-language Pre-training (2021) 13.05