UFO: A Unified Transformer For Vision-language Representation Learning
2021 · Jianfeng Wang, Xiaowei Hu, Zhe Gan, et al.
Abstract
In this paper, we propose a single UniFied transfOrmer (UFO), which is capable of processing either unimodal inputs (e.g., image or language) or multimodal inputs (e.g., the concatenation of the image and the question), for vision-language (VL) representation learning. Existing approaches typically design an individual network for each modality and/or a specific fusion network for multimodal tasks. To simplify the network architecture, we use a single transformer network and enforce multi-task learning during VL pre-training, which includes the image-text contrastive loss, image-text matching loss, and masked language modeling loss based on the bidirectional and the seq2seq attention mask. The same transformer network is used as the image encoder, the text encoder, or the fusion network in different pre-training tasks. Empirically, we observe less conflict among different tasks and achieve new state of the arts on visual question answering, COCO image captioning (cross-entropy optimiza
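The image-text contrastive loss named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so this assumes the commonly used symmetric, temperature-scaled form (image-to-text and text-to-image cross-entropy over cosine similarities). The function names, the temperature value of 0.07, and the use of plain Python lists in place of transformer outputs are all illustrative assumptions, not details from the paper.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def image_text_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Assumption: image_embs[i] and text_embs[i] come from the same image-text
    pair; every other combination in the batch is treated as a negative.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    logits_transposed = [list(column) for column in zip(*logits)]
    image_to_text = cross_entropy_diagonal(logits)
    text_to_image = cross_entropy_diagonal(logits_transposed)
    return 0.5 * (image_to_text + text_to_image)


# Toy batch: matched pairs should score a lower loss than shuffled pairs.
aligned_loss = image_text_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                           [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = image_text_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                            [[0.0, 1.0], [1.0, 0.0]])
```

In UFO, per the abstract, the same transformer would produce both sets of embeddings when run as the image encoder and as the text encoder; the sketch covers only the loss computed on those outputs.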
Related papers
- Unicoder-VL: A Universal Encoder For Vision And Language By Cross-modal Pre-training (2019) 20.24
- Unifying Vision-language Representation Space With Single-tower Transformer (2022) 7.16
- VLDeformer: Vision-language Decomposed Transformer For Fast Cross-modal Retrieval (2021) 10.21
- UC2: Universal Cross-lingual Cross-modal Vision-and-language Pre-training (2021) 13.05
- Universal Vision-language Dense Retrieval: Learning A Unified Representation Space For Multi-modal Retrieval (2022) 3.45
- MULE: Multimodal Universal Language Embedding (2019) 9.03
- EVE: Efficient Vision-language Pre-training With Masked Prediction And Modality-aware MoE (2023) 7.50
- VLMo: Unified Vision-language Pre-training With Mixture-of-modality-experts (2021) 6.34