mPLUG: Effective and Efficient Vision-Language Learning by Cross-Modal Skip-Connections
2022 · Chenliang Li, Haiyang Xu, Junfeng Tian, et al.
Abstract
Large-scale pre-trained foundation models have been an emerging paradigm for building artificial intelligence (AI) systems, which can be quickly adapted to a wide range of downstream tasks. This paper presents mPLUG, a new vision-language foundation model for both cross-modal understanding and generation. Most existing pre-trained models suffer from the problems of low computational efficiency and information asymmetry brought by the long visual sequence in cross-modal alignment. To address these problems, mPLUG introduces an effective and efficient vision-language architecture with novel cross-modal skip-connections, which create inter-layer shortcuts that skip a certain number of layers for time-consuming full self-attention on the vision side. mPLUG is pre-trained end-to-end on large-scale image-text pairs with both discriminative and generative objectives. It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text
Related papers
- LLM2CLIP: Powerful Language Model Unlocks Richer Cross-modality Representation (2024) · 2.26
- M2-Encoder: Advancing Bilingual Image-text Understanding By Large-scale Efficient Pretraining (2024) · 0.00
- MLLMs-augmented Visual-language Representation Learning (2023) · 0.00
- Leveraging Data To Say No: Memory Augmented Plug-and-play Selective Prediction (2026) · 0.78
- MULE: Multimodal Universal Language Embedding (2019) · 9.03
- Unicoder-VL: A Universal Encoder For Vision And Language By Cross-modal Pre-training (2019) · 20.24
- Hyperdimensional Cross-modal Alignment Of Frozen Language And Image Models For Efficient Image Captioning (2026) · 0.00
- COTS: Collaborative Two-stream Vision-language Pre-training Model For Cross-modal Retrieval (2022) · 13.60