EVE: Efficient Vision-language Pre-training With Masked Prediction And Modality-aware MoE
2023 · Junyi Chen, Longteng Guo, Jia Sun, et al.
Abstract
Building scalable vision-language models to learn from diverse, multimodal data remains an open challenge. In this paper, we introduce an Efficient Vision-languagE foundation model, namely EVE, which is one unified multimodal Transformer pre-trained solely by one unified pre-training task. Specifically, EVE encodes both vision and language within a shared Transformer network integrated with modality-aware sparse Mixture-of-Experts (MoE) modules, which capture modality-specific information by selectively switching to different experts. To unify pre-training tasks of vision and language, EVE performs masked signal modeling on image-text pairs to reconstruct masked signals, i.e., image pixels and text tokens, given visible signals. This simple yet effective pre-training objective accelerates training by 3.5x compared to the model pre-trained with Image-Text Contrastive and Image-Text Matching losses. Owing to the combination of the unified architecture and pre-training task, EVE is easy t
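The masked signal modeling objective described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names, the mask ratios, and the equal weighting of the two loss terms are all assumptions made for the example. The only detail taken from the abstract is that masked image pixels and masked text tokens are reconstructed from the visible signals, so each loss is computed over masked positions only.

```python
import math
import random


def random_mask(length, ratio, rng):
    """Return booleans marking which positions are masked (True = masked).

    The ratio is a free parameter here; the abstract does not state one.
    """
    n_masked = max(1, round(length * ratio))
    masked = set(rng.sample(range(length), n_masked))
    return [i in masked for i in range(length)]


def masked_pixel_loss(pred, target, mask):
    """Mean squared error per patch, averaged over masked patches only."""
    errors = [
        sum((p - t) ** 2 for p, t in zip(pred[i], target[i])) / len(target[i])
        for i in range(len(target))
        if mask[i]
    ]
    return sum(errors) / len(errors)


def masked_token_loss(logits, target_ids, mask):
    """Cross-entropy averaged over masked text positions only."""
    losses = []
    for i, is_masked in enumerate(mask):
        if not is_masked:
            continue
        row = logits[i]
        row_max = max(row)
        # Numerically stable log-sum-exp for the softmax normaliser.
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        losses.append(log_z - row[target_ids[i]])
    return sum(losses) / len(losses)


def masked_signal_loss(pixel_pred, pixel_target, image_mask,
                       token_logits, token_ids, text_mask):
    """Sum of both reconstruction terms (equal weighting is an assumption)."""
    return (masked_pixel_loss(pixel_pred, pixel_target, image_mask)
            + masked_token_loss(token_logits, token_ids, text_mask))


if __name__ == "__main__":
    rng = random.Random(0)
    image_mask = random_mask(4, 0.75, rng)   # hypothetical image mask ratio
    text_mask = random_mask(4, 0.5, rng)     # hypothetical text mask ratio
    patches = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
    zeros = [[0.0, 0.0] for _ in patches]
    uniform_logits = [[0.0] * 5 for _ in range(4)]
    token_ids = [1, 3, 0, 2]
    print(masked_signal_loss(zeros, patches, image_mask,
                             uniform_logits, token_ids, text_mask))
```

In this sketch a single scalar drives both modalities, which is one way to read "one unified pre-training task"; a real model would produce `pixel_pred` and `token_logits` from the shared Transformer rather than from fixed lists.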
Related papers
- VLMo: Unified Vision-language Pre-training With Mixture-of-modality-experts (2021)
- VLMAE: Vision-language Masked Autoencoder (2022)
- Unicoder-VL: A Universal Encoder For Vision And Language By Cross-modal Pre-training (2019)
- E-ViLM: Efficient Video-language Model Via Masked Video Modeling With Semantic Vector-quantized Tokenizer (2023)
- Come-vl: Scaling Complementary Multi-encoder Vision-language Learning (2026)
- Unifying Vision-language Representation Space With Single-tower Transformer (2022)
- VLM2Vec: Training Vision-language Models For Massive Multimodal Embedding Tasks (2024)
- UFO: A Unified Transformer For Vision-language Representation Learning (2021)