Coarse-to-fine Vision-language Pre-training With Fusion In The Backbone
2022 · Zi-Yi Dou, Aishwarya Kamath, Zhe Gan, et al.
Abstract
Vision-language (VL) pre-training has recently received considerable attention. However, most existing end-to-end pre-training approaches either only aim to tackle VL tasks such as image-text retrieval, visual question answering (VQA) and image captioning that test high-level understanding of images, or only target region-level understanding for tasks such as phrase grounding and object detection. We present FIBER (Fusion-In-the-Backbone-based transformER), a new VL model architecture that can seamlessly handle both these types of tasks. Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model by inserting cross-attention into the image and text backbones, bringing gains in terms of memory and performance. In addition, unlike previous work that is either only pre-trained on image-text data or on fine-grained data with box-level annotations, we present a two-stage pre-training strategy that uses both thes
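The fusion-in-the-backbone idea described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the single-head attention without learned projections, the omitted feed-forward sublayers, the scalar gate `alpha`, and the names `attend`, `fused_layer`, and `fused_backbones` are all assumptions made for illustration. The only point it shows is that each backbone layer attends to the other modality inside the layer, rather than handing its output to a separate fusion stack afterwards.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, context):
    """Scaled dot-product attention with no learned projections (a simplification)."""
    dim = len(queries[0])
    keys_t = [list(col) for col in zip(*context)]      # transpose: dim x n_context
    scores = matmul(queries, keys_t)                   # n_queries x n_context
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    return matmul(weights, context)                    # n_queries x dim


def fused_layer(own, other, alpha):
    """One backbone layer: residual self-attention, then gated cross-attention.

    `alpha` is a hypothetical scalar gate; alpha = 0 reduces the layer to a
    purely uni-modal one. Feed-forward sublayers are omitted for brevity.
    """
    self_out = attend(own, own)
    own = [[o + s for o, s in zip(o_row, s_row)] for o_row, s_row in zip(own, self_out)]
    cross_out = attend(own, other)                     # queries from own modality
    return [[o + alpha * c for o, c in zip(o_row, c_row)]
            for o_row, c_row in zip(own, cross_out)]


def fused_backbones(image_tokens, text_tokens, num_layers=2, alpha=0.5):
    """Run both backbones in lockstep, exchanging information at every layer."""
    for _ in range(num_layers):
        new_image = fused_layer(image_tokens, text_tokens, alpha)
        new_text = fused_layer(text_tokens, image_tokens, alpha)
        image_tokens, text_tokens = new_image, new_text
    return image_tokens, text_tokens


# Toy inputs: 3 image patches and 2 text tokens, each a 2-dimensional vector.
image = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [[0.5, -0.5], [2.0, 1.0]]
fused_image, fused_text = fused_backbones(image, text)
```

Setting `alpha=0.0` in this sketch makes the image output independent of the text input, which is a convenient way to see exactly where the cross-modal signal enters.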
Related papers
- Infusing Fine-grained Visual Knowledge To Vision-language Models (2025) · 0.00
- Finevit: Progressively Unlocking Fine-grained Perception With Dense Recaptions (2026) · 0.00
- VLDeformer: Vision-language Decomposed Transformer For Fast Cross-modal Retrieval (2021) · 10.21
- FuseLIP: Multimodal Embeddings Via Early Fusion Of Discrete Tokens (2025) · 0.00
- VLMo: Unified Vision-language Pre-training With Mixture-of-modality-experts (2021) · 6.34
- Unsupervised Vision-and-language Pre-training Via Retrieval-based Multi-granular Alignment (2022) · 10.48
- Come-vl: Scaling Complementary Multi-encoder Vision-language Learning (2026) · 0.00
- FashionFAE: Fine-grained Attributes Enhanced Fashion Vision-language Pre-training (2024) · 0.00