FuseLIP: Multimodal Embeddings Via Early Fusion Of Discrete Tokens
2025 · Christian Schlarmann, Francesco Croce, Nicolas Flammarion, et al.
Abstract
Contrastive language-image pre-training aligns the features of text-image pairs in a common latent space via distinct encoders for each modality. While this approach achieves impressive performance in several zero-shot tasks, it cannot natively handle multimodal inputs, i.e., encoding image and text into a single feature vector. As a remedy, it is common practice to use additional modules to merge the features extracted by the unimodal encoders. In this work, we present FuseLIP, an alternative architecture for multimodal embedding. Leveraging recent progress in discrete image tokenizers, we propose to use a single transformer model which operates on an extended vocabulary of text and image tokens. This early fusion approach allows the different modalities to interact at each depth of encoding and obtain richer representations compared to common late fusion. We collect new datasets for multimodal pre-training and evaluation, designing challenging tasks for multimodal encoder models. We
Related papers
- Data-efficient Multimodal Fusion On A Single GPU (2023) 10.00
- The More, The Merrier: Contrastive Fusion For Higher-order Multimodal Alignment (2025) 0.00
- Breaking The Modality Barrier: Universal Embedding Learning With Multimodal LLMs (2025) 4.52
- Generating Images With Multimodal Language Models (2023) 6.77
- ITO: Images And Texts As One Via Synergizing Multiple Alignment And Training-time Fusion (2026) 0.00
- Everything At Once -- Multi-modal Fusion Transformer For Video Retrieval (2021) 15.78
- Come-vl: Scaling Complementary Multi-encoder Vision-language Learning (2026) 0.00
- Coarse-to-fine Vision-language Pre-training With Fusion In The Backbone (2022) 12.05