Grounding Language Models To Images For Multimodal Inputs And Outputs
2023 · Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried
Abstract
We propose an efficient method to ground pretrained text-only language models to the visual domain, enabling them to process arbitrarily interleaved image-and-text data, and generate text interleaved with retrieved images. Our method leverages the abilities of language models learnt from large scale text-only pretraining, such as in-context learning and free-form text generation. We keep the language model frozen, and finetune input and output linear layers to enable cross-modality interactions. This allows our model to process arbitrarily interleaved image-and-text inputs, and generate free-form text interleaved with retrieved images. We achieve strong zero-shot performance on grounded tasks such as contextual image retrieval and multimodal dialogue, and showcase compelling interactive abilities. Our approach works with any off-the-shelf language model and paves the way towards an effective, general solution for leveraging pretrained language models in visually grounded settings.
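The data flow described in the abstract (a frozen language model with trainable input and output linear layers, and retrieval of images from the model's output) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names `linear`, `frozen_lm`, `encode_sequence`, and `retrieve` are hypothetical, the dimensions are toy values, `frozen_lm` is a mean-pooling stand-in rather than a real language model, and the candidate image embeddings are assumed to be precomputed in the retrieval space. It is not the authors' implementation.

```python
import math
import random


def linear(weights, x):
    """Bias-free linear layer; weights has shape (out_dim, in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return dot / norm if norm else 0.0


def frozen_lm(embeddings):
    """Hypothetical stand-in for the frozen language model.

    It has no trainable parameters here and simply mean-pools the input
    sequence into one "hidden state"; a real model would be a transformer.
    """
    n = len(embeddings)
    return [sum(col) / n for col in zip(*embeddings)]


def encode_sequence(items, w_in):
    """Build an interleaved input sequence.

    items is a list of ("text", embedding) or ("image", visual_features).
    Image features go through the trainable input layer w_in so that they
    live in the same space as text embeddings; text passes through as-is.
    """
    return [linear(w_in, vec) if kind == "image" else list(vec)
            for kind, vec in items]


def retrieve(hidden, w_out, candidates):
    """Map the hidden state through the trainable output layer w_out and
    return the index of the most similar candidate image embedding."""
    query = linear(w_out, hidden)
    scores = [cosine(query, c) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


if __name__ == "__main__":
    # Toy dimensions, chosen only for the demo.
    visual_dim, lm_dim, retrieval_dim = 4, 6, 3
    rng = random.Random(0)

    def init(out_dim, in_dim):
        return [[rng.gauss(0.0, 0.1) for _ in range(in_dim)]
                for _ in range(out_dim)]

    w_in = init(lm_dim, visual_dim)      # trainable: image -> LM input space
    w_out = init(retrieval_dim, lm_dim)  # trainable: LM output -> retrieval space

    items = [
        ("text", [rng.random() for _ in range(lm_dim)]),
        ("image", [rng.random() for _ in range(visual_dim)]),
        ("text", [rng.random() for _ in range(lm_dim)]),
    ]
    candidates = [[rng.gauss(0.0, 1.0) for _ in range(retrieval_dim)]
                  for _ in range(5)]

    hidden = frozen_lm(encode_sequence(items, w_in))
    print("retrieved candidate index:", retrieve(hidden, w_out, candidates))
```

In this sketch only `w_in` and `w_out` would receive gradient updates during training; the language model stays fixed, which is what keeps the approach cheap and lets it be paired with any off-the-shelf model.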
Related papers
- Generating Images With Multimodal Language Models (2023) 6.77
- MLLMs-Augmented Visual-Language Representation Learning (2023) 0.00
- Learning Visually Grounded Sentence Representations (2017) 7.81
- ImageBERT: Cross-modal Pre-training With Large-scale Weak-supervised Image-text Data (2020) 0.00
- Hyperdimensional Cross-modal Alignment Of Frozen Language And Image Models For Efficient Image Captioning (2026) 0.00
- Towards Zero-shot Cross-lingual Image Retrieval And Tagging (2021) 2.46
- Align2Ground: Weakly Supervised Phrase Grounding Guided By Image-caption Alignment (2019) 13.93
- Compositional Image-text Matching And Retrieval By Grounding Entities (2025) 0.60