Multimodal Latent Language Modeling With Next-token Diffusion
2024 · Yutao Sun, Hangbo Bao, Wenhui Wang, et al.
Abstract
Multimodal generative models require a unified approach to handle both discrete data (e.g., text and code) and continuous data (e.g., image, audio, video). In this work, we propose Latent Language Modeling (LatentLM), which seamlessly integrates continuous and discrete data using causal Transformers. Specifically, we employ a variational autoencoder (VAE) to represent continuous data as latent vectors and introduce next-token diffusion for autoregressive generation of these vectors. Additionally, we develop \(\sigma\)-VAE to address the challenges of variance collapse, which is crucial for autoregressive modeling. Extensive experiments demonstrate the effectiveness of LatentLM across various modalities. In image generation, LatentLM surpasses Diffusion Transformers in both performance and scalability. When integrated into multimodal large language models, LatentLM provides a general-purpose interface that unifies multimodal generation and understanding. Experimental results show that L
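The abstract names next-token diffusion but does not spell out the training objective, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes a DDPM-style noise-prediction loss, replaces the causal Transformer with a running mean over earlier latents, and uses a linear layer as the diffusion head. All names (`causal_context`, `next_token_diffusion_loss`, `W`, `alpha_bar`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def causal_context(latents):
    """Stand-in for a causal Transformer (assumption, not the paper's model).

    h[i] is the mean of latents[0..i-1], so position i never sees itself
    or any later token. h[0] is all zeros.
    """
    n, d = latents.shape
    h = np.zeros((n, d))
    if n > 1:
        csum = np.cumsum(latents, axis=0)
        h[1:] = csum[:-1] / np.arange(1, n)[:, None]
    return h

def next_token_diffusion_loss(latents, W, alpha_bar):
    """One Monte Carlo estimate of a per-token noise-prediction loss."""
    n, d = latents.shape
    h = causal_context(latents)                    # (n, d) conditioning vectors
    t = rng.integers(0, len(alpha_bar), size=n)    # one diffusion step per token
    a = alpha_bar[t][:, None]                      # (n, 1)
    eps = rng.standard_normal((n, d))
    noisy = np.sqrt(a) * latents + np.sqrt(1.0 - a) * eps
    # Hypothetical linear diffusion head: predicts eps from [noisy, h, t].
    feats = np.concatenate([noisy, h, (t / len(alpha_bar))[:, None]], axis=1)
    eps_hat = feats @ W                            # (n, d)
    return np.mean((eps_hat - eps) ** 2)

n_tokens, dim, steps = 8, 4, 100
latents = rng.standard_normal((n_tokens, dim))     # pretend these are VAE latents
betas = np.linspace(1e-4, 0.02, steps)             # assumed linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)
W = rng.standard_normal((2 * dim + 1, dim)) * 0.1  # untrained head weights
loss = next_token_diffusion_loss(latents, W, alpha_bar)
```

The point of the sketch is the conditioning structure: each continuous latent is denoised given only a summary of the latents before it, which is what lets continuous tokens sit in the same left-to-right sequence as discrete ones.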
Related papers
- C3LLM: Conditional Multimodal Content Generation Using Large Language Models (2024) 0.00
- Discrete Multimodal Transformers With A Pretrained Large Language Model For Mixed-supervision Speech Processing (2024) 0.00
- Longcat-next: Lexicalizing Modalities As Discrete Tokens (2026) 6.80
- Multimodal Large Language Models: A Survey (2023) 0.00
- LLMs Meet Multimodal Generation And Editing: A Survey (2024) 5.48
- Audio-enhanced Vision-language Modeling With Latent Space Broadening For High Quality Data Expansion (2025) 0.00
- Conditional Latent Diffusion-based Speech Enhancement Via Dual Context Learning (2025) 10.81
- AudioLM: A Language Modeling Approach To Audio Generation (2022) 18.91