NExT-GPT: Any-to-Any Multimodal LLM
2023 · Shengqiong Wu, Hao Fei, Leigang Qu, et al.
Abstract
While Multimodal Large Language Models (MM-LLMs) have recently made exciting strides, they mostly fall prey to the limitation of only input-side multimodal understanding, without the ability to produce content in multiple modalities. As we humans always perceive the world and communicate with people through various modalities, developing any-to-any MM-LLMs capable of accepting and delivering content in any modality becomes essential to human-level AI. To fill the gap, we present an end-to-end general-purpose any-to-any MM-LLM system, NExT-GPT. We connect an LLM with multimodal adaptors and different diffusion decoders, enabling NExT-GPT to perceive inputs and generate outputs in arbitrary combinations of text, images, videos, and audio. By leveraging existing well-trained, high-performing encoders and decoders, NExT-GPT is tuned with only a small fraction of parameters (1%) in certain projection layers, which not only benefits low-cost training but also facilitates convenient expansi
Related papers
- AnyGPT: Unified Multimodal LLM With Discrete Sequence Modeling (2024) · 0.00
- X-LLM: Bootstrapping Advanced Large Language Models By Treating Multi-modalities As Foreign Languages (2023) · 0.00
- A Review Of Multi-modal Large Language And Vision Models (2024) · 0.00
- M2-omni: Advancing Omni-MLLM For Comprehensive Modality Support With Competitive Performance (2025) · 0.00
- LLMs Meet Multimodal Generation And Editing: A Survey (2024) · 5.48
- MIO: A Foundation Model On Multimodal Tokens (2024) · 3.58
- C3LLM: Conditional Multimodal Content Generation Using Large Language Models (2024) · 0.00
- Training-free Multimodal Large Language Model Orchestration (2025) · 0.00