AnyGPT: Unified Multimodal LLM With Discrete Sequence Modeling
2024 · Jun Zhan, Junqi Dai, Jiasheng Ye, et al.
Abstract
We introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete representations for the unified processing of various modalities, including speech, text, images, and music. AnyGPT can be trained stably without any alterations to the current large language model (LLM) architecture or training paradigms. Instead, it relies exclusively on data-level preprocessing, facilitating the seamless integration of new modalities into LLMs, akin to the incorporation of new languages. We build a multimodal text-centric dataset for multimodal alignment pre-training. Utilizing generative models, we synthesize the first large-scale any-to-any multimodal instruction dataset. It consists of 108k samples of multi-turn conversations that intricately interweave various modalities, thus equipping the model to handle arbitrary combinations of multimodal inputs and outputs. Experimental results demonstrate that AnyGPT is capable of facilitating any-to-any multimodal conversation while achiev
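The abstract's central idea, that new modalities enter the model purely through data-level preprocessing, much like adding a new language, can be illustrated with a small sketch. This is not the authors' implementation: the class `UnifiedVocab`, the codebook sizes, and the per-modality boundary tokens are all hypothetical, chosen only to show how discrete codes from separate tokenizers could share one id space with text so that an unmodified LLM sees a single token sequence.

```python
class UnifiedVocab:
    """Hypothetical shared id space: text ids first, then one block per modality."""

    def __init__(self, text_vocab_size):
        self.size = text_vocab_size  # next free token id
        self.blocks = {}             # modality -> (start_id, codebook_size, bos_id, eos_id)

    def add_modality(self, name, codebook_size):
        # Reserve ids for the codebook plus two boundary tokens (an assumption).
        start = self.size
        bos = start + codebook_size
        eos = bos + 1
        self.blocks[name] = (start, codebook_size, bos, eos)
        self.size = eos + 1

    def wrap(self, name, codes):
        # Shift tokenizer-local codes into the shared id space and add boundaries.
        start, n, bos, eos = self.blocks[name]
        if any(not 0 <= c < n for c in codes):
            raise ValueError(f"code out of range for modality {name!r}")
        return [bos] + [start + c for c in codes] + [eos]

    def split(self, ids):
        # Recover (modality, local_codes) segments from one flat sequence.
        bos_owner = {blk[2]: name for name, blk in self.blocks.items()}
        segments, text_run, i = [], [], 0
        while i < len(ids):
            name = bos_owner.get(ids[i])
            if name is None:
                text_run.append(ids[i])
                i += 1
                continue
            if text_run:
                segments.append(("text", text_run))
                text_run = []
            start, _, _, eos = self.blocks[name]
            end = ids.index(eos, i)  # matching end-of-segment token
            segments.append((name, [t - start for t in ids[i + 1:end]]))
            i = end + 1
        if text_run:
            segments.append(("text", text_run))
        return segments


vocab = UnifiedVocab(text_vocab_size=32000)   # sizes below are illustrative only
vocab.add_modality("image", codebook_size=8192)
vocab.add_modality("speech", codebook_size=1024)

# Text ids interleaved with image and speech codes form one training sequence.
sequence = [5, 17] + vocab.wrap("image", [0, 8191]) + [42] + vocab.wrap("speech", [3])
segments = vocab.split(sequence)
```

Under this scheme, supporting another modality only requires a further `add_modality` call and a larger embedding table; the sequence model itself is untouched, which is the property the abstract emphasizes.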
Related papers
- NExT-GPT: Any-to-Any Multimodal LLM (2023) · 0.00
- SpeechGPT: Empowering Large Language Models With Intrinsic Cross-Modal Conversational Abilities (2023) · 16.59
- X-LLM: Bootstrapping Advanced Large Language Models By Treating Multi-Modalities As Foreign Languages (2023) · 0.00
- MIO: A Foundation Model On Multimodal Tokens (2024) · 3.58
- M2-omni: Advancing Omni-MLLM For Comprehensive Modality Support With Competitive Performance (2025) · 0.00
- LLMs Meet Multimodal Generation And Editing: A Survey (2024) · 5.48
- Paralinguistics-Enhanced Large Language Modeling Of Spoken Dialogue (2023) · 0.00
- Multimodal Large Language Models: A Survey (2023) · 0.00