X-LLM: Bootstrapping Advanced Large Language Models By Treating Multi-modalities As Foreign Languages
2023 · Feilong Chen, Minglun Han, Haozhi Zhao, et al.
Abstract
Large language models (LLMs) have demonstrated remarkable language abilities. GPT-4, based on advanced LLMs, exhibits extraordinary multimodal capabilities beyond previous visual language models. We attribute this to the use of more advanced LLMs compared with previous multimodal models. Unfortunately, the model architecture and training strategies of GPT-4 are unknown. To endow LLMs with multimodal capabilities, we propose X-LLM, which converts Multi-modalities (images, speech, videos) into foreign languages using X2L interfaces and inputs them into a large Language model (ChatGLM). Specifically, X-LLM aligns multiple frozen single-modal encoders and a frozen LLM using X2L interfaces, where "X" denotes multi-modalities such as image, speech, and videos, and "L" denotes languages. X-LLM's training consists of three stages: (1) Converting Multimodal Information: The first stage trains each X2L interface to align with its respective single-modal encoder separately to convert multimod
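The data flow described in the abstract (frozen single-modal encoder, trainable X2L interface, frozen LLM) can be illustrated with a minimal sketch. Everything here is a simplifying assumption for illustration, not the authors' implementation: the names `frozen_image_encoder`, `X2LInterface`, and `build_llm_input` are hypothetical, the interface is reduced to a single linear map, and the dimensions are toy values.

```python
import random


def frozen_image_encoder(image):
    """Stand-in for a frozen single-modal encoder (hypothetical).

    Maps an input to a list of feature vectors, one per patch.
    Its parameters would not be updated during training.
    """
    return [[float(v) for v in patch] for patch in image]


class X2LInterface:
    """Hypothetical trainable interface, simplified to one linear map.

    Projects encoder features into the LLM's embedding dimension so the
    LLM can read them like tokens of a "foreign language".
    """

    def __init__(self, in_dim, llm_dim, seed=0):
        rng = random.Random(seed)
        # The only weights that would be trained in this sketch.
        self.weight = [
            [rng.uniform(-0.1, 0.1) for _ in range(llm_dim)]
            for _ in range(in_dim)
        ]

    def __call__(self, features):
        llm_dim = len(self.weight[0])
        return [
            [sum(f[i] * self.weight[i][j] for i in range(len(f)))
             for j in range(llm_dim)]
            for f in features
        ]


def build_llm_input(modal_embeddings, text_embeddings):
    """Prepend the converted modality embeddings to the text embeddings."""
    return modal_embeddings + text_embeddings


# Toy usage: two fake image patches (4 features each), three text tokens.
image = [[1, 0, 2, 1], [0, 3, 1, 0]]
text_embeddings = [[0.0] * 8 for _ in range(3)]
interface = X2LInterface(in_dim=4, llm_dim=8)
sequence = build_llm_input(interface(frozen_image_encoder(image)), text_embeddings)
```

In this sketch `sequence` is what a frozen LLM would consume: the projected image features followed by the ordinary text embeddings, all in the same embedding dimension.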
Related papers
- A Review Of Multi-modal Large Language And Vision Models (2024) 0.00
- NExT-GPT: Any-to-Any Multimodal LLM (2023) 0.00
- Paralinguistics-enhanced Large Language Modeling Of Spoken Dialogue (2023) 0.00
- SpeechGPT: Empowering Large Language Models With Intrinsic Cross-modal Conversational Abilities (2023) 16.59
- LLMs Meet Multimodal Generation And Editing: A Survey (2024) 5.48
- Teaching A Multilingual Large Language Model To Understand Multilingual Speech Via Multi-instructional Training (2024) 0.00
- C3LLM: Conditional Multimodal Content Generation Using Large Language Models (2024) 0.00
- Large Language Models Are Strong Audio-visual Speech Recognition Learners (2024) 9.59