LLMs Meet Multimodal Generation And Editing: A Survey
2024 · Yingqing He, Zhaoyang Liu, Jingye Chen, et al.
Abstract
With the recent advancement in large language models (LLMs), there is a growing interest in combining LLMs with multimodal learning. Previous surveys of multimodal large language models (MLLMs) mainly focus on multimodal understanding. This survey elaborates on multimodal generation and editing across various domains, comprising image, video, 3D, and audio. Specifically, we summarize the notable advancements with milestone works in these fields and categorize these studies into LLM-based and CLIP/T5-based methods. Then, we summarize the various roles of LLMs in multimodal generation and exhaustively investigate the critical technical components behind these methods and the multimodal datasets utilized in these studies. Additionally, we dig into tool-augmented multimodal agents that can leverage existing generative models for human-computer interaction. Lastly, we discuss the advancements in the generative AI safety field, investigate emerging applications, and discuss future prospects.
Related papers
- Multimodal Large Language Models: A Survey (2023)
- A Review Of Multi-modal Large Language And Vision Models (2024)
- C3LLM: Conditional Multimodal Content Generation Using Large Language Models (2024)
- M\(^{2}\)UGen: Multi-modal Music Understanding And Generation With The Power Of Large Language Models (2023)
- Recent Advances In Speech Language Models: A Survey (2024)
- X-LLM: Bootstrapping Advanced Large Language Models By Treating Multi-modalities As Foreign Languages (2023)
- Multimodal Large Language Models For End-to-end Affective Computing: Benchmarking And Boosting With Generative Knowledge Prompting (2025)
- Training-free Multimodal Large Language Model Orchestration (2025)