Multimodal Large Language Models: A Survey
2023 · Jiayang Wu, Wensheng Gan, Zefeng Chen, et al.
Abstract
The exploration of multimodal language models integrates multiple data types, such as images, text, language, audio, and other heterogeneous data. While the latest large language models excel in text-based tasks, they often struggle to understand and process other data types. Multimodal models address this limitation by combining various modalities, enabling a more comprehensive understanding of diverse data. This paper begins by defining the concept of multimodality and examining the historical development of multimodal algorithms. We then introduce a range of multimodal products, focusing on the efforts of major technology companies. A practical guide offers insights into the technical aspects of multimodal models. We also present a compilation of the latest algorithms and commonly used datasets, providing researchers with valuable resources for experimentation and evaluation. Lastly, we explore the applications of multimodal models and discuss the challenges asso
Related papers
- LLMs Meet Multimodal Generation and Editing: A Survey (2024)
- A Review of Multi-Modal Large Language and Vision Models (2024)
- Multimodal Machine Translation through Visuals and Speech (2019)
- ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst (2023)
- Training-Free Multimodal Large Language Model Orchestration (2025)
- A Survey on Speech Large Language Models for Understanding (2024)
- Discrete Multimodal Transformers with a Pretrained Large Language Model for Mixed-Supervision Speech Processing (2024)
- Towards Multi-Modal Mastery: A 4.5B Parameter Truly Multi-Modal Small Language Model (2024)