A Review Of Multi-modal Large Language And Vision Models
2024 · Kilian Carolan, Laura Fennelly, Alan F. Smeaton
Abstract
Large Language Models (LLMs) have recently emerged as a focal point of research and application, driven by their unprecedented ability to understand and generate text with human-like quality. Even more recently, LLMs have been extended into multi-modal large language models (MM-LLMs), which extend their capabilities to handle image, video and audio information in addition to text. This opens up applications such as text-to-video generation, image captioning, text-to-speech, and more, and is achieved either by retrofitting an LLM with multi-modal capabilities or by building an MM-LLM from scratch. This paper provides an extensive review of the current state of those LLMs with multi-modal capabilities as well as the very recent MM-LLMs. It covers the historical development of LLMs, especially the advances enabled by transformer-based architectures like OpenAI's GPT series and Google's BERT, as well as the role of attention mechanisms in enhancing model performance. The paper includes covera
Related papers
- LLMs Meet Multimodal Generation And Editing: A Survey (2024)
- Multimodal Large Language Models: A Survey (2023)
- X-LLM: Bootstrapping Advanced Large Language Models By Treating Multi-modalities As Foreign Languages (2023)
- Large Language Models Are Strong Audio-visual Speech Recognition Learners (2024)
- Recent Advances In Speech Language Models: A Survey (2024)
- Macaw-LLM: Multi-modal Language Modeling With Image, Audio, Video, And Text Integration (2023)
- Discrete Multimodal Transformers With A Pretrained Large Language Model For Mixed-supervision Speech Processing (2024)
- C3LLM: Conditional Multimodal Content Generation Using Large Language Models (2024)