Towards Multi-modal Mastery: A 4.5B Parameter Truly Multi-modal Small Language Model
2024 · Ben Koska, Mojmír Horváth
Abstract
We present a novel 4.5B-parameter small language model that handles multiple input and output modalities, including text, images, video, and audio. Despite its small size, the model achieves near state-of-the-art performance on a variety of tasks, demonstrating the potential of multi-modal models to tackle complex real-world problems. Our approach leverages recent advances in language modeling and multi-task learning to create a versatile, high-performing model that can be deployed for edge inference. Experimental results show strong performance across multiple benchmarks, paving the way for further progress in multi-modal artificial intelligence.
Related papers
- MMMModal -- Multi-images Multi-audio Multi-turn Multi-modal (2024)
- Mixture-of-Transformers: A Sparse And Scalable Architecture For Multi-modal Foundation Models (2024)
- Multimodal Large Language Models: A Survey (2023)
- Discrete Multimodal Transformers With A Pretrained Large Language Model For Mixed-supervision Speech Processing (2024)
- Phi-4-Mini Technical Report: Compact Yet Powerful Multimodal Language Models Via Mixture-of-LoRAs (2025)
- A Review Of Multi-modal Large Language And Vision Models (2024)
- SpeechGPT: Empowering Large Language Models With Intrinsic Cross-modal Conversational Abilities (2023)
- M2-omni: Advancing Omni-MLLM For Comprehensive Modality Support With Competitive Performance (2025)