S3: A Simple Strong Sample-effective Multimodal Dialog System
2024 · Elisei Rykov, Egor Malkershin, Alexander Panchenko
Abstract
In this work, we present S3, a conceptually simple yet powerful baseline for the multimodal dialog task that achieves near state-of-the-art results on two leaderboards: MMMU and AI Journey Contest 2023. The system is built from a pre-trained large language model, pre-trained modality encoders for image and audio, and a trainable modality projector. The data mixture we propose for training this architecture demonstrates that a multimodal model built on a strong language model and trained on only a small amount of multimodal data can perform effectively in multimodal dialog.
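The data flow the abstract describes (modality encoder, then projector, then language model) can be sketched as below. This is a minimal illustration, not the authors' implementation. It assumes a single linear projector and assumes the projected modality tokens are simply prepended to the text embeddings; the abstract specifies neither. The names `make_projector`, `project`, and `build_llm_input` are hypothetical, and the dimensions are toy values.

```python
import random


def make_projector(enc_dim, llm_dim, seed=0):
    """Create the weights of a linear projector (assumed form; hypothetical helper).

    In the setup described by the abstract, this is the trainable component.
    """
    rng = random.Random(seed)
    scale = 1.0 / enc_dim ** 0.5
    weight = [[rng.uniform(-scale, scale) for _ in range(enc_dim)]
              for _ in range(llm_dim)]
    bias = [0.0] * llm_dim
    return weight, bias


def project(features, weight, bias):
    """Map each encoder feature vector (enc_dim) to an LLM-sized vector (llm_dim)."""
    return [
        [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]
        for vec in features
    ]


def build_llm_input(text_embeds, modality_embeds):
    """Assumption: projected modality tokens are placed before the text tokens."""
    return modality_embeds + text_embeds


# Toy example: 3 image patches with 4-dim encoder features, 6-dim LLM embeddings.
weight, bias = make_projector(enc_dim=4, llm_dim=6)
image_features = [[0.1 * i for i in range(4)] for _ in range(3)]  # stand-in encoder output
image_tokens = project(image_features, weight, bias)
text_tokens = [[0.0] * 6 for _ in range(2)]                       # stand-in text embeddings
llm_input = build_llm_input(text_tokens, image_tokens)
```

In this sketch the projector is the only part with parameters to train, which is one way to read why a small amount of multimodal data could suffice: the language model and encoders arrive pre-trained.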
Related papers
- Data-centric Improvements For Enhancing Multi-modal Understanding In Spoken Conversation Modeling (2024) 0.00
- Towards Multi-modal Mastery: A 4.5B Parameter Truly Multi-modal Small Language Model (2024) 2.26
- Discrete Multimodal Transformers With A Pretrained Large Language Model For Mixed-supervision Speech Processing (2024) 0.00
- SpeakerLM: End-to-end Versatile Speaker Diarization And Recognition With Multimodal Large Language Models (2025) 5.24
- SLM-S2ST: A Multimodal Language Model For Direct Speech-to-speech Translation (2025) 0.00
- Mixture-of-Mamba: Enhancing Multi-modal State-space Models With Modality-aware Sparsity (2025) 3.42
- MMMModal -- Multi-images Multi-audio Multi-turn Multi-modal (2024) 0.00
- A Multimodal Approach To Device-directed Speech Detection With Large Language Models (2024) 7.16