MiMo-Audio: Audio Language Models Are Few-shot Learners
2025 · Core Team, Dong Zhang, Gang Wang, et al.
Abstract
Existing audio language models typically rely on task-specific fine-tuning to accomplish particular audio tasks. In contrast, humans can generalize to new audio tasks from only a few examples or simple instructions. GPT-3 has shown that scaling next-token prediction pretraining enables strong generalization capabilities in text, and we believe this paradigm is equally applicable to the audio domain. By scaling MiMo-Audio's pretraining data to over one hundred million hours, we observe the emergence of few-shot learning capabilities across a diverse set of audio tasks. We develop a systematic evaluation of these capabilities and find that MiMo-Audio-7B-Base achieves state-of-the-art (SOTA) performance on both speech intelligence and audio understanding benchmarks among open-source models. Beyond standard metrics, MiMo-Audio-7B-Base generalizes to tasks absent from its training data, such as voice conversion, style transfer, and speech editing. MiMo-Audio-7B-Base also demonstrates powerful speec
Related papers
- AudioLDM 2: Learning Holistic Audio Generation With Self-supervised Pretraining (2023) 0.00
- Pengi: An Audio Language Model For Audio Tasks (2023) 10.35
- Enhancing Low-resource Language And Instruction Following Capabilities Of Audio Language Models (2024) 2.26
- UniAudio: An Audio Foundation Model Toward Universal Audio Generation (2023) 5.56
- Measuring Audio's Impact On Correctness: Audio-contribution-aware Post-training Of Large Audio Language Models (2025) 0.00
- AudioLM: A Language Modeling Approach To Audio Generation (2022) 18.91
- UniAudio 1.5: Large Language Model-driven Audio Codec Is A Few-shot Audio Task Learner (2024) 0.00
- MoWE-Audio: Multitask AudioLLMs With Mixture Of Weak Encoders (2024) 3.58