SIFT-50M: A Large-scale Multilingual Dataset For Speech Instruction Fine-tuning
2025 · Prabhat Pandey, Rupak Vignesh Swaminathan, K V Vijay Girish, et al.
Abstract
We introduce SIFT (Speech Instruction Fine-Tuning), a 50M-example dataset designed for instruction fine-tuning and pre-training of speech-text large language models (LLMs). SIFT-50M is built from publicly available speech corpora, which collectively contain 14K hours of speech, and leverages LLMs along with off-the-shelf expert models. The dataset spans five languages, encompassing a diverse range of speech understanding as well as controllable speech generation instructions. Using SIFT-50M, we train SIFT-LLM, which outperforms existing speech-text LLMs on instruction-following benchmarks while achieving competitive performance on foundational speech tasks. To support further research, we also introduce EvalSIFT, a benchmark dataset specifically designed to evaluate the instruction-following capabilities of speech-text LLMs.
Related papers
- Azeros: Extending LLM To Speech With Self-generated Instruction-free Tuning (2025)
- Making LLMs Better Many-to-many Speech-to-text Translators With Curriculum Learning (2024)
- DeSTA2: Developing Instruction-following Speech Language Model Without Speech Instruction-tuning Data (2024)
- Exploring Fine-tuning Of Large Audio Language Models For Spoken Language Understanding Under Limited Speech Data (2025)
- Dynamic-SUPERB: Towards A Dynamic, Collaborative, And Comprehensive Instruction-tuning Benchmark For Speech (2023)
- Teaching A Multilingual Large Language Model To Understand Multilingual Speech Via Multi-instructional Training (2024)
- Investigating Decoder-only Large Language Models For Speech-to-text Translation (2024)
- MCIF: Multimodal Crosslingual Instruction-following Benchmark From Scientific Talks (2025)