OSUM: Advancing Open Speech Understanding Models With Limited Resources In Academia
2025 · Xuelong Geng, Kun Wei, Qijie Shao, et al.
Abstract
Large Language Models (LLMs) have made significant progress in various downstream tasks, inspiring the development of Speech Understanding Language Models (SULMs) to enable comprehensive speech-based interactions. However, most advanced SULMs are developed by industry, leveraging large-scale datasets and computational resources that are not readily available to the academic community. Moreover, the lack of transparency in training details creates additional barriers to further innovation. In this study, we present OSUM, an Open Speech Understanding Model designed to explore the potential of training SULMs under constrained academic resources. The OSUM model combines a Whisper encoder with a Qwen2 LLM and supports a wide range of speech tasks, including speech recognition (ASR), speech recognition with timestamps (SRWT), vocal event detection (VED), speech emotion recognition (SER), speaking style recognition (SSR), speaker gender classification (SGC), speaker age prediction (SAP),