Thinking In Cocktail Party: Chain-of-thought And Reinforcement Learning For Target Speaker Automatic Speech Recognition
2025 · Yiru Zhang, Hang Su, Lichun Fan, et al.
Abstract
Target Speaker Automatic Speech Recognition (TS-ASR) aims to transcribe the speech of a specified target speaker from multi-speaker mixtures in cocktail party scenarios. Recent advances in Large Audio-Language Models (LALMs) have brought new insights to TS-ASR. However, significant room for optimization remains for TS-ASR within the LALM architecture. While Chain-of-Thought (CoT) and Reinforcement Learning (RL) have proven effective in certain speech tasks, TS-ASR, which requires the model to deeply comprehend speech signals, differentiate various speakers, and handle overlapping utterances, is particularly well-suited to a reasoning-guided approach. Therefore, we propose a novel framework that incorporates CoT and RL training into TS-ASR to improve performance. A novel CoT dataset for TS-ASR is constructed, and the TS-ASR model is first trained on regular data and then fine-tuned on CoT data. Finally, the model is further trained with RL using selected dat
Related papers
- Step-audio-r1.5 Technical Report (2026)
- Internalizing ASR With Implicit Chain Of Thought For Efficient Speech-to-speech Conversational LLM (2024)
- Chain-of-thought Prompting For Speech Translation (2024)
- ASRRL-TTS: Agile Speaker Representation Reinforcement Learning For Text-to-speech Speaker Adaptation (2024)
- Listening While Speaking And Visualizing: Improving ASR Through Multimodal Chain (2019)
- Sequence-to-sequence ASR Optimization Via Reinforcement Learning (2017)
- RALL-E: Robust Codec Language Modeling With Chain-of-thought Prompting For Text-to-speech Synthesis (2024)
- Multi-speaker ASR Combining Non-autoregressive Conformer CTC And Conditional Speaker Chain (2021)