End-to-end Single-channel Speaker-turn Aware Conversational Speech Translation
2023 · Juan Zuluaga-Gomez, Zhaocheng Huang, Xing Niu, et al.
Abstract
Conventional speech-to-text translation (ST) systems are trained on single-speaker utterances, and they may not generalize to real-life scenarios where the audio contains conversations between multiple speakers. In this paper, we tackle single-channel multi-speaker conversational ST with an end-to-end, multi-task model named Speaker-Turn Aware Conversational Speech Translation, which combines automatic speech recognition, speech translation, and speaker-turn detection using special tokens in a serialized labeling format. We run experiments on the Fisher-CALLHOME corpus, which we adapted by merging the two single-speaker channels into one multi-speaker channel, thus representing the more realistic and challenging scenario with multi-speaker turns and cross-talk. Experimental results across single- and multi-speaker conditions and against conventional ST systems show that our model outperforms the reference systems on the multi-speaker condition, while attaining comparable perform
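The two data-preparation ideas named in the abstract, merging two single-speaker channels into one and serializing labels with special tokens, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `<turn>` token, the `Segment` type, and the averaging-based mix are hypothetical and are not taken from the paper, which may use a different token inventory and mixing procedure.

```python
from dataclasses import dataclass

# Hypothetical speaker-turn token; the paper's actual special tokens are not given here.
TURN = "<turn>"


@dataclass
class Segment:
    speaker: str   # channel or speaker identifier (assumed)
    start: float   # segment start time in seconds (assumed)
    text: str      # transcript or translation of the segment


def serialize(segments):
    """Order segments by start time and insert TURN wherever the speaker changes."""
    ordered = sorted(segments, key=lambda s: s.start)
    pieces = []
    prev_speaker = None
    for seg in ordered:
        if prev_speaker is not None and seg.speaker != prev_speaker:
            pieces.append(TURN)
        pieces.append(seg.text)
        prev_speaker = seg.speaker
    return " ".join(pieces)


def mix_channels(left, right):
    """Average two channels sample by sample, zero-padding the shorter one."""
    n = max(len(left), len(right))
    left = list(left) + [0.0] * (n - len(left))
    right = list(right) + [0.0] * (n - len(right))
    return [(a + b) / 2 for a, b in zip(left, right)]
```

For example, calling `serialize` on segments from speakers A, B, A, A (in time order) yields a single target string with a `<turn>` marker at each of the two speaker changes, and `mix_channels` produces one mono sample list from the two per-speaker lists. Cross-talk, which the abstract also mentions, is not handled in this sketch.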
Related papers
- Multilingual End-to-end Speech Translation (2019)
- Synchronous Speech Recognition And Speech-to-text Translation With Interactive Decoding (2019)
- Speaker Conditioned Acoustic Modeling For Multi-speaker Conversational ASR (2021)
- End-to-end Speech Translation With Knowledge Distillation (2019)
- Long-form End-to-end Speech Translation Via Latent Alignment Segmentation (2023)
- Rethinking And Improving Multi-task Learning For End-to-end Speech Translation (2023)
- Joint Training And Decoding For Multilingual End-to-end Simultaneous Speech Translation (2025)
- End-to-end Multichannel Speaker-attributed ASR: Speaker Guided Decoder And Input Feature Analysis (2023)