DiariST: Streaming Speech Translation With Speaker Diarization
2023 · Mu Yang, Naoyuki Kanda, Xiaofei Wang, et al.
Abstract
End-to-end speech translation (ST) for conversation recordings involves several under-explored challenges such as speaker diarization (SD) without accurate word time stamps and handling of overlapping speech in a streaming fashion. In this work, we propose DiariST, the first streaming ST and SD solution. It is built upon a neural transducer-based streaming ST system and integrates token-level serialized output training and t-vector, which were originally developed for multi-talker speech recognition. Due to the absence of evaluation benchmarks in this area, we develop a new evaluation dataset, DiariST-AliMeeting, by translating the reference Chinese transcriptions of the AliMeeting corpus into English. We also propose new metrics, called speaker-agnostic BLEU and speaker-attributed BLEU, to measure the ST quality while taking SD accuracy into account. Our system achieves a strong ST and SD capability compared to offline systems based on Whisper, while performing streaming inference for
Related papers
- SCDiar: A Streaming Diarization System Based On Speaker Change Detection And Speech Recognition (2025)
- Online Streaming End-to-end Neural Diarization Handling Overlapping Speech And Flexible Numbers Of Speakers (2021)
- Aligning Speakers: Evaluating And Visualizing Text-based Diarization Using Efficient Multiple Sequence Alignment (Extended Version) (2023)
- One Model To Rule Them All? Towards End-to-end Joint Speaker Diarization And Speech Recognition (2023)
- Exploring Speaker-related Information In Spoken Language Understanding For Better Speaker Diarization (2023)
- StreamAtt: Direct Streaming Speech-to-text Translation With Attention-based Audio History Selection (2024)
- Transcribe-to-diarize: Neural Speaker Diarization For Unlimited Number Of Speakers Using End-to-end Speaker-attributed ASR (2021)
- Direct Simultaneous Speech-to-text Translation Assisted By Synchronized Streaming ASR (2021)