Flow-TSVAD: Target-Speaker Voice Activity Detection via Latent Flow Matching
2024 · Zhengyang Chen, Bing Han, Shuai Wang, et al.
Abstract
Speaker diarization is typically treated as a discriminative task, with discriminative approaches producing a single, fixed diarization result. In this paper, we explore, for the first time, the use of neural network-based generative methods for speaker diarization. We implement a Flow-Matching (FM) based generative algorithm within the sequence-to-sequence target-speaker voice activity detection (Seq2Seq-TSVAD) diarization system. Our experiments reveal that applying the generative method directly to the original binary label sequence space of the TS-VAD output is ineffective. To address this issue, we propose mapping the binary label sequence into a dense latent space before applying the generative algorithm. The resulting Flow-TSVAD method outperforms the Seq2Seq-TSVAD system. Additionally, we observe that the FM algorithm converges rapidly during the inference stage, requiring only two inference steps to achieve promising results. As a generative model, Flow-TSVAD allows for sampling diff
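To make the "two inference steps" remark concrete, here is a minimal sketch of fixed-step Euler sampling as commonly used with flow matching. It is not the paper's implementation. The names `euler_sample` and `make_toy_velocity` are hypothetical, and the velocity field is a hand-written toy standing in for a trained network. The toy uses the exact velocity of a straight path toward a known target vector, which is why very few steps suffice here. A learned model conditioned on audio would behave only approximately like this.

```python
import random


def euler_sample(velocity, x0, num_steps=2):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    h = 1.0 / num_steps
    for k in range(num_steps):
        t = k * h  # t takes values 0, h, ..., 1 - h (never 1)
        v = velocity(x, t)
        x = [xi + h * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target):
    """Hypothetical stand-in for a trained network (assumption, not the paper's model).

    Returns the exact velocity of the straight-line path that reaches
    `target` at t=1 from the current point x at time t.
    """
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


# Toy "latent" target; in a real system this would come from the model,
# not be known in advance.
target_latent = [0.8, -0.3, 0.5, -0.9]

# Start from Gaussian noise, as is usual for flow-matching samplers.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in target_latent]

sample = euler_sample(make_toy_velocity(target_latent), noise, num_steps=2)
```

With this idealised straight-path field, the final Euler step lands on the target regardless of the step count. How a sampled latent is mapped back to a binary label sequence is not shown, since the text above does not specify it.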
Related papers
- DiFlow-TTS: Compact And Low-latency Zero-shot Text-to-speech With Factorized Discrete Flow Matching (2025) 0.00
- Generative Pre-training For Speech With Flow Matching (2023) 0.00
- VSSFlow: Unifying Video-conditioned Sound And Speech Generation Via Joint Learning (2025) 0.00
- VoiceFlow: Efficient Text-to-speech With Rectified Flow Matching (2023) 0.00
- V2SFlow: Video-to-speech Generation With Speech Decomposition And Rectified Flow (2024) 8.52
- Target-speaker Voice Activity Detection Via Sequence-to-sequence Prediction (2022) 11.19
- F5-TTS: A Fairytaler That Fakes Fluent And Faithful Speech With Flow Matching (2024) 0.00
- Time-layer Adaptive Alignment For Speaker Similarity In Flow-matching Based Zero-shot TTS (2025) 0.00