Joint Training Or Not: An Exploration Of Pre-trained Speech Models In Audio-visual Speaker Diarization
2023 · Huan Zhao, Li Zhang, Yue Li, et al.
Abstract
The scarcity of labeled audio-visual datasets constrains the training of strong audio-visual speaker diarization (AVSD) systems. To improve AVSD performance, we leverage pre-trained supervised and self-supervised speech models. Specifically, we adopt supervised (ResNet and ECAPA-TDNN) and self-supervised (WavLM and HuBERT) pre-trained models as the speaker and audio embedding extractors in an end-to-end AVSD system. We then explore the effectiveness of different frameworks in the audio-visual decoder, including Transformer, Conformer, and a cross-attention mechanism. To mitigate the performance degradation caused by separate training, we jointly train the audio encoder, speaker encoder, and audio-visual decoder of the AVSD system. Experiments on the MISP dataset demonstrate that the proposed method achieves superior performance; the system obtained third place in the MISP Challenge 2022.
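The abstract names a cross-attention mechanism as one option for the audio-visual decoder. The following is only a minimal sketch of scaled dot-product cross-attention, assuming audio frame embeddings act as queries and visual frame embeddings act as keys and values. It is not the authors' implementation: the function names, the toy 2-dimensional embeddings, and the omission of learned projections and multiple heads are all illustrative assumptions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention (no learned projections).

    queries: list of query vectors (assumed here: audio frame embeddings)
    keys, values: lists of vectors (assumed here: visual frame embeddings)
    Returns one fused vector per query.
    """
    d = len(keys[0])
    fused = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused


# Hypothetical toy inputs: 3 audio frames, 2 visual frames, dimension 2.
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
visual = [[1.0, 0.0], [0.0, 1.0]]
fused = cross_attention(audio, visual, visual)
```

Each output row is a convex combination of the visual vectors, weighted by how closely the corresponding audio frame matches each of them. A real decoder would add learned query, key, and value projections and would typically be trained together with the encoders, which is the joint-training setting the abstract describes.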
Related papers
- Late Audio-visual Fusion For In-the-wild Speaker Diarization (2022) 3.58
- Joint Training Of Speaker Embedding Extractor, Speech And Overlap Detection For Diarization (2024) 2.26
- Integrating Audio, Visual, And Semantic Information For Enhanced Multimodal Speaker Diarization (2024) 0.00
- Multi-input Multi-output Target-speaker Voice Activity Detection For Unified, Flexible, And Robust Audio-visual Speaker Diarization (2024) 0.00
- The FlySpeech Audio-visual Speaker Diarization System For MISP Challenge 2022 (2023) 0.00
- AVFormer: Injecting Vision Into Frozen Speech Models For Zero-shot AV-ASR (2023) 7.81
- Audio-visual Speaker Diarization Based On Spatiotemporal Bayesian Fusion (2016) 14.51
- Leveraging Unimodal Self-supervised Learning For Multimodal Audio-visual Speech Recognition (2022) 11.29