The FlySpeech Audio-visual Speaker Diarization System For MISP Challenge 2022
2023 · Li Zhang, Huan Zhao, Yue Li, et al.
Abstract
This paper describes the FlySpeech speaker diarization system submitted to the second Multimodal Information Based Speech Processing (MISP) Challenge, held at ICASSP 2022. We develop an end-to-end audio-visual speaker diarization (AVSD) system that consists of a lip encoder, a speaker encoder, and an audio-visual decoder. Specifically, to mitigate the degradation in diarization performance caused by separate training, we jointly train the speaker encoder and the audio-visual decoder. In addition, we initialize the speaker encoder from a speaker extractor pretrained on a large amount of data.
Related papers
- Joint Training Or Not: An Exploration Of Pre-trained Speech Models In Audio-visual Speaker Diarization (2023)
- Royalflush Speaker Diarization System For ICASSP 2022 Multi-channel Multi-party Meeting Transcription Challenge (2022)
- The NPU-ASLP System For Audio-visual Speech Recognition In MISP 2022 Challenge (2023)
- Taltech-irit-lis Speaker And Language Diarization Systems For DISPLACE 2024 (2024)
- The Xmuspeech System For Multi-channel Multi-party Meeting Transcription Challenge (2022)
- Late Audio-visual Fusion For In-the-wild Speaker Diarization (2022)
- The Multimodal Information Based Speech Processing (MISP) 2023 Challenge: Audio-visual Target Speaker Extraction (2023)
- The CUHK-TENCENT Speaker Diarization System For The ICASSP 2022 Multi-channel Multi-party Meeting Transcription Challenge (2022)