The FlySpeech Audio-Visual Speaker Diarization System for MISP Challenge 2022

Abstract

This paper describes the FlySpeech speaker diarization system submitted to the second Multimodal Information Based Speech Processing (MISP) Challenge held at ICASSP 2022. We develop an end-to-end audio-visual speaker diarization (AVSD) system consisting of a lip encoder, a speaker encoder, and an audio-visual decoder. To mitigate the degradation in diarization performance caused by training these components separately, we jointly train the speaker encoder and the audio-visual decoder. In addition, we initialize the speaker encoder from a speaker extractor pretrained on large-scale data.
