Abstract

This paper describes the FlySpeech speaker diarization system submitted to the second Multimodal Information Based Speech Processing (MISP) Challenge held at ICASSP 2022. We develop an end-to-end audio-visual speaker diarization (AVSD) system, which consists of a lip encoder, a speaker encoder, and an audio-visual decoder. Specifically, to mitigate the degradation of diarization performance caused by separate training, we jointly train the speaker encoder and the audio-visual decoder. In addition, we initialize the speaker encoder from a speaker extractor pretrained on large-scale data.

Authors

(none)

Tags

  • Text-to-Speech

Stats

  • citations: 0
  • S2 citations: n/a
  • github stars: 0
  • HF likes: 0
  • heat score: 0.00
  • arxiv key: zhang2023the

Related papers