Towards Unsupervised Speaker Diarization System For Multilingual Telephone Calls Using Pre-trained Whisper Model And Mixture Of Sparse Autoencoders

Abstract

Existing speaker diarization systems typically rely on large amounts of manually annotated data, which is labor-intensive and difficult to obtain, especially in real-world scenarios. Language-specific constraints in these systems also significantly hinder their effectiveness and scalability in multilingual settings. In this paper, we propose a cluster-based speaker diarization system designed for multilingual telephone call applications. The proposed system supports multiple languages and eliminates the need for large-scale annotated training data by using the multilingual Whisper model to extract speaker embeddings. We also introduce a network architecture called Mixture of Sparse Autoencoders (Mix-SAE) for unsupervised speaker clustering. Experimental results on an evaluation dataset derived from two-speaker subsets of the benchmark CALLHOME and CALLFRIEND telephone speech corpora demonstrate the superior performance of the proposed Mix-SAE network to other
