Whisper Speaker Identification: Leveraging Pre-trained Multilingual Transformers For Robust Speaker Embeddings
2025 · Jakaria Islam Emon, Md Abu Salek, Kazi Tamanna Alam
Abstract
Speaker identification in multilingual settings presents unique challenges, particularly when conventional models are trained predominantly on English data. In this paper, we propose WSI (Whisper Speaker Identification), a framework that repurposes the encoder of the Whisper automatic speech recognition model, pre-trained on extensive multilingual data, to generate robust speaker embeddings. The embeddings are trained with a joint loss optimization strategy that combines online hard triplet mining with a self-supervised Normalized Temperature-scaled Cross Entropy (NT-Xent) loss. By capitalizing on Whisper's language-agnostic acoustic representations, our approach effectively distinguishes speakers across diverse languages and recording conditions. Extensive evaluations on multiple corpora, including VoxTube (multilingual), JVS (Japanese), CallHome (German, Spanish, Chinese, and Japanese), and VoxConverse (English), demonstrate that WSI consistently outperforms state-of-the-art baselines, namely Pyannote Embedding, ECAPA-TDNN, and
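The joint objective named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine distance, the margin of 0.3, the temperature of 0.5, the weighted-sum combination, and all function names below are assumptions made for illustration, since the abstract does not specify them. Embeddings are plain Python lists standing in for Whisper-encoder outputs, and the two "views" are assumed to be two augmented versions of the same utterances.

```python
import math

def normalize(v):
    # Scale a vector to unit length (guarding against a zero vector).
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def cosine(a, b):
    # Dot product; equals cosine similarity when both inputs are unit-norm.
    return sum(x * y for x, y in zip(a, b))

def batch_hard_triplet_loss(emb, labels, margin=0.3):
    # Online hard mining: for each anchor, pick the farthest same-speaker
    # sample and the closest different-speaker sample within the batch.
    emb = [normalize(e) for e in emb]
    losses = []
    for i, (a, la) in enumerate(zip(emb, labels)):
        pos = [1 - cosine(a, e) for j, (e, l) in enumerate(zip(emb, labels))
               if l == la and j != i]
        neg = [1 - cosine(a, e) for e, l in zip(emb, labels) if l != la]
        if not pos or not neg:
            continue  # anchor has no valid triplet in this batch
        losses.append(max(0.0, max(pos) - min(neg) + margin))
    return sum(losses) / len(losses) if losses else 0.0

def nt_xent_loss(view_a, view_b, temperature=0.5):
    # NT-Xent: view_a[i] and view_b[i] form a positive pair; every other
    # embedding in the 2N-sized batch acts as a negative.
    z = [normalize(e) for e in view_a + view_b]
    n = len(view_a)
    total = 0.0
    for i in range(2 * n):
        j = (i + n) % (2 * n)  # index of the positive partner
        logits = [cosine(z[i], z[k]) / temperature
                  for k in range(2 * n) if k != i]
        m = max(logits)  # log-sum-exp with max subtraction for stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - cosine(z[i], z[j]) / temperature
    return total / (2 * n)

def joint_loss(view_a, view_b, labels, weight=1.0):
    # Assumed combination: triplet loss over both views plus weighted NT-Xent.
    triplet = batch_hard_triplet_loss(view_a + view_b, labels + labels)
    return triplet + weight * nt_xent_loss(view_a, view_b)
```

In this sketch the triplet term uses speaker labels, while the NT-Xent term needs only the pairing between views, which is what makes it self-supervised. How the paper actually weights or schedules the two terms is not stated in the abstract.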
Related papers
- Incorporating Talker Identity Aids With Improving Speech Recognition In Adversarial Environments (2024)
- Whisper-PMFA: Partial Multi-scale Feature Aggregation For Speaker Verification Using Whisper Models (2024)
- On The Transferability Of Whisper-based Representations For "in-the-wild" Cross-task Downstream Speech Applications (2023)
- Recursive Whitening Transformation For Speaker Recognition On Language Mismatched Condition (2017)
- Whispervc: Decoupled Cross-domain Alignment And Speech Generation For Low-resource Whisper-to-normal Conversion (2025)
- End-to-end Whisper To Natural Speech Conversion Using Modified Transformer Network (2020)
- Multilingual DistilWhisper: Efficient Distillation Of Multi-task Speech Models Via Language-specific Experts (2023)
- Cross-lingual Transfer Learning For Speech Translation (2024)