Whisper-PMFA: Partial Multi-scale Feature Aggregation For Speaker Verification Using Whisper Models
2024 · Yiyang Zhao, Shuai Wang, Guangzhi Sun, et al.
Abstract
This paper proposes applying Whisper, a large-scale pre-trained model for automatic speech recognition, to speaker verification. A partial multi-scale feature aggregation (PMFA) approach, based on a subset of Whisper encoder blocks, is proposed to derive highly discriminative speaker embeddings. Experimental results demonstrate that the middle-to-later blocks of the Whisper encoder retain more speaker information. On the VoxCeleb1 and CN-Celeb1 datasets, our system achieves equal error rates (EERs) of 1.42% and 8.23%, respectively, which are absolute EER reductions of 0.58% and 1.81% over the ECAPA-TDNN baseline and of 0.46% and 0.97% over the ResNet34 baseline. Furthermore, our results indicate that using Whisper models trained on multilingual data can effectively enhance the model's robustness across languages. Finally, the low-rank adaptation (LoRA) approach is evaluated; it reduces the number of trainable model parameters by a factor of approximately 45 while increasing EER by only 0.2%.
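The aggregation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which blocks are selected or how frames are pooled, so the block range, the plain mean-and-standard-deviation pooling, and the names `partial_aggregate` and `stats_pool` are all assumptions made for illustration. No Whisper model is loaded; toy nested lists stand in for per-block encoder outputs.

```python
import math

def partial_aggregate(block_outputs, start, end):
    """Concatenate per-frame features from blocks start..end (inclusive, 0-based).

    block_outputs[b][t] is the feature vector of frame t at encoder block b.
    Hypothetical helper: the real block range is not given in the abstract.
    """
    selected = block_outputs[start:end + 1]
    num_frames = len(selected[0])
    # For every frame, join the feature vectors of the selected blocks end to end.
    return [
        [value for block in selected for value in block[t]]
        for t in range(num_frames)
    ]

def stats_pool(frames):
    """Collapse a (frames x dims) sequence into one vector: means, then stds.

    Assumption: simple statistics pooling, used here only as a stand-in
    for whatever pooling layer the paper actually uses.
    """
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dims)
    ]
    return means + stds

# Toy stand-in for encoder outputs: 6 blocks, 4 frames, 3 features per frame.
toy_blocks = [
    [[float(b + t + d) for d in range(3)] for t in range(4)]
    for b in range(6)
]

# Keep only the later half of the blocks (3..5) instead of all six.
aggregated = partial_aggregate(toy_blocks, start=3, end=5)
embedding = stats_pool(aggregated)

print(len(aggregated), len(aggregated[0]), len(embedding))
```

With three selected blocks of three features each, every frame becomes a 9-dimensional vector, and pooling over time yields an 18-dimensional utterance-level vector. In a real system a trained projection layer would typically map this to the final speaker embedding.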
Related papers
- MFA-Conformer: Multi-scale Feature Aggregation Conformer For Automatic Speaker Verification (2022) · 15.46
- Whisper Speaker Identification: Leveraging Pre-trained Multilingual Transformers For Robust Speaker Embeddings (2025) · 0.00
- DQ-Whisper: Joint Distillation And Quantization For Efficient Multilingual Speech Recognition (2023) · 4.52
- M2R-Whisper: Multi-stage And Multi-scale Retrieval Augmentation For Enhancing Whisper (2024) · 6.77
- Improving Multi-scale Aggregation Using Feature Pyramid Module For Robust Speaker Verification Of Variable-duration Utterances (2020) · 10.48
- Multilingual DistilWhisper: Efficient Distillation Of Multi-task Speech Models Via Language-specific Experts (2023) · 8.09
- Whisper-LM: Improving ASR Models With Language Models For Low-resource Languages (2025) · 3.29
- Improving Speaker Representations Using Contrastive Losses On Multi-scale Features (2024) · 0.00