Improving Transformer-based Networks With Locality For Automatic Speaker Verification
2023 · Mufan Sang, Yong Zhao, Gang Liu, et al.
Abstract
Recently, Transformer-based architectures have been explored for speaker embedding extraction. Although the Transformer employs the self-attention mechanism to efficiently model the global interaction between token embeddings, it is inadequate for capturing short-range local context, which is essential for the accurate extraction of speaker information. In this study, we enhance the Transformer with locality modeling in two directions. First, we propose the Locality-Enhanced Conformer (LE-Conformer) by introducing depth-wise convolution and channel-wise attention into the Conformer blocks. Second, we present the Speaker Swin Transformer (SST) by adapting the Swin Transformer, originally proposed for vision tasks, into a speaker embedding network. We evaluate the proposed approaches on the VoxCeleb datasets and a large-scale Microsoft internal multilingual (MS-internal) dataset. The proposed models achieve 0.75% EER on the VoxCeleb 1 test set, outperforming the previously proposed
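The equal error rate (EER) quoted in the abstract is the operating point at which the false acceptance rate and the false rejection rate are equal. The following is a minimal sketch of how such a figure can be approximated from a list of verification trials. It assumes similarity scores where a higher value means "more likely the same speaker". The function name `equal_error_rate` and the toy trial list are hypothetical illustrations, not the authors' evaluation code, and the coarse threshold sweep stands in for the interpolated ROC computation that evaluation toolkits typically use.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER from trial scores and labels.

    labels: 1 for a same-speaker (target) trial, 0 for a different-speaker trial.
    Assumption: a higher score means the system is more confident of a match.
    """
    targets = [s for s, y in zip(scores, labels) if y == 1]
    nontargets = [s for s, y in zip(scores, labels) if y == 0]
    if not targets or not nontargets:
        raise ValueError("need at least one target and one non-target trial")

    best_gap, eer = float("inf"), None
    # Try every observed score as the accept threshold.
    for threshold in sorted(set(scores)):
        # False rejection: target trials scored below the threshold.
        frr = sum(s < threshold for s in targets) / len(targets)
        # False acceptance: non-target trials scored at or above the threshold.
        far = sum(s >= threshold for s in nontargets) / len(nontargets)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy trial list (made-up numbers, for illustration only).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
print(f"{equal_error_rate(scores, labels):.2%}")  # prints 25.00%
```

On this toy list, one target trial scores below one non-target trial, so at the crossing threshold one of the four trials in each class is misclassified.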
Related papers
- An Attention-based Backend Allowing Efficient Fine-tuning Of Transformer Models For Speaker Verification (2022)
- Investigation Of Speaker-adaptation Methods In Transformer Based ASR (2020)
- Deep Speaker Embedding Learning With Multi-level Pooling For Text-independent Speaker Verification (2019)
- An Effective Transformer-based Contextual Model And Temporal Gate Pooling For Speaker Identification (2023)
- T-vectors: Weakly Supervised Speaker Identification Using Hierarchical Transformer Model (2020)
- Transformers With Convolutional Context For ASR (2019)
- Unified Hypersphere Embedding For Speaker Recognition (2018)
- Editnet: A Lightweight Network For Unsupervised Domain Adaptation In Speaker Verification (2022)