An Effective Transformer-based Contextual Model And Temporal Gate Pooling For Speaker Identification
2023 · Harunori Kawano, Sota Shimizu
Abstract
Wav2vec2 has achieved success in applying the Transformer architecture and self-supervised learning to speech recognition. Recently, these techniques have come to be used not only for speech recognition but across speech processing as a whole. This paper introduces an effective end-to-end speaker identification model that applies a Transformer-based contextual model. We explored the relationship between the hyper-parameters and performance in order to identify the structure of an effective model. Furthermore, we propose a pooling method, Temporal Gate Pooling, with powerful learning ability for speaker identification. We applied Conformer as the encoder and BEST-RQ for pre-training, and evaluated the model on the VoxCeleb1 speaker identification task. The proposed method achieved an accuracy of 87.1% with 28.5M parameters, comparable to the accuracy of wav2vec2 with 317.7M parameters. Code is available at https://github.com/HarunoriKawano/speaker-identification-with-tgp.
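The abstract names Temporal Gate Pooling but does not say how it is computed. As a rough illustration of the general idea of gating frames before pooling over time, here is a generic sigmoid-gated mean in plain Python. This is an assumption-laden sketch, not the paper's formulation: the function name `gated_temporal_pool`, the single linear gate, and the weighted-mean form are all hypothetical. The linked repository is the place to look for the actual method.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_temporal_pool(frames, gate_weights, gate_bias=0.0):
    """Pool a T x D sequence of frame features into a single D-dim vector.

    Hypothetical illustration only (not the paper's Temporal Gate Pooling):
    each frame receives a scalar gate in (0, 1) from a linear projection,
    and the output is the gate-weighted mean over the time axis.
    """
    if not frames:
        raise ValueError("frames must be non-empty")
    dim = len(frames[0])
    # One scalar gate per frame: sigmoid(w . x_t + b)
    gates = [
        sigmoid(sum(w * x for w, x in zip(gate_weights, frame)) + gate_bias)
        for frame in frames
    ]
    total = sum(gates)
    # Gate-weighted average of each feature dimension across time
    return [
        sum(g * frame[d] for g, frame in zip(gates, frames)) / total
        for d in range(dim)
    ]


# With zero gate weights every frame gets the same gate (0.5),
# so the result reduces to ordinary mean pooling.
uniform = gated_temporal_pool([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0])

# With a non-zero gate weight, frames that score higher dominate the pool.
skewed = gated_temporal_pool([[10.0, 0.0], [-10.0, 1.0]], [1.0, 0.0])
```

In a trained model the gate weights would be learned together with the encoder, so the pooling layer can decide which frames carry speaker information; here they are fixed by hand only to show the mechanics.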