Frequency And Temporal Convolutional Attention For Text-independent Speaker Recognition
2019 · Sarthak Yadav, Atul Rai
Abstract
The majority of recent approaches to text-independent speaker recognition apply attention or similar techniques to aggregate the frame-level feature descriptors generated by a deep neural network (DNN) front-end. In this paper, we propose methods of convolutional attention for independently modelling temporal and frequency information in a convolutional neural network (CNN) based front-end. Our system utilizes convolutional block attention modules (CBAMs) [1], appropriately modified to accommodate spectrogram inputs. The proposed CNN front-end fitted with the proposed convolutional attention modules outperforms the no-attention and spatial-CBAM baselines by a significant margin on the VoxCeleb [2, 3] speaker verification benchmark, and our best model achieves an equal error rate of 2.031% on the VoxCeleb1 test set, improving the existing state-of-the-art result by a significant margin. For a more thorough assessment of the effects of frequency and temporal attention in real-world cond
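The abstract reports its headline result as an equal error rate (EER), the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of how an EER can be approximated from a list of verification trials. It is not the authors' evaluation code: the function name `equal_error_rate`, the threshold sweep, and the toy scores are assumptions made for illustration, and benchmark scoring tools may interpolate between thresholds instead of picking the closest one.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER of a set of verification trials (illustrative helper).

    scores: similarity scores, where a higher score means "same speaker".
    labels: 1 for a target (same-speaker) trial, 0 for an impostor trial.
    """
    targets = [s for s, y in zip(scores, labels) if y == 1]
    impostors = [s for s, y in zip(scores, labels) if y == 0]
    if not targets or not impostors:
        raise ValueError("need at least one target and one impostor trial")

    # Candidate thresholds: every observed score, plus one value above the
    # maximum so the "reject everything" operating point is also considered.
    candidates = sorted(set(scores)) + [max(scores) + 1.0]

    best_gap, eer = None, None
    for threshold in candidates:
        # False acceptance rate: impostor trials accepted at this threshold.
        far = sum(s >= threshold for s in impostors) / len(impostors)
        # False rejection rate: target trials rejected at this threshold.
        frr = sum(s < threshold for s in targets) / len(targets)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2.0
    return eer


# Toy trial list (made-up numbers, not VoxCeleb data).
demo_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
demo_labels = [1, 1, 1, 0, 1, 0, 0, 0]
print(f"EER = {100 * equal_error_rate(demo_scores, demo_labels):.1f}%")
```

On a real benchmark the trial list contains many thousands of pairs, so the gap between adjacent thresholds is small and this nearest-threshold approximation is typically close to an interpolated value.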
Related papers
- Convolution-based Channel-frequency Attention For Text-independent Speaker Verification (2022) 7.50
- Duality Temporal-channel-frequency Attention Enhanced Speaker Representation Learning (2021) 5.24
- Multi-frequency Information Enhanced Channel Attention Module For Speaker Representation Learning (2022) 0.00
- Self Multi-head Attention For Speaker Recognition (2019) 13.84
- Attention And DCT Based Global Context Modeling For Text-independent Speaker Recognition (2022) 7.50
- End-to-end Attention Based Text-dependent Speaker Verification (2017) 14.87
- An Improved Deep Neural Network For Modeling Speaker Characteristics At Different Temporal Scales (2020) 6.34
- Multi-stream Convolutional Neural Network With Frequency Selection For Robust Speaker Verification (2020) 3.58