Speaker Representation Learning Using Global Context Guided Channel And Time-frequency Transformations
2020 · Wei Xia, John H. L. Hansen
Abstract
In this study, we propose global context guided channel and time-frequency transformations to model the long-range, non-local time-frequency dependencies and channel variances in speaker representations. We use global context information to enhance important channels and to recalibrate salient time-frequency locations by computing the similarity between the global context and local features. The proposed modules, together with a popular ResNet-based model, are evaluated on the VoxCeleb1 dataset, a large-scale speaker verification corpus collected in the wild. This lightweight block can be easily incorporated into a CNN model with little additional computational cost, and it improves speaker verification performance by a large margin over both the baseline ResNet-LDE model and the Squeeze-and-Excitation block. Detailed ablation studies are also performed to analyze various factors that may impact the performance of the proposed modules. We find that by employing
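The gating idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes the global context is a plain average over all time-frequency positions, uses no learned weights, and takes a scaled dot product as the similarity measure. All function names are hypothetical, and nested Python lists stand in for a (channels, time, frequency) feature map to keep the example dependency-free.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def global_context(feat):
    """Average each channel over all time-frequency positions.

    Assumption: simple mean pooling; the paper may pool differently.
    feat is indexed as feat[channel][time][frequency].
    """
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feat]


def channel_transform(feat, context):
    """Scale every channel by a gate derived from its global context value."""
    gates = [sigmoid(g) for g in context]
    return [[[v * gate for v in row] for row in ch] for ch, gate in zip(feat, gates)]


def time_frequency_transform(feat, context):
    """Reweight each time-frequency location by its similarity to the context.

    Similarity here is a dot product across channels, scaled by sqrt(C)
    (an illustrative choice), then squashed to (0, 1) with a sigmoid.
    """
    n_ch, n_t, n_f = len(feat), len(feat[0]), len(feat[0][0])
    scale = math.sqrt(n_ch)
    out = [[[0.0] * n_f for _ in range(n_t)] for _ in range(n_ch)]
    for t in range(n_t):
        for f in range(n_f):
            sim = sum(context[c] * feat[c][t][f] for c in range(n_ch)) / scale
            gate = sigmoid(sim)
            for c in range(n_ch):
                out[c][t][f] = feat[c][t][f] * gate
    return out


# Toy feature map: 2 channels, 2 time steps, 2 frequency bins.
feat = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 0.0]],
]
ctx = global_context(feat)
channel_out = channel_transform(feat, ctx)
tf_out = time_frequency_transform(feat, ctx)
print(ctx)
```

In this toy version, locations whose local features align more strongly with the global context receive a gate closer to 1, while weakly aligned locations are attenuated. A trained model would normally place learned projections before each gate.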
Related papers
- Attention And DCT Based Global Context Modeling For Text-independent Speaker Recognition (2022)
- Duality Temporal-channel-frequency Attention Enhanced Speaker Representation Learning (2021)
- Multi-frequency Information Enhanced Channel Attention Module For Speaker Representation Learning (2022)
- ContextNet: Improving Convolutional Neural Networks For Automatic Speech Recognition With Global Context (2020)
- Improving Transformer-based Networks With Locality For Automatic Speaker Verification (2023)
- RSKNet-MTSP: Effective And Portable Deep Architecture For Speaker Verification (2021)
- MACCIF-TDNN: Multi Aspect Aggregation Of Channel And Context Interdependence Features In TDNN-based Speaker Verification (2021)
- CAM++: A Fast And Efficient Network For Speaker Verification Using Context-aware Masking (2023)