Attention-based Conditioning Methods Using Variable Frame Rate For Style-robust Speaker Verification
2022 · Amber Afshan, Abeer Alwan
Abstract
We propose an approach to extract speaker embeddings that are robust to speaking style variations in text-independent speaker verification. Typically, speaker embedding extraction includes training a DNN for speaker classification and using the bottleneck features as speaker representations. Such a network has a pooling layer to transform frame-level to utterance-level features by calculating statistics over all utterance frames, with equal weighting. However, self-attentive embeddings perform weighted pooling such that the weights correspond to the importance of the frames in a speaker classification task. Entropy can capture acoustic variability due to speaking style variations. Hence, an entropy-based variable frame rate vector is proposed as an external conditioning vector for the self-attention layer to provide the network with information that can address style effects. This work explores five different approaches to conditioning. The best conditioning approach, concatenation wit
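The contrast the abstract draws between equal-weight pooling and attention-weighted pooling can be illustrated with a minimal sketch. It is not the authors' implementation: the scoring function, the weights `w`, and the per-frame conditioning values `cond` are hypothetical stand-ins, and only the mean is pooled for brevity. The sketch assumes one conditioning scalar per frame, concatenated to the frame features before the attention score is computed.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mean_pool(frames):
    """Equal-weight pooling: average each feature dimension over all frames."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / n for d in range(dim)]

def attentive_pool(frames, cond, w):
    """Weighted pooling with attention scores computed from [frame ; cond].

    frames: list of T frame-level feature vectors, each of length D
    cond:   list of T scalars (hypothetical per-frame conditioning values)
    w:      hypothetical scoring weights of length D + 1
    """
    scores = []
    for f, c in zip(frames, cond):
        x = f + [c]  # concatenate the conditioning value to the frame features
        scores.append(math.tanh(sum(wi * xi for wi, xi in zip(w, x))))
    alphas = softmax(scores)  # attention weights, one per frame, summing to 1
    dim = len(frames[0])
    pooled = [sum(a * f[d] for a, f in zip(alphas, frames)) for d in range(dim)]
    return pooled, alphas

# Toy data: three frames with two features each (illustrative values only).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cond = [0.1, 0.9, 0.5]

# With all-zero scoring weights every frame gets the same attention weight,
# so attentive pooling reduces to the plain mean.
uniform_pooled, uniform_alphas = attentive_pool(frames, cond, [0.0, 0.0, 0.0])

# With weight only on the conditioning value, the frame with the largest
# conditioning value receives the largest attention weight.
cond_pooled, cond_alphas = attentive_pool(frames, cond, [0.0, 0.0, 2.0])
```

In this toy setup the conditioning value alone can shift which frames dominate the utterance-level vector, which is the mechanism an external conditioning vector relies on. A trained system would learn the scoring weights jointly with the speaker classifier.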
Related papers
- Variable Frame Rate-based Data Augmentation To Handle Speaking-style Variability For Automatic Speaker Verification (2020) · 3.58
- End-to-end Attention Based Text-dependent Speaker Verification (2017) · 14.87
- Attentive Statistics Pooling For Deep Speaker Embedding (2018) · 18.88
- Rethinking Session Variability: Leveraging Session Embeddings For Session Robustness In Speaker Verification (2023) · 5.24
- Disentangled Speaker And Nuisance Attribute Embedding For Robust Speaker Verification (2020) · 8.60
- Deep Representation Decomposition For Rate-invariant Speaker Verification (2022) · 2.26
- Deep Segment Attentive Embedding For Duration Robust Speaker Verification (2018) · 2.26
- Self Multi-head Attention For Speaker Recognition (2019) · 13.84