Phonetic-attention Scoring For Deep Speaker Features In Speaker Verification
2018 · Lantian Li, Zhiyuan Tang, Ying Shi, et al.
Abstract
Recent studies have shown that frame-level deep speaker features can be derived from a deep neural network with the training target set to discriminate speakers by a short speech segment. By pooling the frame-level features, utterance-level representations, called d-vectors, can be derived and used in the automatic speaker verification (ASV) task. This simple average pooling, however, is inherently sensitive to the phonetic content of the utterance. An interesting idea borrowed from machine translation is the attention-based mechanism, where the contribution of an input word to the translation at a particular time is weighted by an attention score. This score reflects the relevance of the input word to the present translation. We can use the same idea to align utterances with different phonetic contents. This paper proposes a phonetic-attention scoring approach for d-vector systems. In this approach, an attention score is computed for each frame pair. This score reflects the similarit
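The contrast the abstract draws between average pooling and frame-pair attention can be illustrated with a minimal sketch. The first two functions follow the abstract directly: frame-level features are averaged into a d-vector, and two d-vectors are compared. Everything in `attention_score` is an assumption made for illustration, not the paper's formulation: the per-frame phonetic vectors (`enroll_phone`, `test_phone`), the use of cosine similarity, the softmax over all frame pairs, and the `temperature` parameter are all hypothetical choices.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def d_vector(frames):
    """Average-pool frame-level speaker features into one utterance-level vector."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def average_pooling_score(enroll_frames, test_frames):
    """Baseline: compare the two utterances through their pooled d-vectors."""
    return cosine(d_vector(enroll_frames), d_vector(test_frames))


def attention_score(enroll_frames, test_frames, enroll_phone, test_phone,
                    temperature=1.0):
    """Hypothetical frame-pair scoring, not the paper's exact method.

    Each (enrollment frame, test frame) pair gets a weight from a softmax
    over the similarity of the frames' assumed phonetic vectors; the final
    score is the weighted sum of the pairwise speaker-feature similarities.
    """
    logits = [[cosine(p, q) / temperature for q in test_phone]
              for p in enroll_phone]
    peak = max(max(row) for row in logits)  # subtract for numerical stability
    expo = [[math.exp(x - peak) for x in row] for row in logits]
    total = sum(sum(row) for row in expo)
    return sum(
        (expo[i][j] / total) * cosine(enroll_frames[i], test_frames[j])
        for i in range(len(enroll_frames))
        for j in range(len(test_frames))
    )
```

With a small `temperature`, the weights concentrate on frame pairs whose assumed phonetic vectors match, so mismatched pairs contribute little to the utterance-level score; with a large one, all pairs are weighted almost equally.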
Related papers
- End-to-end Attention Based Text-dependent Speaker Verification (2017)
- Attentive Statistics Pooling For Deep Speaker Embedding (2018)
- Self-adaptive Soft Voice Activity Detection Using Deep Neural Networks For Robust Speaker Verification (2019)
- Parameter-free Attentive Scoring For Speaker Verification (2022)
- Exploring A Unified Attention-based Pooling Framework For Speaker Verification (2018)
- Attention Mechanism In Speaker Recognition: What Does It Learn In Deep Speaker Embedding? (2018)
- Text-independent Speaker Verification Based On Deep Neural Networks And Segmental Dynamic Time Warping (2018)
- Self Multi-head Attention For Speaker Recognition (2019)