Deep Speaker Embedding Learning With Multi-level Pooling For Text-independent Speaker Verification
2019 · Yun Tang, Guohong Ding, Jing Huang, et al.
Abstract
This paper aims to improve the widely used deep speaker embedding x-vector model. We propose the following improvements: (1) a hybrid neural network structure using both a time delay neural network (TDNN) and long short-term memory (LSTM) neural networks to generate complementary speaker information at different levels; (2) a multi-level pooling strategy to collect speaker information from both TDNN and LSTM layers; (3) a regularization scheme on the speaker embedding extraction layer to make the extracted embeddings suitable for the following fusion step. The synergy of these improvements is shown on the NIST SRE 2016 eval test (with a 19% EER reduction) and the SRE 2018 dev test (with a 9% EER reduction), as well as a reduction of more than 10% in DCF scores on these two test sets over the x-vector baseline.
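The multi-level pooling idea in point (2) can be illustrated with a minimal sketch. The abstract does not give the pooling function or the layer sizes, so everything specific here is an assumption: mean-and-standard-deviation statistics pooling (as in the standard x-vector setup), the hypothetical helper names `stats_pool` and `multi_level_pool`, the feature dimensions 512 and 256, and random arrays standing in for real TDNN and LSTM outputs. It is not the authors' implementation.

```python
import numpy as np


def stats_pool(frames):
    """Statistics pooling: mean and standard deviation over the time axis.

    frames: array of shape (T, D), one D-dimensional vector per frame.
    Returns a vector of shape (2 * D,).
    """
    mean = frames.mean(axis=0)
    std = frames.std(axis=0)
    return np.concatenate([mean, std])


def multi_level_pool(tdnn_out, lstm_out):
    """Pool each level separately, then concatenate the utterance-level vectors.

    Assumption: both levels use the same pooling function and are joined by
    simple concatenation; the paper may combine them differently.
    """
    return np.concatenate([stats_pool(tdnn_out), stats_pool(lstm_out)])


# Random stand-ins for frame-level outputs of a 200-frame utterance.
rng = np.random.default_rng(0)
tdnn_out = rng.standard_normal((200, 512))  # hypothetical TDNN layer output
lstm_out = rng.standard_normal((200, 256))  # hypothetical LSTM layer output

pooled = multi_level_pool(tdnn_out, lstm_out)
print(pooled.shape)  # (1536,) = 2 * 512 + 2 * 256
```

Pooling each level on its own keeps the utterance-level vector a fixed size regardless of the number of frames, which is what lets later layers consume variable-length utterances.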
Related papers
- An Improved Deep Neural Network For Modeling Speaker Characteristics At Different Temporal Scales (2020) 6.34
- Multi-task Learning With High-order Statistics For X-vector Based Text-independent Speaker Verification (2019) 8.35
- Y-vector: Multiscale Waveform Encoder For Speaker Embedding (2020) 8.60
- Deep Speaker Embeddings For Far-field Speaker Recognition On Short Utterances (2020) 11.29
- Deep Neural Network Embeddings With Gating Mechanisms For Text-independent Speaker Verification (2019) 8.82
- Attentive Statistics Pooling For Deep Speaker Embedding (2018) 18.88
- Triplet Based Embedding Distance And Similarity Learning For Text-independent Speaker Verification (2019) 5.24
- On Deep Speaker Embeddings For Text-independent Speaker Recognition (2018) 11.93