An Improved Deep Neural Network For Modeling Speaker Characteristics At Different Temporal Scales
2020 · Bin Gu, Wu Guo
Abstract
This paper presents an improved deep embedding learning method based on a convolutional neural network (CNN) for text-independent speaker verification. Two improvements are proposed for x-vector embedding learning: (1) multi-scale convolution (MSCNN) is adopted in the frame-level layers to capture complementary speaker information in different receptive fields; (2) a Baum-Welch statistics attention (BWSA) mechanism is applied in the temporal pooling layer to integrate more useful long-term speaker characteristics. Experiments are carried out on the NIST SRE16 evaluation set. The results demonstrate the effectiveness of MSCNN and show that the proposed BWSA can further improve the performance of the DNN embedding system.
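The two ideas named in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the kernel widths, the single input channel, and the function names are assumptions made for illustration. The abstract does not say how BWSA derives its attention weights, so the per-frame `scores` are treated here as a given input rather than computed from Baum-Welch statistics.

```python
import math

def conv1d_same(frames, kernel):
    """Zero-padded 1-D cross-correlation over time (odd kernel width assumed)."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(frames) + [0.0] * half
    return [sum(k * padded[t + i] for i, k in enumerate(kernel))
            for t in range(len(frames))]

def multi_scale_features(frames, kernels):
    """Apply kernels of different widths in parallel and concatenate the
    branch outputs per frame, so each frame sees several receptive fields."""
    branches = [conv1d_same(frames, k) for k in kernels]
    return [list(column) for column in zip(*branches)]

def attentive_stats_pooling(features, scores):
    """Softmax the per-frame scores, then return the attention-weighted
    mean and standard deviation of every feature dimension."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shift for numerical stability
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(features[0])
    mean = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    var = [sum(w * (f[d] - mean[d]) ** 2 for w, f in zip(weights, features))
           for d in range(dim)]
    std = [math.sqrt(max(v, 0.0)) for v in var]
    return mean + std  # utterance-level vector: means followed by stds

# Toy input: five frames of one feature, three hypothetical kernel widths.
frames = [0.0, 1.0, 2.0, 3.0, 4.0]
kernels = [[1.0], [1 / 3] * 3, [0.2] * 5]
feats = multi_scale_features(frames, kernels)
pooled = attentive_stats_pooling(feats, scores=[0.0] * len(frames))
```

With uniform scores the pooling reduces to ordinary mean and standard-deviation pooling; non-uniform scores shift the statistics toward the frames the attention mechanism favours.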
Related papers
- Deep Speaker Embedding Learning With Multi-level Pooling For Text-independent Speaker Verification (2019) 0.00
- Deep Neural Network Embeddings With Gating Mechanisms For Text-independent Speaker Verification (2019) 8.82
- Multi-task Learning With High-order Statistics For X-vector Based Text-independent Speaker Verification (2019) 8.35
- Y-vector: Multiscale Waveform Encoder For Speaker Embedding (2020) 8.60
- On Deep Speaker Embeddings For Text-independent Speaker Recognition (2018) 11.93
- Unified Hypersphere Embedding For Speaker Recognition (2018) 0.00
- Embeddings For DNN Speaker Adaptive Training (2019) 7.16
- How To Improve Your Speaker Embeddings Extractor In Generic Toolkits (2018) 9.76