Y-vector: Multiscale Waveform Encoder For Speaker Embedding
2020 · Ge Zhu, Fei Jiang, Zhiyao Duan
Abstract
State-of-the-art text-independent speaker verification systems typically use cepstral features or filter bank energies as speech features. Recent studies have attempted to extract speaker embeddings directly from raw waveforms and have shown competitive results. In this paper, we propose a novel multi-scale waveform encoder that uses three convolution branches with different time scales to compute speech features from the waveform. These features are then processed by squeeze-and-excitation blocks, a multi-level feature aggregator, and a time delay neural network (TDNN) to compute the speaker embedding. We show that the proposed embeddings outperform existing raw-waveform-based speaker embeddings on speaker verification by a large margin. A further analysis of the learned filters shows that the multi-scale encoder attends to different frequency bands at its different scales while producing a flatter overall frequency response than any of its single-scale counterparts.
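To make the multi-scale idea concrete, the following is a minimal sketch of three convolution branches applied to one waveform at a shared stride, with the per-branch outputs concatenated frame by frame. It is not the authors' implementation: the kernel sizes, stride, filter count, random filter weights, ReLU nonlinearity, and all function names are assumptions chosen for illustration, and the squeeze-and-excitation blocks, feature aggregator, and TDNN are omitted.

```python
import math
import random


def conv1d(signal, kernel, stride):
    """Valid (unpadded) strided 1-D cross-correlation with a single kernel."""
    k = len(kernel)
    return [
        sum(signal[start + i] * kernel[i] for i in range(k))
        for start in range(0, len(signal) - k + 1, stride)
    ]


def pad_for(signal, kernel_size, stride):
    """Zero-pad so a branch yields len(signal) // stride frames.

    Assumes kernel_size >= stride and len(signal) divisible by stride.
    """
    total = kernel_size - stride
    left = total // 2
    return [0.0] * left + list(signal) + [0.0] * (total - left)


def multiscale_encode(signal, branches, stride):
    """Run every branch over the signal and concatenate channels per frame.

    branches: list of filter banks; each bank is a list of equal-length kernels.
    Returns a list of frames, each a list of non-negative features.
    """
    per_branch = []
    for bank in branches:
        padded = pad_for(signal, len(bank[0]), stride)
        # One output channel per kernel, followed by ReLU (an assumption).
        channels = [
            [max(0.0, v) for v in conv1d(padded, kernel, stride)]
            for kernel in bank
        ]
        per_branch.append(channels)
    n_frames = len(per_branch[0][0])
    return [
        [channel[t] for channels in per_branch for channel in channels]
        for t in range(n_frames)
    ]


def random_bank(n_filters, kernel_size, rng):
    """Hypothetical stand-in for learned filters: small random kernels."""
    return [
        [rng.uniform(-1.0, 1.0) / kernel_size for _ in range(kernel_size)]
        for _ in range(n_filters)
    ]


rng = random.Random(0)
STRIDE = 4                  # assumed shared hop size, in samples
KERNEL_SIZES = (8, 16, 32)  # assumed short, medium, and long time scales
FILTERS_PER_BRANCH = 2      # assumed; real encoders use many more

branches = [random_bank(FILTERS_PER_BRANCH, k, rng) for k in KERNEL_SIZES]
waveform = [math.sin(2 * math.pi * 5 * n / 160) for n in range(160)]
features = multiscale_encode(waveform, branches, STRIDE)
```

Sharing one stride and padding each branch separately keeps the three outputs frame-aligned, so they can be stacked along the channel axis before any later processing stage. Other alignment schemes are possible.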
Related papers
- Deep Speaker Embedding Learning With Multi-level Pooling For Text-independent Speaker Verification (2019)
- An Improved Deep Neural Network For Modeling Speaker Characteristics At Different Temporal Scales (2020)
- Improved Rawnet With Feature Map Scaling For Text-independent Speaker Verification Using Raw Waveforms (2020)
- Multi-task Learning With High-order Statistics For X-vector Based Text-independent Speaker Verification (2019)
- Unified Hypersphere Embedding For Speaker Recognition (2018)
- A Comparative Re-assessment Of Feature Extractors For Deep Speaker Embeddings (2020)
- Deep Neural Network Embeddings With Gating Mechanisms For Text-independent Speaker Verification (2019)
- Multi-channel Speaker Verification For Single And Multi-talker Speech (2020)