Utterance-level Aggregation For Speaker Recognition In The Wild
2019 · Weidi Xie, Arsha Nagrani, Joon Son Chung, et al.
Abstract
The objective of this paper is speaker recognition "in the wild", where utterances may be of variable length and may also contain irrelevant signals. Crucial elements in the design of deep networks for this task are the type of trunk (frame-level) network and the method of temporal aggregation. We propose a powerful speaker recognition deep network, using a "thin-ResNet" trunk architecture and a dictionary-based NetVLAD or GhostVLAD layer to aggregate features across time, that can be trained end-to-end. We show that our network achieves state-of-the-art performance by a significant margin on the VoxCeleb1 test set for speaker recognition, whilst requiring fewer parameters than previous methods. We also investigate the effect of utterance length on performance, and conclude that for "in the wild" data, a longer length is beneficial.
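To make the temporal aggregation step concrete, here is a minimal NumPy sketch of NetVLAD-style pooling with optional ghost clusters. It is an illustration under stated assumptions, not the authors' implementation: the function name `vlad_aggregate`, the parameter shapes, and the random inputs standing in for trunk outputs and learned parameters are all hypothetical. It shows how a variable number of frame-level features can be mapped to a fixed-length utterance descriptor.

```python
import numpy as np

def vlad_aggregate(frames, centers, weights, biases, num_ghost=0):
    """Aggregate frame-level features into one fixed-length vector.

    frames:  (T, D) frame-level features (T may vary per utterance)
    centers: (K + G, D) cluster centres, the last G being ghost clusters
    weights: (D, K + G) and biases: (K + G,) soft-assignment parameters
    All names and shapes here are illustrative assumptions.
    """
    # Soft-assign every frame to every cluster (softmax over clusters).
    logits = frames @ weights + biases
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    assign = np.exp(logits)
    assign /= assign.sum(axis=1, keepdims=True)

    # Ghost clusters take part in the softmax but are dropped afterwards,
    # so frames assigned to them contribute little to the output.
    num_real = centers.shape[0] - num_ghost
    assign = assign[:, :num_real]
    real_centers = centers[:num_real]

    # Residual sum per cluster: V_k = sum_t a_tk * (x_t - c_k).
    vlad = assign.T @ frames - assign.sum(axis=0)[:, None] * real_centers

    # L2-normalise each cluster's residual, then the flattened vector.
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12
    flat = vlad.reshape(-1)
    return flat / (np.linalg.norm(flat) + 1e-12)

# Random stand-ins for learned parameters and trunk outputs.
rng = np.random.default_rng(0)
D, K, G = 8, 4, 2
centers = rng.normal(size=(K + G, D))
weights = rng.normal(size=(D, K + G))
biases = rng.normal(size=K + G)

short_emb = vlad_aggregate(rng.normal(size=(50, D)), centers, weights, biases, num_ghost=G)
long_emb = vlad_aggregate(rng.normal(size=(200, D)), centers, weights, biases, num_ghost=G)
```

Both `short_emb` and `long_emb` have length K × D regardless of the number of input frames, which is the property that lets such a layer handle utterances of variable length. Setting `num_ghost=0` reduces the sketch to plain NetVLAD-style pooling.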
Related papers
- A Deep Neural Network For Short-segment Speaker Recognition (2019)
- Length- And Noise-aware Training Techniques For Short-utterance Speaker Recognition (2020)
- RawNeXt: Speaker Verification System For Variable-duration Utterances With Deep Layer Aggregation And Extended Dynamic Scaling Policies (2021)
- SpeakerNet: 1D Depth-wise Separable Convolutional Network For Text-independent Speaker Recognition And Verification (2020)
- DNN Based Speaker Recognition On Short Utterances (2016)
- VoxCeleb2: Deep Speaker Recognition (2018)
- Training Speaker Embedding Extractors Using Multi-speaker Audio With Unknown Speaker Boundaries (2022)
- Self-attentive Multi-layer Aggregation With Feature Recalibration And Normalization For End-to-end Speaker Verification System (2020)