DeepVOX: Discovering Features From Raw Audio For Speaker Recognition In Non-ideal Audio Signals
2020 · Anurag Chowdhury, Arun Ross
Abstract
Automatic speaker recognition algorithms typically use pre-defined filterbanks, such as Mel-Frequency and Gammatone filterbanks, for characterizing speech audio. However, it has been observed that the features extracted using these filterbanks are not resilient to diverse audio degradations. In this work, we propose a deep learning-based technique to deduce the filterbank design from vast amounts of speech audio. The purpose of such a filterbank is to extract features robust to non-ideal audio conditions, such as degraded, short duration, and multi-lingual speech. To this effect, a 1D convolutional neural network is designed to learn a time-domain filterbank called DeepVOX directly from raw speech audio. Secondly, an adaptive triplet mining technique is developed to efficiently mine the data samples best suited to train the filterbank. Thirdly, a detailed ablation study of the DeepVOX filterbanks reveals the presence of both vocal source and vocal tract characteristics in the extracted
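The idea of a time-domain filterbank applied directly to raw audio can be illustrated with a minimal sketch. This is not the DeepVOX architecture: the filters here are random placeholders standing in for kernels that a 1D convolutional network would learn, and the names `conv1d_valid` and `filterbank_features`, the filter count, tap length, frame length, and hop are all assumptions chosen for illustration.

```python
import math
import random


def conv1d_valid(signal, kernel, stride=1):
    """Cross-correlate a 1D signal with a kernel (no padding), as a 1D conv layer does."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(0, len(signal) - k + 1, stride)
    ]


def filterbank_features(signal, filters, frame_len=160, hop=80):
    """Return one log-energy vector per frame, with one entry per filter."""
    # Filter the raw waveform with every kernel in the bank.
    outputs = [conv1d_valid(signal, f) for f in filters]
    n = len(outputs[0])
    features = []
    for start in range(0, n - frame_len + 1, hop):
        frame = []
        for out in outputs:
            # Mean energy of this filter's output within the frame.
            energy = sum(x * x for x in out[start:start + frame_len]) / frame_len
            frame.append(math.log(energy + 1e-8))
        features.append(frame)
    return features


# Hypothetical setup: 8 random 41-tap filters stand in for learned kernels.
rng = random.Random(0)
num_filters, taps = 8, 41
filters = [[rng.gauss(0.0, 0.1) for _ in range(taps)] for _ in range(num_filters)]

# Synthetic "audio": 0.1 s of a 440 Hz tone sampled at 8 kHz.
sample_rate = 8000
signal = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(800)]

features = filterbank_features(signal, filters)
print(len(features), len(features[0]))  # frames, filters
```

In a learned filterbank, the kernel values would be trained by backpropagation rather than drawn at random; the framing and log-energy pooling shown here mirror how fixed filterbank features are commonly summarized and are likewise an assumption.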