Temporal Dynamic Convolutional Neural Network For Text-independent Speaker Verification And Phonemic Analysis
2021 · Seong-Hu Kim, Hyeonuk Nam, Yong-Hwa Park
Abstract
In text-independent speaker recognition, dynamic models that adapt along the time axis have been proposed to account for the phoneme-varying characteristics of speech. However, a detailed analysis of how dynamic models behave depending on phonemes has been lacking. In this paper, we propose a temporal dynamic CNN (TDY-CNN) that accounts for the temporal variation of phonemes by applying kernels that adapt optimally to each time bin. The kernel for each time bin is a weighted sum of trained basis kernels. We then analyze how the adaptive kernels behave on different phonemes in various layers. TDY-ResNet-38(x0.5) with six basis kernels improved the equal error rate (EER), the measure of speaker verification performance, by 17.3% compared to the baseline model ResNet-38(x0.5). In addition, we show that the adaptive kernels depend on phoneme groups and are more phoneme-specific at early layers. The temporal dynamic model adapts itself to phonemes without explicitly given phoneme in
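The weighted-sum mechanism described in the abstract can be illustrated with a minimal NumPy sketch. Only the idea of mixing trained basis kernels per time bin comes from the abstract. The way the mixing weights are produced here (averaging over frequency, then a single linear projection named `att_proj` and a softmax) is an assumption made for illustration, as are the function name, tensor shapes, zero padding, and the omission of a bias term.

```python
import numpy as np

def softmax(z, axis=0):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_dynamic_conv(x, basis, att_proj):
    """Sketch of a convolution whose kernel changes per time bin.

    x:        (C_in, F, T) input feature map (channels, frequency, time)
    basis:    (K, C_out, C_in, kf, kt) basis kernels, odd kf and kt
    att_proj: (K, C_in) hypothetical projection giving mixing logits
    """
    K, C_out, C_in, kf, kt = basis.shape
    _, F, T = x.shape

    # Assumed attention: pool over frequency, project, softmax over K.
    pooled = x.mean(axis=1)                    # (C_in, T)
    pi = softmax(att_proj @ pooled, axis=0)    # (K, T), columns sum to 1

    # Zero-pad so the output keeps the input's F and T.
    xp = np.pad(x, ((0, 0), (kf // 2, kf // 2), (kt // 2, kt // 2)))
    y = np.zeros((C_out, F, T))
    for t in range(T):
        # Adaptive kernel for time bin t: weighted sum of basis kernels.
        kernel_t = np.tensordot(pi[:, t], basis, axes=1)  # (C_out, C_in, kf, kt)
        for f in range(F):
            patch = xp[:, f:f + kf, t:t + kt]             # (C_in, kf, kt)
            y[:, f, t] = np.tensordot(kernel_t, patch, axes=3)
    return y, pi

# Usage with random data and six basis kernels (the count used in the abstract).
rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 10))
basis = rng.standard_normal((6, 4, 3, 3, 3))
att_proj = rng.standard_normal((6, 3))
y, pi = temporal_dynamic_conv(x, basis, att_proj)
print(y.shape, pi.shape)
```

Because convolution is linear in its kernel, the same output can be obtained by running each basis kernel as an ordinary static convolution and mixing the K outputs per time bin with the same weights. The loop form above is kept because it mirrors the description of one kernel per time bin.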
Related papers
- Decomposed Temporal Dynamic CNN: Efficient Time-adaptive Network For Text-independent Speaker Verification Explained With Speaker Activation Map (2022)
- Dynamic Kernels And Channel Attention For Low Resource Speaker Verification (2022)
- Frequency And Temporal Convolutional Attention For Text-independent Speaker Recognition (2019)
- ECAPA-TDNN: Emphasized Channel Attention, Propagation And Aggregation In TDNN Based Speaker Verification (2020)
- Improved TDNNs Using Deep Kernels And Frequency Dependent Grid-RNNs (2018)
- NeXt-TDNN: Modernizing Multi-scale Temporal Convolution Backbone For Speaker Verification (2023)
- Integrating Frequency Translational Invariance In TDNNs And Frequency Positional Information In 2D ResNets To Enhance Speaker Verification (2021)
- Deep Speaker Feature Learning For Text-independent Speaker Verification (2017)