Layer-aware TDNN: Speaker Recognition Using Multi-layer Features From Pre-trained Models
2024 · Jin Sob Kim, Hyun Joon Park, Wooseok Shin, et al.
Abstract
Recent advances in self-supervised learning (SSL) on Transformers have significantly improved speaker verification (SV) by providing domain-general speech representations. However, existing approaches have underutilized the multi-layered nature of SSL encoders. To address this limitation, we propose the layer-aware time-delay neural network (L-TDNN), which directly performs layer/frame-wise processing on the layer-wise hidden state outputs from pre-trained models, extracting fixed-size speaker vectors. L-TDNN comprises a layer-aware convolutional network, a frame-adaptive layer aggregation, and attentive statistics pooling, explicitly modeling the recognition and processing of the previously overlooked layer dimension. We evaluated L-TDNN across multiple speech SSL Transformers and diverse speech-speaker corpora against other approaches for leveraging pre-trained encoders. L-TDNN consistently demonstrated robust verification performance, achieving the lowest error rates throughout the ex
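To make the data flow described in the abstract concrete, the following is a minimal NumPy sketch of two of the named stages: a frame-adaptive aggregation over the layer axis, followed by attentive statistics pooling into a fixed-size vector. It is an illustration under stated assumptions, not the authors' implementation. The layer-aware convolutional network is omitted, the scoring vectors w_layer and w_att are hypothetical stand-ins for learned parameters, and the shapes (13 layers, 50 frames, 8 dimensions) are arbitrary.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def frame_adaptive_layer_aggregation(hidden, w_layer):
    """Collapse the layer axis with weights that differ per frame.

    hidden:  (L, T, D) layer-wise hidden states from a pre-trained encoder.
    w_layer: (D,) hypothetical scoring vector (an assumption, not from the paper).
    Returns: (T, D) frame-level features.
    """
    scores = hidden @ w_layer              # (L, T): one score per layer and frame
    weights = softmax(scores, axis=0)      # normalize over layers, separately per frame
    return (weights[..., None] * hidden).sum(axis=0)


def attentive_statistics_pooling(frames, w_att):
    """Pool a variable number of frames into a fixed-size vector.

    frames: (T, D) frame-level features.
    w_att:  (D,) hypothetical attention vector (an assumption, not from the paper).
    Returns: (2 * D,) attention-weighted mean and standard deviation.
    """
    alpha = softmax(frames @ w_att, axis=0)            # (T,) attention over frames
    mean = (alpha[:, None] * frames).sum(axis=0)
    var = (alpha[:, None] * frames ** 2).sum(axis=0) - mean ** 2
    std = np.sqrt(np.clip(var, 1e-8, None))            # clip guards against tiny negatives
    return np.concatenate([mean, std])


# Random stand-ins for encoder outputs and learned parameters.
rng = np.random.default_rng(0)
L, T, D = 13, 50, 8
hidden = rng.standard_normal((L, T, D))
w_layer = rng.standard_normal(D)
w_att = rng.standard_normal(D)

frames = frame_adaptive_layer_aggregation(hidden, w_layer)
embedding = attentive_statistics_pooling(frames, w_att)
print(frames.shape, embedding.shape)
```

Because the pooling step reduces over the frame axis, the output length depends only on D, so utterances of different durations map to vectors of the same size, which is the "fixed-size speaker vectors" property the abstract refers to.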