ECAPA-TDNN: Emphasized Channel Attention, Propagation And Aggregation In TDNN Based Speaker Verification
2020 Β· Brecht Desplanques, Jenthe Thienpondt, Kris Demuynck
Abstract
Current speaker verification techniques rely on a neural network to extract speaker representations. The successful x-vector architecture is a Time Delay Neural Network (TDNN) that applies statistics pooling to project variable-length utterances into fixed-length speaker characterizing embeddings. In this paper, we propose multiple enhancements to this architecture based on recent trends in the related fields of face verification and computer vision. Firstly, the initial frame layers can be restructured into 1-dimensional Res2Net modules with impactful skip connections. Similarly to SE-ResNet, we introduce Squeeze-and-Excitation blocks in these modules to explicitly model channel interdependencies. The SE block expands the temporal context of the frame layer by rescaling the channels according to global properties of the recording. Secondly, neural networks are known to learn hierarchical features, with each layer operating on a different level of complexity. To leverage this complemen
Authors
(none)
Tags
Stats
Related papers
- MACCIF-TDNN: Multi Aspect Aggregation Of Channel And Context Interdependence Features In Tdnn-based Speaker Verification (2021)6.77
- Next-tdnn: Modernizing Multi-scale Temporal Convolution Backbone For Speaker Verification (2023)10.07
- CAM++: A Fast And Efficient Network For Speaker Verification Using Context-aware Masking (2023)15.57
- Speechnas: Towards Better Trade-off Between Latency And Accuracy For Large-scale Speaker Verification (2021)9.76
- P-vectors: A Parallel-coupled Tdnn/transformer Network For Speaker Verification (2023)5.84
- MFA: TDNN With Multi-scale Frequency-channel Attention For Text-independent Speaker Verification With Short Utterances (2022)13.79
- End-to-end Attention Based Text-dependent Speaker Verification (2017)14.87
- DS-TDNN: Dual-stream Time-delay Neural Network With Global-aware Filter For Speaker Verification (2023)8.60