NeXt-TDNN: Modernizing Multi-scale Temporal Convolution Backbone For Speaker Verification
2023 · Hyun-Jun Heo, Ui-Hyeop Shin, Ran Lee, et al.
Abstract
In speaker verification, ECAPA-TDNN has shown remarkable improvement by utilizing a one-dimensional (1D) Res2Net block and a squeeze-and-excitation (SE) module, along with multi-layer feature aggregation (MFA). Meanwhile, in vision tasks, ConvNet structures have been modernized by referring to the Transformer, resulting in improved performance. In this paper, we present an improved block design for TDNN in speaker verification. Inspired by recent ConvNet structures, we replace the SE-Res2Net block in ECAPA-TDNN with a novel 1D two-step multi-scale ConvNeXt block, which we call TS-ConvNeXt. The TS-ConvNeXt block is constructed using two separate sub-modules: a temporal multi-scale convolution (MSC) and a frame-wise feed-forward network (FFN). This two-step design allows for flexible capturing of inter-frame and intra-frame contexts. Additionally, we introduce global response normalization (GRN) for the FFN modules to enable more selective feature propagation, similar to the SE module in ECAPA-TD
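The abstract mentions global response normalization (GRN) in the FFN modules but does not give its formula. The following is a minimal NumPy sketch of GRN as it is commonly defined for ConvNeXt-style blocks, adapted here to a (frames, channels) layout. The function name `grn_1d`, the array sizes, the `eps` value, and the assumption that the paper uses this same formulation are all illustrative and not taken from the source.

```python
import numpy as np

def grn_1d(x, gamma, beta, eps=1e-6):
    """Global response normalization over a (frames, channels) array.

    Hypothetical helper: assumes the common ConvNeXt-style GRN definition.
    """
    # Per-channel global response: L2 norm aggregated across all frames.
    gx = np.sqrt(np.sum(x ** 2, axis=0, keepdims=True))    # shape (1, C)
    # Divisive normalization: each channel relative to the mean over channels.
    nx = gx / (np.mean(gx, axis=1, keepdims=True) + eps)   # shape (1, C)
    # Learnable scale and shift applied to the recalibrated features,
    # plus a residual path back to the input.
    return gamma * (x * nx) + beta + x

# Illustrative sizes only: 200 frames, 8 channels.
rng = np.random.default_rng(0)
frames = rng.standard_normal((200, 8))
gamma = np.zeros((1, 8))   # zero scale and shift make GRN an identity map
beta = np.zeros((1, 8))
out = grn_1d(frames, gamma, beta)
```

With `gamma` and `beta` set to zero, the function returns its input unchanged, so a block using it can start out as an identity mapping. Channels with a larger global norm than average are amplified once `gamma` becomes non-zero, which is the channel-selective behavior the abstract compares to the SE module.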
Related papers
- ECAPA-TDNN: Emphasized Channel Attention, Propagation And Aggregation In TDNN Based Speaker Verification (2020) · 23.07
- MACCIF-TDNN: Multi Aspect Aggregation Of Channel And Context Interdependence Features In TDNN-based Speaker Verification (2021) · 6.77
- Layer-aware TDNN: Speaker Recognition Using Multi-layer Features From Pre-trained Models (2024) · 0.00
- RSKNet-MTSP: Effective And Portable Deep Architecture For Speaker Verification (2021) · 9.03
- P-vectors: A Parallel-coupled TDNN/Transformer Network For Speaker Verification (2023) · 5.84
- Decomposed Temporal Dynamic CNN: Efficient Time-adaptive Network For Text-independent Speaker Verification Explained With Speaker Activation Map (2022) · 0.00
- CAM++: A Fast And Efficient Network For Speaker Verification Using Context-aware Masking (2023) · 15.57
- MGFF-TDNN: A Multi-granularity Feature Fusion TDNN Model With Depth-wise Separable Module For Speaker Verification (2025) · 0.00