TitaNet: Neural Model for Speaker Representation with 1D Depth-wise Separable Convolutions and Global Context
2021 · Nithin Rao Koluguri, Taejin Park, Boris Ginsburg
Abstract
In this paper, we propose TitaNet, a novel neural network architecture for extracting speaker representations. We employ 1D depth-wise separable convolutions with Squeeze-and-Excitation (SE) layers with global context, followed by a channel-attention-based statistics pooling layer, to map variable-length utterances to a fixed-length embedding (t-vector). TitaNet is a scalable architecture and achieves state-of-the-art performance on the speaker verification task, with an equal error rate (EER) of 0.68% on the VoxCeleb1 trial file, and on speaker diarization tasks, with a diarization error rate (DER) of 1.73% on AMI-MixHeadset, 1.99% on AMI-Lapel, and 1.11% on CH109. Furthermore, we investigate various sizes of TitaNet and present a light TitaNet-S model with only 6M parameters that achieves near state-of-the-art results in diarization tasks.
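The abstract reports verification quality as an equal error rate (EER), the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of how an EER can be approximated from a list of trial scores; it is not the authors' evaluation code. The function name `equal_error_rate`, the label convention (1 = same speaker, 0 = different speaker), and the toy scores are assumptions made for illustration and are not taken from the paper.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER of a verification system.

    scores: similarity score per trial (higher = more likely same speaker).
    labels: 1 for a same-speaker (target) trial, 0 for a different-speaker trial.
    Assumed convention: a trial is accepted when score >= threshold.
    """
    n_target = sum(labels)
    n_nontarget = len(labels) - n_target
    if n_target == 0 or n_nontarget == 0:
        raise ValueError("need at least one target and one non-target trial")

    best_gap, eer = float("inf"), 1.0
    # Sweep every observed score as a candidate threshold and keep the one
    # where false acceptance and false rejection rates are closest.
    for threshold in sorted(set(scores)):
        false_accepts = sum(
            1 for s, y in zip(scores, labels) if y == 0 and s >= threshold
        )
        false_rejects = sum(
            1 for s, y in zip(scores, labels) if y == 1 and s < threshold
        )
        far = false_accepts / n_nontarget
        frr = false_rejects / n_target
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy trial list (made-up values, not data from the paper).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]
print(f"EER = {equal_error_rate(toy_scores, toy_labels):.2%}")  # EER = 25.00%
```

This quadratic sweep is chosen for readability. Evaluating a full trial file would normally use a sorted single-pass or ROC-based computation, and interpolation between thresholds can shift the result slightly.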
Related papers
- SpeakerNet: 1D Depth-wise Separable Convolutional Network for Text-independent Speaker Recognition and Verification (2020) 0.00
- A Compact End-to-end Model with Local and Global Context for Spoken Language Identification (2022) 5.84
- ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification (2020) 23.07
- A Deep Neural Network for Short-segment Speaker Recognition (2019) 12.74
- ACA-Net: Towards Lightweight Speaker Verification Using Asymmetric Cross Attention (2023) 0.00
- Speaker Representation Learning Using Global Context Guided Channel and Time-frequency Transformations (2020) 6.34
- CAM++: A Fast and Efficient Network for Speaker Verification Using Context-aware Masking (2023) 15.57
- Latent Space Representation for Multi-target Speaker Detection and Identification with a Sparse Dataset Using Triplet Neural Networks (2019) 5.24