A Compact End-to-end Model With Local And Global Context For Spoken Language Identification
2022 · Fei Jia, Nithin Rao Koluguri, Jagadeesh Balam, et al.
Abstract
We introduce TitaNet-LID, a compact end-to-end neural network for Spoken Language Identification (LID) that is based on the ContextNet architecture. TitaNet-LID employs 1D depth-wise separable convolutions and Squeeze-and-Excitation layers to effectively capture local and global context within an utterance. Despite its small size, TitaNet-LID achieves performance similar to state-of-the-art models on the VoxLingua107 dataset while being 10 times smaller. Furthermore, it can be easily adapted to new acoustic conditions and unseen languages through simple fine-tuning, achieving a state-of-the-art accuracy of 88.2% on the FLEURS benchmark. Our model is scalable and can achieve a better trade-off between accuracy and speed. TitaNet-LID performs well even on utterances shorter than 5 s, indicating its robustness to input length.
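The two building blocks named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, weight layouts, and toy inputs are assumptions made for illustration, and it omits the batch normalization, biases, strides, and residual connections a real ContextNet-style block would have. A depth-wise convolution filters each channel separately along time (local context), a point-wise convolution mixes channels at each time step, and a Squeeze-and-Excitation step rescales every channel by a gate computed from its average over the whole utterance (global context).

```python
import math


def depthwise_conv1d(x, kernels):
    """Filter each channel with its own odd-length 1D kernel ('same' zero padding).

    x: list of channels, each a list of T floats.
    kernels: one kernel (list of floats) per channel.
    Uses cross-correlation, the usual deep-learning convention.
    """
    out = []
    for channel, kernel in zip(x, kernels):
        pad = len(kernel) // 2
        padded = [0.0] * pad + list(channel) + [0.0] * pad
        out.append([
            sum(k * padded[t + i] for i, k in enumerate(kernel))
            for t in range(len(channel))
        ])
    return out


def pointwise_conv1d(x, weights):
    """Mix channels independently at each time step (a 1x1 convolution).

    weights[o][c] is the contribution of input channel c to output channel o.
    """
    num_steps = len(x[0])
    return [
        [sum(w * x[c][t] for c, w in enumerate(row)) for t in range(num_steps)]
        for row in weights
    ]


def squeeze_excite(x, w_reduce, w_expand):
    """Rescale each channel by a gate derived from utterance-level statistics.

    w_reduce: bottleneck weights, shape [hidden][channels].
    w_expand: expansion weights, shape [channels][hidden].
    """
    # Squeeze: average every channel over the full time axis.
    squeezed = [sum(ch) / len(ch) for ch in x]
    # Excite: bottleneck with ReLU, then a sigmoid gate per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed)))
              for row in w_reduce]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w_expand]
    return [[g * v for v in ch] for g, ch in zip(gates, x)]


# Toy input: 2 channels, 4 time steps (values are arbitrary).
features = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]
local = depthwise_conv1d(features, [[1.0, 0.0, -1.0], [1.0, 1.0, 1.0]])
mixed = pointwise_conv1d(local, [[1.0, 1.0], [1.0, -1.0]])
gated = squeeze_excite(mixed, [[0.1, 0.2]], [[0.5], [-0.5]])
```

Splitting a convolution into depth-wise and point-wise parts is what keeps the parameter count low, while the Squeeze-and-Excitation gate is the only step here that sees the entire utterance at once.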
Related papers
- TitaNet: Neural Model For Speaker Representation With 1D Depth-wise Separable Convolutions And Global Context (2021)
- Joint Unsupervised And Supervised Learning For Context-aware Language Identification (2023)
- Phonetic Temporal Neural Model For Language Identification (2017)
- BERT-LID: Leveraging BERT To Improve Spoken Language Identification (2022)
- Utterance-level End-to-end Language Identification Using Attention-based CNN-BLSTM (2019)
- ContextNet: Improving Convolutional Neural Networks For Automatic Speech Recognition With Global Context (2020)
- Investigating Context Features Hidden In End-to-end TTS (2018)
- A Deep Neural Network For Short-segment Speaker Recognition (2019)