NEST: Self-supervised Fast Conformer As All-purpose Seasoning To Speech Processing Tasks
2024 · He Huang, Taejin Park, Kunal Dhawan, et al.
Abstract
Self-supervised learning has been shown to benefit a wide range of speech processing tasks, such as speech recognition/translation, speaker verification and diarization. However, most current approaches are computationally expensive. In this paper, we propose a simplified and more efficient self-supervised learning framework termed NeMo Encoder for Speech Tasks (NEST). Specifically, we adopt the FastConformer architecture with an 8x sub-sampling rate, which is faster than Transformer or Conformer architectures. Instead of clustering-based quantization, we use fixed random projection for its simplicity and effectiveness. We also implement a generalized noisy speech augmentation that teaches the model to disentangle the main speaker from noise or other speakers. Experiments show that NEST improves over existing self-supervised models and achieves new state-of-the-art performance on a variety of speech processing tasks, such as speech recognition/translation, speaker diarizatio
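To make the "fixed random projection" idea concrete, here is a minimal sketch of how a random-projection quantizer can turn feature frames into discrete training targets. It is not NEST's actual implementation: the abstract gives no details, so the function names (`make_quantizer`, `quantize`), the dimensions (80-dimensional frames, 16-dimensional codes, 32 codebook entries), the Gaussian initialization, and the L2 normalization are all assumptions chosen for illustration. The only property taken from the abstract is that the projection is random and never trained, in contrast to clustering-based quantization.

```python
import math
import random


def _l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def make_quantizer(feat_dim, code_dim, num_codes, seed=0):
    """Build a frozen random projection matrix and a frozen random codebook.

    Hypothetical helper: both are drawn once from a seeded generator and
    never updated, so no clustering or training step is involved.
    """
    rng = random.Random(seed)
    projection = [[rng.gauss(0.0, 1.0) for _ in range(code_dim)]
                  for _ in range(feat_dim)]
    codebook = [_l2_normalize([rng.gauss(0.0, 1.0) for _ in range(code_dim)])
                for _ in range(num_codes)]
    return projection, codebook


def quantize(frames, projection, codebook):
    """Map each feature frame to the index of its nearest codebook entry."""
    code_dim = len(projection[0])
    labels = []
    for frame in frames:
        # Project the frame into the code space with the fixed matrix.
        projected = [sum(frame[i] * projection[i][j] for i in range(len(frame)))
                     for j in range(code_dim)]
        projected = _l2_normalize(projected)
        # For unit vectors, the nearest neighbour is the largest dot product.
        scores = [sum(p * c for p, c in zip(projected, code))
                  for code in codebook]
        labels.append(max(range(len(codebook)), key=scores.__getitem__))
    return labels


# Usage: five synthetic 80-dimensional frames stand in for real speech features.
frame_rng = random.Random(1)
frames = [[frame_rng.gauss(0.0, 1.0) for _ in range(80)] for _ in range(5)]
projection, codebook = make_quantizer(feat_dim=80, code_dim=16, num_codes=32)
labels = quantize(frames, projection, codebook)
```

Each entry of `labels` is a codebook index that a self-supervised model could be trained to predict for masked frames. Because the projection and codebook are fixed by the seed, the targets are reproducible and cost nothing to fit.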
Related papers
- Universal Paralinguistic Speech Representations Using Self-supervised Conformers (2021) 10.48
- Nextformer: A ConvNeXt Augmented Conformer For End-to-end Speech Recognition (2022) 0.00
- Accidental Learners: Spoken Language Identification In Multilingual Self-supervised Models (2022) 5.84
- Fast Conformer With Linearly Scalable Attention For Efficient Speech Recognition (2023) 14.47
- Conformer-based Self-supervised Learning For Non-speech Audio Tasks (2021) 7.50
- PCNN: A Lightweight Parallel Conformer Neural Network For Efficient Monaural Speech Enhancement (2023) 6.77
- Recent Developments On ESPnet Toolkit Boosted By Conformer (2020) 0.00
- DF-Conformer: Integrated Architecture Of Conv-TasNet And Conformer Using Linear Complexity Self-attention For Speech Enhancement (2021) 11.29