Multi-task Voice Activated Framework Using Self-supervised Learning
2021 · Shehzeen Hussain, Van Nguyen, Shuhua Zhang, et al.
Abstract
Self-supervised learning methods such as wav2vec 2.0 have shown promising results in learning speech representations from unlabelled and untranscribed speech data that are useful for speech recognition. Since these representations are learned without any task-specific supervision, they can also be useful for other voice-activated tasks such as speaker verification, keyword spotting, and emotion classification. In our work, we propose a general-purpose framework for adapting a pre-trained wav2vec 2.0 model to different voice-activated tasks. We develop downstream network architectures that operate on the contextualized speech representations of wav2vec 2.0 and adapt them to a given task. Finally, we extend our framework to perform multi-task learning by jointly optimizing the network parameters on multiple voice-activated tasks using a shared transformer backbone. Both our single-task and multi-task frameworks achieve state-of-the-art results in speaker verification a