A General Multi-task Learning Framework To Leverage Text Data For Speech To Text Tasks
2020 · Yun Tang, Juan Pino, Changhan Wang, et al.
Abstract
Attention-based sequence-to-sequence modeling provides a powerful and elegant solution for applications that need to map one sequence to a different sequence. Its success heavily relies on the availability of large amounts of training data. This presents a challenge for speech applications where labelled speech data is very expensive to obtain, such as automatic speech recognition (ASR) and speech translation (ST). In this study, we propose a general multi-task learning framework to leverage text data for ASR and ST tasks. Two auxiliary tasks, a denoising autoencoder task and machine translation task, are proposed to be co-trained with ASR and ST tasks respectively. We demonstrate that representing text input as phoneme sequences can reduce the difference between speech and text inputs, and enhance the knowledge transfer from text corpora to the speech to text tasks. Our experiments show that the proposed method achieves a relative 10~15% word error rate reduction on the English Libris
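As a rough illustration of the denoising autoencoder auxiliary task on phoneme input, the sketch below builds one (noisy input, clean target) training pair. The abstract does not say how the text input is corrupted, so the random-masking scheme, the `<mask>` token, the `mask_prob` value, the function name `make_denoising_pair`, and the ARPAbet-style phoneme symbols are all assumptions made for illustration, not the authors' method.

```python
import random
from typing import List, Optional, Tuple

MASK = "<mask>"  # hypothetical mask token, not taken from the paper


def make_denoising_pair(
    phonemes: List[str],
    mask_prob: float = 0.3,  # assumed corruption rate
    seed: Optional[int] = None,
) -> Tuple[List[str], List[str]]:
    """Return (noisy_input, clean_target) for a denoising autoencoder.

    Each phoneme is independently replaced by MASK with probability
    mask_prob; the target is the unmodified sequence.
    """
    rng = random.Random(seed)
    noisy = [MASK if rng.random() < mask_prob else p for p in phonemes]
    return noisy, list(phonemes)


# Hypothetical phoneme sequence for the word "hello".
clean = ["HH", "AH", "L", "OW"]
noisy, target = make_denoising_pair(clean, mask_prob=0.3, seed=0)
print(noisy, target)
```

In a multi-task setup of the kind the abstract describes, such pairs would be fed to the text branch while paired speech and transcripts train the ASR branch; how the two losses are combined is not stated in the text above.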
Related papers
- Multitask Training With Text Data For End-to-end Speech Recognition (2020)
- Almost Unsupervised Text To Speech And Automatic Speech Recognition (2019)
- Improving Sequence-to-sequence Acoustic Modeling By Adding Text-supervision (2018)
- Semi-supervised Sequence-to-sequence ASR Using Unpaired Speech And Text (2019)
- Integrating Text Inputs For Training And Adapting RNN Transducer ASR Models (2022)
- Towards Unsupervised Speech-to-text Translation (2018)
- Data Efficient Direct Speech-to-text Translation With Modality Agnostic Meta-learning (2019)
- MMSpeech: Multi-modal Multi-task Encoder-decoder Pre-training For Speech Recognition (2022)