Leveraging Unpaired Text Data For Training End-to-end Speech-to-intent Systems
2020 · Yinghui Huang, Hong-Kwang Kuo, Samuel Thomas, et al.
Abstract
Training an end-to-end (E2E) neural network speech-to-intent (S2I) system that directly extracts intents from speech requires large amounts of intent-labeled speech data, which is time-consuming and expensive to collect. Initializing the S2I model with an ASR model trained on copious speech data can alleviate data sparsity. In this paper, we attempt to leverage NLU text resources. We implemented a CTC-based S2I system that matches the performance of a state-of-the-art, traditional cascaded SLU system. We performed controlled experiments with varying amounts of speech and text training data. When only a tenth of the original data is available, intent classification accuracy degrades by 7.6% absolute. Assuming we have additional text-to-intent data (without speech) available, we investigated two techniques to improve the S2I system: (1) transfer learning, in which acoustic embeddings for intent classification are tied to fine-tuned BERT text embeddings; and (2) data augmentation, in whic