Improving End-to-end Speech Processing By Efficient Text Data Utilization With Latent Synthesis
2023 · Jianqiao Lu, Wenyong Huang, Nianzu Zheng, et al.
Abstract
Training a high-performance end-to-end (E2E) speech processing model requires an enormous amount of labeled speech data, especially in the era of data-centric artificial intelligence. However, labeled speech data are usually scarcer and more expensive to collect than textual data. We propose Latent Synthesis (LaSyn), an efficient textual data utilization framework for E2E speech processing models. We train a latent synthesizer to convert textual data into an intermediate latent representation of a pre-trained speech model. These pseudo-acoustic representations of textual data augment the acoustic data used for model training. We evaluate LaSyn on low-resource automatic speech recognition (ASR) and spoken language understanding (SLU) tasks. For ASR, LaSyn improves an E2E baseline trained on LibriSpeech train-clean-100, with relative word error rate reductions of over 22.3% on different test sets. For SLU, LaSyn improves our E2E baseline by absolute 4.1% for intent classification accurac
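The abstract reports the ASR gain as a relative reduction and the SLU gain as an absolute one. The minimal sketch below shows how the two measures differ. The function names and all input numbers are hypothetical illustrations, not values or code from the paper.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Percentage of the baseline error that was removed (used for WER)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - improved) / baseline


def absolute_gain(baseline: float, improved: float) -> float:
    """Difference in percentage points (used for accuracy)."""
    return improved - baseline


# Hypothetical values chosen only to illustrate the arithmetic.
baseline_wer, augmented_wer = 10.0, 7.5
baseline_acc, augmented_acc = 80.0, 84.1

print(f"relative WER reduction: {relative_reduction(baseline_wer, augmented_wer):.1f}%")
print(f"absolute accuracy gain: {absolute_gain(baseline_acc, augmented_acc):.1f} points")
```

With these made-up inputs, a drop from 10.0% to 7.5% WER is only 2.5 points in absolute terms but a 25% relative reduction, which is why the two kinds of figures cannot be compared directly.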
Related papers
- Using Speech Synthesis To Train End-to-end Spoken Language Understanding Models (2019) 9.23
- Towards Reducing The Need For Speech Training Data To Build Spoken Language Understanding Systems (2022) 8.35
- Generating Synthetic Audio Data For Attention-based Speech Recognition Systems (2019) 12.68
- You Do Not Need More Data: Improving End-to-end Speech Recognition By Text-to-speech Data Augmentation (2020) 11.49
- LLaST: Improved End-to-end Speech Translation System Leveraged By Large Language Models (2024) 10.67
- Integrating Pre-trained Speech And Language Models For End-to-end Speech Recognition (2023) 0.00
- A Simple Baseline For Domain Adaptation In End To End ASR Systems Using Synthetic Data (2022) 7.16
- Speech-language Pre-training For End-to-end Spoken Language Understanding (2021) 9.41