Cycle-consistency Training For End-to-end Speech Recognition
2018 · Takaaki Hori, Ramon Astudillo, Tomoki Hayashi, et al.
Abstract
This paper presents a method to train end-to-end automatic speech recognition (ASR) models using unpaired data. Although the end-to-end approach can eliminate the need for expert knowledge such as pronunciation dictionaries to build ASR systems, it still requires a large amount of paired data, i.e., speech utterances and their transcriptions. Cycle-consistency losses have recently been proposed as a way to mitigate the problem of limited paired data. These approaches compose a reverse operation with a given transformation, e.g., text-to-speech (TTS) with ASR, to build a loss that requires only unsupervised data, speech in this example. Applying cycle consistency to ASR models is not trivial since fundamental information, such as speaker traits, is lost in the intermediate text bottleneck. To solve this problem, this work presents a loss that is based on the speech encoder state sequence instead of the raw speech signal. This is achieved by training a Text-To-Encoder model and defining
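The idea of measuring cycle consistency on encoder states rather than on raw speech can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the random linear maps standing in for trained models, the frame-level greedy token choice, the squared-error distance, and all names and sizes. It only computes the loss value; training through the discrete text step would need a gradient estimator that is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)

FEAT_DIM, STATE_DIM, VOCAB = 8, 4, 5  # toy sizes, chosen only for illustration

# Hypothetical stand-ins for trained models (random linear maps here).
W_enc = rng.normal(size=(FEAT_DIM, STATE_DIM))  # speech encoder
W_dec = rng.normal(size=(STATE_DIM, VOCAB))     # ASR output layer
E_tte = rng.normal(size=(VOCAB, STATE_DIM))     # text-to-encoder lookup table


def encode(speech):
    """Map speech features (T, FEAT_DIM) to encoder states (T, STATE_DIM)."""
    return np.tanh(speech @ W_enc)


def recognize(states):
    """Pick one token per frame greedily (a simplification of real ASR decoding)."""
    return np.argmax(states @ W_dec, axis=1)


def text_to_encoder(tokens):
    """Predict encoder states from tokens alone: (T,) -> (T, STATE_DIM)."""
    return np.tanh(E_tte[tokens])


def cycle_consistency_loss(speech):
    """Mean squared distance between encoder states and their text-based reconstruction."""
    states = encode(speech)                  # speech -> encoder states
    tokens = recognize(states)               # encoder states -> text (the bottleneck)
    reconstructed = text_to_encoder(tokens)  # text -> predicted encoder states
    return float(np.mean((states - reconstructed) ** 2))


speech = rng.normal(size=(6, FEAT_DIM))  # one unpaired utterance with 6 frames
loss = cycle_consistency_loss(speech)
print(f"cycle-consistency loss: {loss:.4f}")
```

Because the comparison happens in encoder-state space, the loss needs only the speech input and no transcription, and it does not require the text to carry speaker traits needed to regenerate a waveform.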
Related papers
- Speaker Consistency Loss And Step-wise Optimization For Semi-supervised Joint Training Of TTS And ASR Using Unpaired Text Data (2022)
- Semi-supervised Sequence-to-sequence ASR Using Unpaired Speech And Text (2019)
- Improved Consistency Training For Semi-supervised Sequence-to-sequence ASR Via Speech Chain Reconstruction And Self-transcribing (2022)
- Enhanced Exemplar Autoencoder With Cycle Consistency Loss In Any-to-one Voice Conversion (2022)
- Optimizing Voice Conversion Network With Cycle Consistency Loss Of Speaker Identity (2020)
- Pre-training Transformer Decoder For End-to-end ASR Model With Unpaired Speech Data (2022)
- Improving Noisy Student Training For Low-resource Languages In End-to-end ASR Using CycleGAN And Inter-domain Losses (2024)
- A CTC Alignment-based Non-autoregressive Transformer For End-to-end Automatic Speech Recognition (2023)