Label-synchronous Speech-to-text Alignment For ASR Using Forward And Backward Transformers
2021 · Yusuke Kida, Tatsuya Komatsu, Masahito Togami
Abstract
This paper proposes a novel label-synchronous speech-to-text alignment technique for automatic speech recognition (ASR). Speech-to-text alignment is the problem of splitting long audio recordings with unaligned transcripts into utterance-wise pairs of speech and text. Unlike conventional methods based on frame-synchronous prediction, the proposed method redefines speech-to-text alignment as a label-synchronous text mapping problem. This enables accurate alignment that benefits from the strong inference ability of state-of-the-art attention-based encoder-decoder models, which cannot be applied to the conventional methods. Two Transformer models, named the forward Transformer and the backward Transformer, are used to estimate the initial and final tokens, respectively, of a given speech segment based on end-of-sentence prediction with teacher-forcing. Experiments using the Corpus of Spontaneous Japanese (CSJ) demonstrate that the proposed method provides an accurate utterance
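As a rough illustration of boundary estimation by end-of-sentence prediction with teacher-forcing, the sketch below feeds successively longer transcript prefixes to a scoring function and keeps the prefix length at which end-of-sentence is most likely. It is a minimal sketch under stated assumptions, not the authors' implementation: `find_eos_boundary` and `toy_eos_prob` are hypothetical names, and `toy_eos_prob` is a toy stand-in for a Transformer decoder conditioned on the speech segment.

```python
from typing import Callable, Sequence


def find_eos_boundary(
    tokens: Sequence[str],
    eos_prob: Callable[[Sequence[str]], float],
) -> int:
    """Return the prefix length n for which the teacher-forced prefix
    tokens[:n] receives the highest end-of-sentence probability."""
    best_n, best_p = 0, float("-inf")
    for n in range(1, len(tokens) + 1):
        # Hypothetical model call: in a real system this would be the
        # decoder's EOS probability after being fed tokens[:n] as history.
        p = eos_prob(tokens[:n])
        if p > best_p:
            best_n, best_p = n, p
    return best_n


def toy_eos_prob(prefix: Sequence[str]) -> float:
    # Assumption for illustration only: the speech segment is taken to
    # cover exactly three tokens, so EOS peaks at a prefix length of 3.
    return 1.0 / (1 + abs(len(prefix) - 3))


transcript = ["a", "b", "c", "d", "e"]
n = find_eos_boundary(transcript, toy_eos_prob)
segment_text = list(transcript[:n])
print(n, segment_text)
```

Under the same assumptions, the opposite boundary could be searched by running the function over the reversed token list with a scorer trained on reversed text. This is one plausible reading of the two-model setup, not a detail confirmed by the abstract.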
Related papers
- Label-synchronous Neural Transducer For Adaptable Online E2E Speech Recognition (2023)
- Combining Frame-synchronous And Label-synchronous Systems For Speech Recognition (2021)
- Synchronous Transformers For End-to-end Speech Recognition (2019)
- Iterative Pseudo-forced Alignment By Acoustic CTC Loss For Self-supervised ASR Domain Adaptation (2022)
- Investigating The Effect Of Label Topology And Training Criterion On ASR Performance And Alignment Quality (2024)
- Transformer-transducers For Code-switched Speech Recognition (2020)
- Almost Unsupervised Text To Speech And Automatic Speech Recognition (2019)
- A CTC Alignment-based Non-autoregressive Transformer For End-to-end Automatic Speech Recognition (2023)