Leveraging Acoustic Contextual Representation By Audio-textual Cross-modal Learning For Conversational ASR
2022 · Kun Wei, Yike Zhang, Sining Sun, et al.
Abstract
Leveraging context information is an intuitive way to improve performance on conversational automatic speech recognition (ASR). Previous works usually adopt the recognized hypotheses of historical utterances as preceding context, which may bias the current recognized hypothesis because of inevitable historical recognition errors. To avoid this problem, we propose an audio-textual cross-modal representation extractor that learns contextual representations directly from preceding speech. Specifically, it consists of two modal-related encoders, which extract high-level latent features from speech and the corresponding text, and a cross-modal encoder, which aims to learn the correlation between speech and text. We randomly mask some input tokens and input sequences of each modality. A token-missing or modal-missing prediction with a modal-level CTC loss is then performed on the cross-modal encoder. Thus, the model captures not only the bi-directional context dependencies in a specific modality but al
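The masking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the mask symbol, the probabilities `token_prob` and `modal_prob`, and the function names `mask_tokens` and `mask_inputs` are all assumptions made for illustration, and speech frames are stood in for by plain list elements.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol, not taken from the paper


def mask_tokens(seq, token_prob, rng):
    """Replace each element with MASK independently with probability token_prob."""
    return [MASK if rng.random() < token_prob else x for x in seq]


def mask_inputs(speech, text, token_prob=0.15, modal_prob=0.1, seed=None):
    """Token-level masking on both modalities, then optionally mask one
    whole modality (never both). Probabilities are illustrative assumptions."""
    rng = random.Random(seed)
    speech_in = mask_tokens(speech, token_prob, rng)
    text_in = mask_tokens(text, token_prob, rng)
    if rng.random() < modal_prob:
        # Modal-missing case: hide one entire input sequence.
        if rng.random() < 0.5:
            speech_in = [MASK] * len(speech)
        else:
            text_in = [MASK] * len(text)
    return speech_in, text_in
```

A model trained on such inputs would be asked to predict the hidden tokens or the hidden modality from what remains; the encoders and the modal-level CTC loss themselves are outside the scope of this sketch.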
Related papers
- Effective Cross-utterance Language Modeling For Conversational Speech Recognition (2021) 2.26
- Improving RNN-T ASR Accuracy Using Context Audio (2020) 5.84
- Using Previous Acoustic Context To Improve Text-to-speech Synthesis (2020) 0.00
- Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers (2021) 10.07
- Improving Transformer-based Conversational ASR By Inter-sentential Attention Mechanism (2022) 7.50
- Towards Effective And Compact Contextual Representation For Conformer Transducer Speech Recognition Systems (2023) 7.16
- Multimodal Speech Recognition With Unstructured Audio Masking (2020) 0.00
- Learning ASR-robust Contextualized Embeddings For Spoken Language Understanding (2019) 12.02