Text-only Domain Adaptation For End-to-end Speech Recognition Through Down-sampling Acoustic Representation
2023 · Jiaxu Zhu, Weinan Tong, Yaoxun Xu, et al.
Abstract
Mapping two modalities, speech and text, into a shared representation space is a line of research that uses text-only data to improve end-to-end automatic speech recognition (ASR) performance in new domains. However, speech representations and text representations differ in length. Although the previous method up-samples the text representation to align with the acoustic modality, the up-sampled length may not match the actual duration. In this paper, we propose a novel representation matching strategy that down-samples the acoustic representation to align with the text modality. By introducing a continuous integrate-and-fire (CIF) module that generates acoustic representations consistent with the token length, our ASR model can better learn unified representations from both modalities, allowing for domain adaptation using text-only data from the target domain. Experimental results on new-domain data demonstrate the effectiveness of the proposed method.
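To make the down-sampling idea concrete, here is a minimal pure-Python sketch of a generic integrate-and-fire step. It is an illustration under stated assumptions, not the authors' implementation: the function name `cif_downsample`, the firing threshold of 1.0, and the hand-picked per-frame weights are all hypothetical. In a real model the weights would be predicted by a learned layer and the frames would be encoder outputs.

```python
def cif_downsample(frames, alphas, threshold=1.0):
    """Integrate frame vectors until the accumulated weight reaches
    `threshold`, then emit ("fire") one token-level vector.

    frames: list of T vectors (lists of floats), one per acoustic frame.
    alphas: list of T weights, each assumed to lie in [0, threshold].
    Returns a list of fired vectors; a leftover partial accumulation
    at the end of the sequence is discarded in this sketch.
    """
    dim = len(frames[0])
    outputs = []
    acc_weight = 0.0          # weight accumulated for the current token
    acc_state = [0.0] * dim   # weighted sum of frames for the current token

    for frame, alpha in zip(frames, alphas):
        if acc_weight + alpha < threshold:
            # Not enough weight yet: keep integrating.
            acc_weight += alpha
            acc_state = [s + alpha * x for s, x in zip(acc_state, frame)]
        else:
            # Boundary reached: spend only the weight needed to hit the
            # threshold, fire, and carry the remainder to the next token.
            used = threshold - acc_weight
            outputs.append([s + used * x for s, x in zip(acc_state, frame)])
            rest = alpha - used
            acc_weight = rest
            acc_state = [rest * x for x in frame]
    return outputs


# Six 1-dimensional frames; the weights sum to 3.0, so three vectors fire.
frames = [[2.0], [4.0], [8.0], [0.0], [1.0], [3.0]]
alphas = [0.5, 0.5, 0.25, 0.75, 0.5, 0.5]
tokens = cif_downsample(frames, alphas)
print(len(frames), "frames ->", len(tokens), "token-level vectors")
```

Because the number of fired vectors follows the sum of the weights, an output sequence of this kind can be made comparable in length to a token sequence, which is the property the abstract relies on.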
Related papers
- Text-only Domain Adaptation Using Unified Speech-text Representation In Transducer (2023)
- A Simple Baseline For Domain Adaptation In End To End ASR Systems Using Synthetic Data (2022)
- Exploring Machine Speech Chain For Domain Adaptation And Few-shot Speaker Adaptation (2021)
- A Domain Adaptation Framework For Speech Recognition Systems With Only Synthetic Data (2025)
- MADI: Inter-domain Matching And Intra-domain Discrimination For Cross-domain Speech Recognition (2023)
- Exploring Textual And Speech Information In Dialogue Act Classification With Speaker Domain Adaptation (2018)
- Integrating Text Inputs For Training And Adapting RNN Transducer ASR Models (2022)
- Iterative Pseudo-forced Alignment By Acoustic CTC Loss For Self-supervised ASR Domain Adaptation (2022)