Regularizing End-to-end Speech Translation With Triangular Decomposition Agreement

2021


Speech Translation Text-to-Speech Speech Recognition

Abstract

End-to-end speech-to-text translation (E2E-ST) is becoming increasingly popular due to its potential for less error propagation, lower latency, and fewer parameters. Given the triplet training corpus \(\langle speech, transcription, translation\rangle\), the conventional high-quality E2E-ST system leverages the \(\langle speech, transcription\rangle\) pair to pre-train the model and then utilizes the \(\langle speech, translation\rangle\) pair to optimize it further. However, this process involves only two-tuple data at each stage, and this loose coupling fails to fully exploit the association within the triplet data. In this paper, we attempt to model the joint probability of transcription and translation conditioned on the speech input to directly leverage such triplet data. Based on that, we propose a novel regularization method for model training to improve the agreement between the two decomposition paths of this joint probability within triplet data, which should be equal in theory. To achieve this goal, we introduce tw
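
The "dual-path decomposition" mentioned above can be illustrated with the chain rule of probability. This is a sketch only: the symbols \(x\) (speech), \(z\) (transcription), and \(y\) (translation) are assumed here for illustration and are not taken from the excerpt, which does not show the paper's own notation or training objective.

\[
p(z, y \mid x) \;=\; p(z \mid x)\, p(y \mid z, x) \;=\; p(y \mid x)\, p(z \mid y, x)
\]

Both factorizations describe the same joint distribution, so they are equal in theory. Separately trained models for each path need not agree in practice, which is presumably the gap the proposed regularization targets.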
