Towards Unsupervised Speech-to-text Translation
2018 · Yu-An Chung, Wei-Hung Weng, Schrasing Tong, et al.
Abstract
We present a framework for building speech-to-text translation (ST) systems using only monolingual speech and text corpora, in other words, speech utterances from a source language and independent text from a target language. As opposed to traditional cascaded systems and end-to-end architectures, our system does not require any labeled data (i.e., transcribed source audio or parallel source and target text corpora) during training, making it especially applicable to language pairs with very few or even zero bilingual resources. The framework initializes the ST system with a cross-modal bilingual dictionary, inferred from the monolingual corpora, that maps every source speech segment corresponding to a spoken word to its target text translation. For unseen source speech utterances, the system first performs word-by-word translation on each speech segment in the utterance. The translation is improved by leveraging a language model and a sequence denoising autoencoder to provide prior kno
Related papers
- Multilingual End-to-end Speech Translation (2019) · 0.00
- Textless Speech-to-speech Translation With Limited Parallel Data (2023) · 3.58
- Textless Speech-to-speech Translation On Real Data (2021) · 13.65
- Leveraging Unsupervised And Weakly-supervised Data To Improve Direct Speech-to-speech Translation (2022) · 8.35
- Leveraging Weakly Supervised Data To Improve End-to-end Speech-to-text Translation (2018) · 13.05
- Towards Speech-to-text Translation Without Speech Recognition (2017) · 10.35
- Rosettaspeech: Zero-shot Speech-to-speech Translation Without Parallel Speech (2025) · 0.00
- Textless Direct Speech-to-speech Translation With Discrete Speech Representation (2022) · 9.76