Textless Speech-to-speech Translation With Limited Parallel Data
2023 · Anuj Diwan, Anirudh Srinivasan, David Harwath, et al.
Abstract
Existing speech-to-speech translation (S2ST) models fall into two camps: they either leverage text as an intermediate step or require hundreds of hours of parallel speech data. Both approaches are incompatible with textless languages or language pairs with limited parallel data. We present PFB, a framework for training textless S2ST models that require just dozens of hours of parallel speech data. We first pretrain a model on large-scale monolingual speech data, finetune it with a small amount of parallel speech data (20-60 hours), and lastly train with an unsupervised backtranslation objective. We train and evaluate our models for English-to-German, German-to-English and Marathi-to-English translation on three different domains (European Parliament, Common Voice, and All India Radio) with single-speaker synthesized speech. Evaluated using the ASR-BLEU metric, our models achieve reasonable performance on all three domains, with some being within 1-2 points of our higher-resourced topli
Related papers
- Textless Speech-to-speech Translation On Real Data (2021) · 13.65
- Joint Pre-training With Speech And Bilingual Text For Direct Speech To Speech Translation (2022) · 7.81
- Rosettaspeech: Zero-shot Speech-to-speech Translation Without Parallel Speech (2025) · 0.00
- Leveraging Unsupervised And Weakly-supervised Data To Improve Direct Speech-to-speech Translation (2022) · 8.35
- Towards Unsupervised Speech-to-text Translation (2018) · 0.00
- Textless Direct Speech-to-speech Translation With Discrete Speech Representation (2022) · 9.76
- Enhanced Direct Speech-to-speech Translation Using Self-supervised Pre-training And Data Augmentation (2022) · 10.85
- Leveraging Weakly Supervised Data To Improve End-to-end Speech-to-text Translation (2018) · 13.05