Exploring Phoneme-level Speech Representations For End-to-end Speech Translation
2019 · Elizabeth Salesky, Matthias Sperber, Alan W Black
Abstract
Previous work on end-to-end translation from speech has primarily used frame-level features as speech representations, which creates longer, sparser sequences than text. We show that a naive method to create compressed phoneme-like speech representations is far more effective and efficient for translation than traditional frame-level speech features. Specifically, we generate phoneme labels for speech frames and average consecutive frames with the same label to create shorter, higher-level source sequences for translation. We see improvements of up to 5 BLEU on both our high and low resource language pairs, with a reduction in training time of 60%. Our improvements hold across multiple data sizes and two language pairs.
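The compression step described in the abstract, averaging consecutive frames that share a phoneme label, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `average_consecutive_frames`, the NumPy array layout (frames by feature dimension), and the toy labels are all assumptions, and the per-frame phoneme labels are taken as already given.

```python
import numpy as np


def average_consecutive_frames(features, labels):
    """Merge runs of consecutive frames that share a label by averaging them.

    features: array of shape (num_frames, feature_dim)
    labels:   sequence of num_frames phoneme labels (hypothetical input;
              assumed to come from some frame-level phoneme labeler)
    Returns (averaged_features, run_labels), one row per run of equal labels.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    if features.shape[0] != labels.shape[0]:
        raise ValueError("features and labels must have the same number of frames")
    if labels.size == 0:
        return features[:0], labels[:0]

    # Start index of every run: frame 0 plus each frame whose label differs
    # from the previous frame's label.
    change_points = np.flatnonzero(labels[1:] != labels[:-1]) + 1
    starts = np.concatenate(([0], change_points))

    # Sum the frames in each run, then divide by the run length.
    run_sums = np.add.reduceat(features, starts, axis=0)
    run_lengths = np.diff(np.concatenate((starts, [labels.shape[0]])))
    return run_sums / run_lengths[:, None], labels[starts]


# Toy example: six 2-dimensional frames collapse to three segments.
frames = [[1, 1], [3, 3], [10, 10], [2, 2], [4, 4], [6, 6]]
frame_labels = ["AH", "AH", "B", "AH", "AH", "AH"]
segments, segment_labels = average_consecutive_frames(frames, frame_labels)
```

Only adjacent frames are merged, so the second run of "AH" frames in the toy example stays a separate segment from the first. The output sequence has one vector per run rather than one per frame, which is the source of the shorter input sequences the abstract refers to.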
Related papers
- Allost: Low-resource Speech Translation Without Source Transcription (2021)
- Leveraging Translations For Speech Transcription In Low-resource Settings (2018)
- Leveraging Weakly Supervised Data To Improve End-to-end Speech-to-text Translation (2018)
- Multilingual End-to-end Speech Translation (2019)
- Multilingual Byte2speech Models For Scalable Low-resource Speech Synthesis (2021)
- Speechformer: Reducing Information Loss In Direct Speech Translation (2021)
- Sample, Translate, Recombine: Leveraging Audio Alignments For Data Augmentation In End-to-end Speech Translation (2022)
- Leveraging Unsupervised And Weakly-supervised Data To Improve Direct Speech-to-speech Translation (2022)