Cross-lingual Transfer Learning For Speech Translation
2024 · Rao Ma, Mengjie Qian, Yassir Fathullah, et al.
Abstract
There has been increasing interest in building multilingual foundation models for NLP and speech research. This paper examines how to expand the speech translation capability of these models with restricted data. Whisper, a speech foundation model with strong performance on speech recognition and English translation, is used as the example model. Using speech-to-speech retrieval to analyse the audio representations generated by the encoder, we show that utterances from different languages are mapped to a shared semantic space. This shared embedding space can then be leveraged for zero-shot cross-lingual transfer in speech translation. By fine-tuning the Whisper decoder with only English-to-Chinese speech translation data, improved performance for translation to Chinese can be obtained for multiple languages, in addition to English. Furthermore, for languages related to those seen in training it is possible to perform speech translation, despite the model never seeing the language in tr
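The speech-to-speech retrieval analysis mentioned in the abstract can be illustrated with a minimal sketch. It assumes each utterance is reduced to one vector by mean-pooling encoder frames, that candidates are ranked by cosine similarity, and that quality is scored as recall@1 over parallel utterances. None of these choices is stated in the abstract; the function names are hypothetical and the toy vectors merely stand in for real Whisper encoder outputs.

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean_pool(frames: Sequence[Sequence[float]]) -> Vector:
    """Average a (num_frames x dim) sequence of frame vectors into one vector.

    Assumption: mean pooling is one simple way to get an utterance-level
    embedding; the paper may pool differently.
    """
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(queries: Sequence[Vector], candidates: Sequence[Vector]) -> List[int]:
    """For each query, return the index of the most similar candidate."""
    return [
        max(range(len(candidates)), key=lambda j: cosine(q, candidates[j]))
        for q in queries
    ]


def recall_at_1(queries: Sequence[Vector], candidates: Sequence[Vector]) -> float:
    """Fraction of queries whose top match is the parallel utterance.

    Assumes row i of `queries` and row i of `candidates` carry the same
    meaning in two different languages.
    """
    hits = sum(1 for i, j in enumerate(retrieve(queries, candidates)) if i == j)
    return hits / len(queries)


# Toy data (hypothetical): two utterances per language, two frames each.
english = [
    mean_pool([[1.0, 0.0], [0.9, 0.1]]),
    mean_pool([[0.0, 1.0], [0.1, 0.9]]),
]
other_language = [
    mean_pool([[0.8, 0.2], [0.8, 0.2]]),
    mean_pool([[0.2, 0.8], [0.2, 0.8]]),
]
print(recall_at_1(english, other_language))
```

A high recall@1 between two languages in such a setup would be consistent with the abstract's claim that utterances from different languages land in a shared semantic space.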
Related papers
- Languages In Whisper-style Speech Encoders Align Both Phonetically And Semantically (2025)
- Investigating Zero-shot Generalizability On Mandarin-English Code-switched ASR And Speech-to-text Translation Of Recent Foundation Models With Self-supervision And Weak Supervision (2023)
- Multilingual DistilWhisper: Efficient Distillation Of Multi-task Speech Models Via Language-specific Experts (2023)
- Whisper-LM: Improving ASR Models With Language Models For Low-resource Languages (2025)
- On The Transferability Of Whisper-based Representations For "in-the-wild" Cross-task Downstream Speech Applications (2023)
- Whisper Speaker Identification: Leveraging Pre-trained Multilingual Transformers For Robust Speaker Embeddings (2025)
- Weighted Cross-entropy For Low-resource Languages In Multilingual Speech Recognition (2024)
- DQ-Whisper: Joint Distillation And Quantization For Efficient Multilingual Speech Recognition (2023)