Languages In Whisper-style Speech Encoders Align Both Phonetically And Semantically
2025 · Ryan Soh-Eun Shim, Domenico de Cristofaro, Chengzhi Martin Hu, et al.
Abstract
Cross-lingual alignment in pretrained language models enables knowledge transfer across languages. Similar alignment has been reported in Whisper-style speech encoders, based on spoken translation retrieval using representational similarity. However, prior work does not control for phonetic overlap between equivalent utterances, which may artificially support retrieval. We conduct pronunciation-controlled experiments to test whether cross-lingual alignment arises from semantic rather than phonetic similarity. Results show that spoken translation retrieval remains strongly above chance without phonetic cues in the final layers of encoders trained with a speech translation objective, most clearly for models additionally trained on translation. We further test early-exiting the encoder to induce representations we hypothesize to be less tied to language-specific semantics. These experiments indeed reveal performance gains in automatic speech recognition on low-resource languages unseen du
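As an illustration of the retrieval setup the abstract refers to, the following is a minimal sketch of spoken translation retrieval by representational similarity. It is not the authors' pipeline: mean pooling over frames, cosine similarity, and top-1 scoring are assumptions made here for illustration, and the synthetic vectors only stand in for encoder-layer outputs. All function names are hypothetical, and the pronunciation control described in the abstract is not modeled.

```python
import math
import random


def mean_pool(frames):
    """Collapse a list of frame vectors (time x dim) into one utterance vector."""
    dim = len(frames[0])
    return [sum(f[j] for f in frames) / len(frames) for j in range(dim)]


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieval_accuracy(src, tgt):
    """Top-1 accuracy: src[i] counts as a hit if its nearest target is tgt[i]."""
    hits = 0
    for i, u in enumerate(src):
        best = max(range(len(tgt)), key=lambda j: cosine(u, tgt[j]))
        hits += best == i
    return hits / len(src)


def synthetic_utterance(meaning, rng, frames=20, noise=0.5):
    """Hypothetical stand-in for encoder output: a shared vector plus frame noise."""
    return [[m + rng.gauss(0.0, noise) for m in meaning] for _ in range(frames)]


rng = random.Random(0)
n_pairs, dim = 30, 16
# One shared "meaning" vector per translation pair (an assumption of this toy setup).
meanings = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_pairs)]
src = [mean_pool(synthetic_utterance(m, rng)) for m in meanings]
tgt = [mean_pool(synthetic_utterance(m, rng)) for m in meanings]

accuracy = retrieval_accuracy(src, tgt)
chance = 1.0 / n_pairs
print(f"top-1 accuracy {accuracy:.2f} vs chance {chance:.2f}")
```

In a real experiment, `src` and `tgt` would hold pooled hidden states from a chosen encoder layer for an utterance and its spoken translation, and accuracy would be compared against the chance level of one over the number of candidates.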
Related papers
- Cross-lingual Transfer Learning For Speech Translation (2024)
- Using Joint Training Speaker Encoder With Consistency Loss To Achieve Cross-lingual Voice Conversion And Expressive Voice Conversion (2023)
- Investigating The Impact Of Cross-lingual Acoustic-phonetic Similarities On Multilingual Speech Recognition (2022)
- Weighted Cross-entropy For Low-resource Languages In Multilingual Speech Recognition (2024)
- Cross-lingual Low Resource Speaker Adaptation Using Phonological Features (2021)
- Whisper Speaker Identification: Leveraging Pre-trained Multilingual Transformers For Robust Speaker Embeddings (2025)
- Whisper-lm: Improving ASR Models With Language Models For Low-resource Languages (2025)
- Aligning Speech To Languages To Enhance Code-switching Speech Recognition (2024)