Zero-resource Speech Translation and Recognition with LLMs
2024 · Karel Mundnich, Xing Niu, Prashant Mathur, et al.
Abstract
Despite recent advancements in speech processing, zero-resource speech translation (ST) and automatic speech recognition (ASR) remain challenging problems. In this work, we propose to leverage a multilingual Large Language Model (LLM) to perform ST and ASR in languages for which the model has never seen paired audio-text data. We achieve this by using a pre-trained multilingual speech encoder, a multilingual LLM, and a lightweight adaptation module that maps the audio representations to the token embedding space of the LLM. We perform several experiments in both ST and ASR to understand how best to train the model and which data has the most impact on performance in previously unseen languages. In ST, our best model achieves BLEU scores over 23 on CoVoST2 for two previously unseen languages, while in ASR, we achieve WERs of up to 28.2%. Finally, we show that the performance of our system is bounded by the ability of the LLM to output text in the desired language.
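To make the role of the adaptation module concrete, here is a minimal sketch of one way such a module could map speech-encoder frames to vectors with the width of the LLM's token embeddings. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not describe the adapter's internals, so the frame stacking, the single linear projection, the toy dimensions, and the names `stack_frames`, `linear`, and `adapt` are all hypothetical.

```python
import random


def stack_frames(frames, k):
    """Concatenate every k consecutive frames to shorten the sequence.

    Assumption: length reduction by stacking is a common adapter choice,
    not something the abstract specifies.
    """
    usable = len(frames) - len(frames) % k  # drop the ragged tail
    return [sum(frames[i:i + k], []) for i in range(0, usable, k)]


def linear(x, weight, bias):
    """Apply y = W x + b to a single vector."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def adapt(frames, weight, bias, k):
    """Map encoder frames to vectors with the LLM embedding width."""
    return [linear(x, weight, bias) for x in stack_frames(frames, k)]


# Toy sizes (hypothetical): encoder width 4, LLM embedding width 6, stack 2.
random.seed(0)
ENC_DIM, LLM_DIM, K = 4, 6, 2

# Stand-in for the output of a frozen speech encoder: 5 frames.
frames = [[random.random() for _ in range(ENC_DIM)] for _ in range(5)]

# In this sketch the adapter weights are the only trainable parameters.
weight = [[random.gauss(0, 0.1) for _ in range(ENC_DIM * K)]
          for _ in range(LLM_DIM)]
bias = [0.0] * LLM_DIM

audio_embeddings = adapt(frames, weight, bias, K)
print(len(audio_embeddings), len(audio_embeddings[0]))  # 2 6
```

In a setup like this, the resulting vectors would be placed alongside the embeddings of a text prompt and passed to the LLM in place of token embeddings, so the LLM treats the audio as a sequence of soft tokens.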
Related papers
- Investigating Decoder-only Large Language Models For Speech-to-text Translation (2024)
- LLaST: Improved End-to-end Speech Translation System Leveraged By Large Language Models (2024)
- Making LLMs Better Many-to-many Speech-to-text Translators With Curriculum Learning (2024)
- Prompting Large Language Models For Zero-shot Domain Adaptation In Speech Recognition (2023)
- Large Language Model Can Transcribe Speech In Multi-talker Scenarios With Versatile Instructions (2024)
- Harnessing The Zero-shot Power Of Instruction-tuned Large Language Model In End-to-end Speech Recognition (2023)
- Chain-of-thought Prompting For Speech Translation (2024)
- Tackling Data Scarcity In Speech Translation Using Zero-shot Multilingual Machine Translation Techniques (2022)