Discrete Multimodal Transformers With A Pretrained Large Language Model For Mixed-supervision Speech Processing
2024 Β· Viet Anh Trinh, Rosy Southwell, Yiwen Guan, et al.
Abstract
Recent work on discrete speech tokenization has paved the way for models that can seamlessly perform multiple tasks across modalities, e.g., speech recognition, text to speech, speech to speech translation. Moreover, large language models (LLMs) pretrained from vast text corpora contain rich linguistic information that can improve accuracy in a variety of tasks. In this paper, we present a decoder-only Discrete Multimodal Language Model (DMLM), which can be flexibly applied to multiple tasks (ASR, T2S, S2TT, etc.) and modalities (text, speech, vision). We explore several critical aspects of discrete multi-modal models, including the loss function, weight initialization, mixed training supervision, and codebook. Our results show that DMLM benefits significantly, across multiple tasks and datasets, from a combination of supervised and unsupervised training. Moreover, for ASR, it benefits from initializing DMLM from a pretrained LLM, and from a codebook derived from Whisper activations.
Authors
(none)
Tags
Stats
Related papers
- Mixture-of-transformers: A Sparse And Scalable Architecture For Multi-modal Foundation Models (2024)0.00
- Discreteslu: A Large Language Model With Self-supervised Discrete Speech Units For Spoken Language Understanding (2024)5.84
- Large Language Models Are Strong Audio-visual Speech Recognition Learners (2024)9.59
- Speechlm: Enhanced Speech Pre-training With Unpaired Textual Data (2022)0.00
- Making Llms Better Many-to-many Speech-to-text Translators With Curriculum Learning (2024)7.31
- Teaching A Multilingual Large Language Model To Understand Multilingual Speech Via Multi-instructional Training (2024)0.00
- Investigating Decoder-only Large Language Models For Speech-to-text Translation (2024)0.00
- Large Language Model Can Transcribe Speech In Multi-talker Scenarios With Versatile Instructions (2024)11.23