AlignFormer: Modality Matching Can Achieve Better Zero-shot Instruction-following Speech-LLM
2024 · Ruchao Fan, Bo Ren, Yuxuan Hu, et al.
Abstract
Integrating speech into LLMs (speech-LLM) has gained increasing attention recently. The mainstream solution is to connect a well-trained speech encoder and an LLM with a neural adapter. However, the length mismatch between the speech and text sequences is not well handled, leading to imperfect modality matching between speech and text. In this work, we propose a novel neural adapter, AlignFormer, to reduce the length gap between the two modalities. AlignFormer consists of CTC and dynamic-window QFormer layers, where the CTC alignment provides the dynamic window information for the QFormer. The LLM backbone is frozen during training to preserve its text capability, especially its instruction-following capability. When trained with ASR data only, the proposed AlignFormer unlocks the instruction-following capability of the speech-LLM, and the model can perform zero-shot speech translation (ST) and speech question answering (SQA) tasks. In fact, speech-LLM with AlignFormer can theoretically perform
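The abstract does not spell out how the CTC alignment becomes a set of windows, so the following is only an illustrative sketch of one possible reading, not the authors' implementation. It assumes per-frame greedy CTC labels are available, assigns each collapsed (non-blank) token one contiguous window of frames, and then pools each window to a single vector so the speech sequence length matches the token count. The function names `ctc_windows` and `pool_windows`, the rule that leading blanks join the following token, and the use of mean pooling in place of QFormer cross-attention are all assumptions made for the example.

```python
from typing import List, Tuple


def ctc_windows(frame_labels: List[int], blank: int = 0) -> List[Tuple[int, int, int]]:
    """Return one (token_id, start, end) window per collapsed CTC token.

    `end` is exclusive. Assumed convention (not from the paper): blank frames
    before a token belong to that token's window, and trailing blank frames
    are attached to the last window.
    """
    windows: List[Tuple[int, int, int]] = []
    n = len(frame_labels)
    window_start = 0  # first frame not yet assigned to any window
    t = 0
    while t < n:
        label = frame_labels[t]
        if label == blank:
            t += 1
            continue
        # Extend over the run of identical labels (CTC repeat collapsing).
        run_end = t
        while run_end + 1 < n and frame_labels[run_end + 1] == label:
            run_end += 1
        windows.append((label, window_start, run_end + 1))
        window_start = run_end + 1
        t = run_end + 1
    # Attach trailing blank frames to the final window, if there is one.
    if windows and window_start < n:
        token, start, _ = windows[-1]
        windows[-1] = (token, start, n)
    return windows


def pool_windows(frames: List[List[float]],
                 windows: List[Tuple[int, int, int]]) -> List[List[float]]:
    """Mean-pool frame features inside each window.

    Mean pooling is a stand-in for attention restricted to the window.
    """
    dim = len(frames[0])
    pooled = []
    for _, start, end in windows:
        chunk = frames[start:end]
        pooled.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return pooled


# Toy example: 11 frames, blank id 0, greedy labels decode to tokens [3, 5, 5].
labels = [0, 0, 3, 3, 0, 5, 0, 0, 5, 5, 0]
features = [[float(i)] for i in range(len(labels))]

wins = ctc_windows(labels)
print(wins)                          # [(3, 0, 4), (5, 4, 6), (5, 6, 11)]
print(pool_windows(features, wins))  # [[1.5], [4.5], [8.0]]
```

In this toy case 11 speech frames are reduced to 3 vectors, one per decoded token, which is the kind of length matching the abstract describes. A real adapter would replace the mean with learned queries attending only to the frames inside each window.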
Related papers
- Ideal-LLM: Integrating Dual Encoders And Language-adapted LLM For Multilingual Speech-to-text (2024) 5.24
- A Comprehensive Solution To Connect Speech Encoder And Large Language Model For ASR (2024) 0.00
- Harnessing The Zero-shot Power Of Instruction-tuned Large Language Model In End-to-end Speech Recognition (2023) 0.00
- TTA: Transcribe, Translate And Alignment For Cross-lingual Speech Representation (2025) 0.00
- Hearing To Translate: The Effectiveness Of Speech Modality Integration Into LLMs (2026) 0.00
- Improving Robustness Of LLM-based Speech Synthesis By Learning Monotonic Alignment (2024) 0.00
- Zero-resource Speech Translation And Recognition With LLMs (2024) 3.58
- Optimizing Alignment Of Speech And Language Latent Spaces For End-to-end Speech Recognition And Understanding (2021) 9.03