Prepending Or Cross-attention For Speech-to-text? An Empirical Comparison
2025 · Tsz Kin Lam, Marco Gaido, Sara Papi, et al.
Abstract
Following the remarkable success of Large Language Models (LLMs) in NLP tasks, there is increasing interest in extending their capabilities to speech, the most common form of communication. The most widespread approach to integrating speech into LLMs is dense feature prepending (DFP), which prepends the projected speech representations to the textual representations, allowing end-to-end training with a speech encoder. This raises two questions: whether DFP needs a sophisticated speech encoder, and how its performance compares with a standard encoder-decoder (i.e., cross-attention) architecture. We compare DFP and cross-attention under a variety of configurations, such as CTC compression and sequence-level knowledge distillation, on monolingual, bilingual, and multilingual models. To perform a controlled architectural comparison, we train all models from scratch rather than using large pretrained models and use comparable data and parameter settings, testing speech-to-text recognition
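The structural difference between the two approaches named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the tiny dimensions, and the hand-picked projection weights are all hypothetical, and real systems use learned projections and full Transformer layers. The sketch only shows where the speech representations end up: in DFP they are projected and placed in front of the text embeddings in a single decoder sequence, whereas in an encoder-decoder model they stay in a separate memory that the decoder reads through cross-attention.

```python
def project(frames, weight):
    """Linearly project each speech frame (length d_in) to d_model.

    `weight` is a list of d_model columns, each of length d_in (hypothetical layout).
    """
    return [[sum(x * w for x, w in zip(frame, col)) for col in weight]
            for frame in frames]


def dfp_decoder_input(speech_frames, text_embeddings, weight):
    """Dense feature prepending: projected speech first, then text, in one sequence."""
    return project(speech_frames, weight) + text_embeddings


def attention_score_shapes(num_speech, num_text):
    """Shapes of the attention score matrices in each design (before any masking)."""
    return {
        # DFP: one self-attention over the concatenated sequence.
        "dfp_self": (num_speech + num_text, num_speech + num_text),
        # Encoder-decoder: text-only self-attention plus text-to-speech cross-attention.
        "xattn_self": (num_text, num_text),
        "xattn_cross": (num_text, num_speech),
    }


# Toy data (assumed values): 3 speech frames with d_in = 2, 2 text tokens with d_model = 3.
speech = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

dfp_input = dfp_decoder_input(speech, text, weight)
shapes = attention_score_shapes(len(speech), len(text))

print(len(dfp_input))       # decoder sequence length under DFP
print(shapes["dfp_self"])
print(shapes["xattn_cross"])
```

In this toy setting the DFP decoder processes five positions (three speech, two text), while the cross-attention decoder processes only the two text positions and attends to the three speech positions separately. This is one reason sequence-shortening techniques such as CTC compression are relevant to the comparison.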
Related papers
- A Comparison Of Techniques For Language Model Integration In Encoder-decoder Speech Recognition (2018)
- On Decoder-only Architecture For Speech-to-text And Large Language Model Integration (2023)
- Adapting Large Language Model With Speech For Fully Formatted End-to-end Speech Recognition (2023)
- Discrete Multimodal Transformers With A Pretrained Large Language Model For Mixed-supervision Speech Processing (2024)
- Integrating Pre-trained Speech And Language Models For End-to-end Speech Recognition (2023)
- Decoder-only Architecture For Speech Recognition With CTC Prompts And Text Data Augmentation (2023)
- Investigating Decoder-only Large Language Models For Speech-to-text Translation (2024)
- A Comprehensive Solution To Connect Speech Encoder And Large Language Model For ASR (2024)