Multimodal Large Language Models With Fusion Low Rank Adaptation For Device Directed Speech Detection
2024 · Shruti Palaskar, Oggi Rudovic, Sameer Dharur, et al.
Abstract
Although Large Language Models (LLMs) have shown promise for human-like conversations, they are primarily pre-trained on text data. Incorporating audio or video improves performance, but collecting large-scale multimodal data and pre-training multimodal LLMs is challenging. To this end, we propose a Fusion Low Rank Adaptation (FLoRA) technique that efficiently adapts a pre-trained unimodal LLM to consume new, previously unseen modalities via low rank adaptation. For device-directed speech detection, the multimodal LLM adapted with FLoRA achieves a 22% relative reduction in equal error rate (EER) over the text-only approach and attains performance parity with its full fine-tuning (FFT) counterpart while tuning only a fraction of its parameters. Furthermore, with the newly introduced adapter dropout, FLoRA is robust to missing data, improving over FFT with a 20% lower EER and a 56% lower false accept rate. The proposed approach scales well for model sizes from 16M to 3B parameters.
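The abstract reports its results as equal error rate (EER) and as relative reductions in EER. As a reading aid, the sketch below shows one common way to estimate EER from detection scores and to compute a relative reduction. It is not the authors' evaluation code: the function names, the threshold sweep, and the toy scores and labels are all assumptions made for illustration.

```python
def equal_error_rate(scores, labels):
    """Estimate EER by sweeping thresholds over the observed scores.

    Assumes higher scores mean "more likely device-directed" and that
    labels are 1 (device-directed) or 0 (not device-directed).
    Returns (eer, threshold) at the point where the false accept rate
    and false reject rate are closest.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")

    best_gap, best_eer, best_thr = None, None, None
    for thr in sorted(set(scores)):
        # An utterance is accepted when its score is >= thr.
        false_accepts = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        false_rejects = sum(1 for s, y in zip(scores, labels) if s < thr and y == 1)
        far = false_accepts / negatives
        frr = false_rejects / positives
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer, best_thr = gap, (far + frr) / 2, thr
    return best_eer, best_thr


def relative_reduction(baseline, improved):
    """Relative reduction of an error metric, e.g. 0.22 for a 22% reduction."""
    return (baseline - improved) / baseline


# Toy data, invented for this example only.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]
eer, threshold = equal_error_rate(toy_scores, toy_labels)
print(f"EER = {eer:.2f} at threshold {threshold}")
```

With a finite score set the false accept and false reject rates rarely cross exactly, so this sketch reports their average at the closest point; interpolating between neighbouring thresholds is a common alternative.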
Related papers
- A Multimodal Approach To Device-directed Speech Detection With Large Language Models (2024)
- Multilingual And Fully Non-autoregressive ASR With Large Language Model Fusion: A Comprehensive Study (2024)
- Large Language Models Are Strong Audio-visual Speech Recognition Learners (2024)
- Adapting Speech Foundation Models For Unified Multimodal Speech Recognition With Large Language Models (2025)
- Delayed Fusion: Integrating Large Language Models Into First-pass Decoding In End-to-end Speech Recognition (2025)
- Fine-grained Audio-visual Joint Representations For Multimodal Large Language Models (2023)
- It's Never Too Late: Fusing Acoustic Information Into Large Language Models For Automatic Speech Recognition (2024)
- Prompting Large Language Models For Zero-shot Domain Adaptation In Speech Recognition (2023)