SELMA: A Speech-enabled Language Model For Virtual Assistant Interactions
2025 · Dominik Wagner, Alexander Churchill, Siddharth Sigtia, et al.
Abstract
In this work, we present and evaluate SELMA, a Speech-Enabled Language Model for virtual Assistant interactions that integrates audio and text as inputs to a Large Language Model (LLM). SELMA is designed to handle three primary and two auxiliary tasks related to interactions with virtual assistants simultaneously within a single end-to-end model. We employ low-rank adaptation modules for parameter-efficient training of both the audio encoder and the LLM. Additionally, we implement a feature pooling strategy that enables the system to recognize global patterns and improves accuracy on tasks less reliant on individual sequence elements. Experimental results on Voice Trigger (VT) detection, Device-Directed Speech Detection (DDSD), and Automatic Speech Recognition (ASR) demonstrate that our approach both significantly simplifies the typical input processing pipeline of virtual assistants and improves performance compared to dedicated models for each individual task. SELMA yields relative
Related papers
- A Multimodal Approach To Device-directed Speech Detection With Large Language Models (2024) 7.16
- SELM: Speech Enhancement Using Discrete Tokens And Language Models (2023) 11.19
- AudioChatLlama: Towards General-purpose Speech Abilities For LLMs (2023) 9.41
- Large Language Models Are Strong Audio-visual Speech Recognition Learners (2024) 9.59
- SALM: Speech-augmented Language Model With In-context Learning For Speech Recognition And Translation (2023) 11.29
- Recent Advances In Speech Language Models: A Survey (2024) 14.64
- ELLA-V: Stable Neural Codec Language Modeling With Alignment-guided Sequence Reordering (2024) 0.00
- Server-side Rescoring Of Spoken Entity-centric Knowledge Queries For Virtual Assistants (2023) 0.00