A Multimodal Approach To Device-directed Speech Detection With Large Language Models
2024 · Dominik Wagner, Alexander Churchill, Siddharth Sigtia, et al.
Abstract
Interactions with virtual assistants typically start with a predefined trigger phrase followed by the user command. To make interactions with the assistant more intuitive, we explore whether it is feasible to drop the requirement that users must begin each command with a trigger phrase. We explore this task in three ways: First, we train classifiers using only acoustic information obtained from the audio waveform. Second, we take the decoder outputs of an automatic speech recognition (ASR) system, such as 1-best hypotheses, as input features to a large language model (LLM). Finally, we explore a multimodal system that combines acoustic and lexical features, as well as ASR decoder signals, in an LLM. Using multimodal information yields relative equal-error-rate (EER) improvements over text-only and audio-only models of up to 39% and 61%, respectively. Increasing the size of the LLM and training with low-rank adaptation leads to further relative EER reductions of up to 18% on our dataset.
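The abstract reports its results as relative equal-error-rate reductions. As a minimal sketch of what that metric means, the snippet below estimates an EER by sweeping a decision threshold over classifier scores and then computes a relative reduction between two EER values. The function names, the toy scores, and the absolute EER values are hypothetical illustrations; they are not taken from the paper, and the paper's own evaluation procedure may differ.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping thresholds over the observed scores.

    scores: higher means "more likely device-directed" (assumed convention).
    labels: 1 for device-directed speech, 0 for non-directed speech.
    Returns the average of the false-accept and false-reject rates at the
    threshold where the two rates are closest.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    best_gap, eer = float("inf"), 1.0
    for threshold in sorted(set(scores)):
        # False-accept rate: non-directed utterances scored at or above threshold.
        far = sum(s >= threshold for s in negatives) / len(negatives)
        # False-reject rate: directed utterances scored below threshold.
        frr = sum(s < threshold for s in positives) / len(positives)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_reduction(baseline_eer, new_eer):
    """Relative EER reduction of new_eer with respect to baseline_eer."""
    return (baseline_eer - new_eer) / baseline_eer


# Toy data (hypothetical): four directed and four non-directed utterances.
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.1]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_eer = equal_error_rate(toy_scores, toy_labels)

# Hypothetical absolute EERs chosen only to show the arithmetic of a 39% relative gain.
example_gain = relative_reduction(0.100, 0.061)
```

A relative reduction is measured against the baseline's own error rate, so a 39% relative improvement means the multimodal EER is 61% of the baseline EER, whatever the absolute values are.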
Related papers
- SELMA: A Speech-enabled Language Model For Virtual Assistant Interactions (2025)
- Multimodal Large Language Models With Fusion Low Rank Adaptation For Device Directed Speech Detection (2024)
- Large Language Models Are Strong Audio-visual Speech Recognition Learners (2024)
- Discrete Multimodal Transformers With A Pretrained Large Language Model For Mixed-supervision Speech Processing (2024)
- Recent Advances In Speech Language Models: A Survey (2024)
- Device-directed Utterance Detection (2018)
- Tiny-align: Bridging Automatic Speech Recognition And Large Language Model On The Edge (2024)
- Large Language Model Can Transcribe Speech In Multi-talker Scenarios With Versatile Instructions (2024)