Modality Dropout For Multimodal Device Directed Speech Detection Using Verbal And Non-verbal Features
2023 · Gautam Krishna, Sameer Dharur, Oggi Rudovic, et al.
Abstract
Device-directed speech detection (DDSD) is the binary classification task of distinguishing queries directed at a voice assistant from side conversation or background speech. State-of-the-art DDSD systems use verbal cues, e.g., acoustic, text and/or automatic speech recognition (ASR) system features, to classify speech as device-directed or otherwise, and often have to contend with one or more of these modalities being unavailable when deployed in real-world settings. In this paper, we investigate fusion schemes for DDSD systems that can be made more robust to missing modalities. Concurrently, we study the use of non-verbal cues, specifically prosody features, in addition to verbal cues for DDSD. We present different approaches to combine scores and embeddings from prosody with the corresponding verbal cues, finding that prosody improves DDSD performance by up to 8.5% in terms of false acceptance rate (FA) at a given fixed operating point via non-linear intermediate fusion, whil
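The abstract does not spell out how modality dropout is applied, so the following is only a generic sketch of the idea named in the title, not the authors' implementation. It assumes each modality (for example acoustic, text and prosody) yields a fixed-size embedding, and that during training whole modalities are zeroed at random so the fused classifier learns to cope with missing inputs. The function names `modality_dropout` and `concat_fusion`, the `drop_prob` parameter, and the rule of always keeping at least one modality are all illustrative assumptions.

```python
import random
from typing import Dict, List, Optional


def modality_dropout(
    embeddings: Dict[str, List[float]],
    drop_prob: float,
    rng: Optional[random.Random] = None,
) -> Dict[str, List[float]]:
    """Zero out whole modalities at random, always keeping at least one.

    Hypothetical helper: the keep-at-least-one rule is an assumption,
    not something stated in the abstract.
    """
    rng = rng or random.Random()
    names = list(embeddings)
    # Decide independently for each modality whether to drop it.
    dropped = [name for name in names if rng.random() < drop_prob]
    if names and len(dropped) == len(names):
        # Never drop every modality: restore one chosen at random.
        dropped.remove(rng.choice(dropped))
    return {
        name: [0.0] * len(vec) if name in dropped else list(vec)
        for name, vec in embeddings.items()
    }


def concat_fusion(embeddings: Dict[str, List[float]], order: List[str]) -> List[float]:
    """Concatenate per-modality embeddings in a fixed order.

    The fused vector would be the input to a downstream classifier,
    which is not shown here.
    """
    fused: List[float] = []
    for name in order:
        fused.extend(embeddings[name])
    return fused


if __name__ == "__main__":
    # Toy embeddings; the values and sizes are made up for illustration.
    sample = {
        "acoustic": [0.2, 0.7],
        "text": [0.5, 0.1, 0.9],
        "prosody": [0.3],
    }
    masked = modality_dropout(sample, drop_prob=0.3, rng=random.Random(0))
    print(concat_fusion(masked, ["acoustic", "text", "prosody"]))
```

At inference time the same zero vector can stand in for a modality that is genuinely unavailable, which is why training with randomly zeroed modalities is a plausible way to build robustness; whether the paper uses zeros or some other placeholder is not stated in the abstract.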
Related papers
- A Multimodal Approach To Device-directed Speech Detection With Large Language Models (2024) 7.16
- Audio-visual Approach For Multimodal Concurrent Speaker Detection (2024) 0.00
- A Study Of Dropout-induced Modality Bias On Robustness To Missing Video Frames For Audio-visual Speech Recognition (2024) 9.50
- A Novel Multimodal Dynamic Fusion Network For Disfluency Detection In Spoken Utterances (2022) 0.00
- Efficient Audiovisual Speech Processing Via MUTUD: Multimodal Training And Unimodal Deployment (2025) 0.00
- Device-directed Utterance Detection (2018) 10.35
- Comparative Analysis Of Modality Fusion Approaches For Audio-visual Person Identification And Verification (2024) 0.00
- Integrating Audio, Visual, And Semantic Information For Enhanced Multimodal Speaker Diarization (2024) 0.00