SViQA: A Unified Speech-Vision Multimodal Model for Textless Visual Question Answering
2025 · Bingxin Li
Abstract
Multimodal models that integrate speech and vision hold significant potential for advancing human-computer interaction, particularly in Speech-Based Visual Question Answering (SBVQA), where answering spoken questions about images requires direct audio-visual understanding. Existing approaches predominantly focus on text-visual integration, leaving the gap between the speech and visual modalities underexplored because of their inherent heterogeneity. To this end, we introduce SViQA, a unified speech-vision model that processes spoken questions directly, without text transcription. Building upon the LLaVA architecture, our framework bridges the auditory and visual modalities through two key innovations: (1) end-to-end speech feature extraction that eliminates intermediate text conversion, and (2) cross-modal alignment optimization that enables effective fusion of speech signals with visual content. Extensive experiments on the SBVQA benchmark show that SViQA achieves state-of-the-art performance, reaching 75.62% accuracy,