Fine-grained Audio-visual Joint Representations For Multimodal Large Language Models
2023 · Guangzhi Sun, Wenyi Yu, Changli Tang, et al.
Abstract
Audio-visual large language models (LLMs) have drawn significant attention, yet the fine-grained combination of both input streams, which is challenging but necessary for LLMs to understand general video inputs, remains under-explored. To this end, a fine-grained audio-visual joint representation (FAVOR) learning framework for multimodal LLMs is proposed in this paper, which extends a text-based LLM to simultaneously perceive speech and audio events in the audio input stream and images or videos in the visual input stream, at the frame level. To fuse the audio and visual feature streams into joint representations and to align the joint space with the LLM input embedding space, we propose a causal Q-Former structure with a causal attention module to enhance the capture of causal relations of the audio-visual frames across time. An audio-visual evaluation benchmark (AVEB) is also proposed, which comprises six representative single-modal tasks with five cross-modal tasks reflecting audio-v