WAVE: Learning Unified & Versatile Audio-visual Embeddings With Multimodal LLM
2025 · Changli Tang, Qinfan Xiao, Ke Mei, et al.
Abstract
While embeddings from multimodal large language models (LLMs) excel as general-purpose representations, their application to dynamic modalities like audio and video remains underexplored. We introduce WAVE (unified & versatile audio-visual embeddings), the first LLM-based embedding that creates a unified representation space for text, audio, and video modalities. WAVE employs a novel hierarchical feature fusion strategy and a joint multi-modal, multi-task training approach to enable two key capabilities: any-to-any cross-modal retrieval and the generation of prompt-aware embeddings tailored to user instructions. Experimentally, WAVE sets a new state of the art on the MMEB-v2 video benchmark and achieves superior results in audio and video-to-audio retrieval. Its prompt-aware nature also yields remarkable performance in multimodal question answering, significantly outperforming existing embedding models. Ablation studies validate o
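As a rough illustration of what any-to-any retrieval in a unified representation space means in practice, the sketch below ranks candidates of mixed modality against a text query by cosine similarity. It is a minimal sketch under stated assumptions, not the WAVE implementation: the three-dimensional vectors are hand-made placeholders for embeddings a model would produce, and the names `cosine`, `retrieve`, and the candidate labels are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query, candidates, top_k=1):
    """Return the names of the top_k candidates most similar to the query.

    Because all modalities are assumed to share one embedding space,
    the same function serves text-to-video, video-to-audio, and so on.
    """
    ranked = sorted(
        candidates,
        key=lambda name: cosine(query, candidates[name]),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical toy embeddings; a real system would obtain these from a model.
text_query = [0.9, 0.1, 0.0]
candidates = {
    "video_dog_barking": [0.8, 0.2, 0.1],
    "audio_rainfall": [0.0, 0.1, 0.9],
    "video_city_traffic": [0.1, 0.9, 0.2],
}

best_match = retrieve(text_query, candidates)
top_two = retrieve(text_query, candidates, top_k=2)
```

Here the query and the candidates are interchangeable: swapping in an audio embedding as the query and a pool of video embeddings as candidates would use the same ranking step unchanged.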
Related papers
- Fine-grained Audio-visual Joint Representations For Multimodal Large Language Models (2023) 2.60
- UniAudio 1.5: Large Language Model-driven Audio Codec Is A Few-shot Audio Task Learner (2024) 0.00
- MoWE-Audio: Multitask AudioLLMs With Mixture Of Weak Encoders (2024) 3.58
- Large Language Models Are Strong Audio-visual Speech Recognition Learners (2024) 9.59
- VideoLLaMA 2: Advancing Spatial-temporal Modeling And Audio Understanding In Video-LLMs (2024) 0.00
- TEAL: Tokenize And Embed ALL For Multi-modal Large Language Models (2023) 0.00
- Macaw-LLM: Multi-modal Language Modeling With Image, Audio, Video, And Text Integration (2023) 0.00
- Exploring Efficient-tuned Learning Audio Representation Method From BriVL (2023) 0.00