WAVE: Learning Unified & Versatile Audio-visual Embeddings With Multimodal LLM
2025 · Changli Tang, Qinfan Xiao, Ke Mei, et al.
Abstract
While embeddings from multimodal large language models (LLMs) excel as general-purpose representations, their application to dynamic modalities like audio and video remains underexplored. We introduce WAVE (unified & versatile audio-visual embeddings), the first LLM-based embedding that creates a unified representation space for text, audio, and video modalities. WAVE employs a novel hierarchical feature fusion strategy and a joint multi-modal, multi-task training approach to enable two key capabilities: any-to-any cross-modal retrieval and the generation of prompt-aware embeddings tailored to user instructions. Experimentally, WAVE sets a new state-of-the-art on the MMEB-v2 video benchmark and achieves superior results in audio and video-to-audio retrieval. Its prompt-aware nature also yields remarkable performance in multimodal question answering, significantly outperforming existing embedding models. Ablation studies validate o
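The any-to-any retrieval described in the abstract relies on all modalities sharing one embedding space, so a query from any modality can be ranked against candidates from any other by vector similarity. The following is a minimal sketch of that ranking step only, assuming cosine similarity as the scoring function. The names `cosine` and `retrieve` are hypothetical, and the random vectors stand in for embeddings a model such as WAVE would produce; none of this reflects the authors' actual code or API.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def retrieve(query, candidates, top_k=3):
    """Rank candidates by cosine similarity to the query.

    Returns a list of (candidate_index, score) pairs, best first.
    The query and candidates may come from different modalities,
    provided they live in the same embedding space (an assumption here).
    """
    scored = [(cosine(query, c), i) for i, c in enumerate(candidates)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, s) for s, i in scored[:top_k]]


# Placeholder data: five "video" embeddings and one "text" query that is a
# slightly perturbed copy of video 2. Real embeddings would come from a model.
rng = random.Random(0)
dim = 16
video_embs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(5)]
text_query = [v + 0.05 * rng.gauss(0.0, 1.0) for v in video_embs[2]]

results = retrieve(text_query, video_embs, top_k=3)
```

In this toy setup the query is built from video 2, so that index ranks first. Swapping which modality supplies the query and which supplies the candidates changes nothing in the ranking code, which is the practical appeal of a unified space.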
Related papers
- VLM2Vec-V2: Advancing Multimodal Embedding For Videos, Images, And Visual Documents (2025)
- Vidvec: Unlocking Video MLLM Embeddings For Video-text Retrieval (2026)
- Verve: Versatile Retrieval For Videos Via Unified Embeddings (2026)
- Vela: Scalable Embeddings With Voice Large Language Models For Multimodal Retrieval (2025)
- Vill-e: Video LLM Embeddings For Retrieval (2026)
- Magic-mm-embedding: Towards Visual-token-efficient Universal Multimodal Embedding With MLLMs (2026)
- Analyzing Diffusion And Autoregressive Vision Language Models In Multimodal Embedding Space (2026)
- MERLIN: Multimodal Embedding Refinement Via LLM-based Iterative Navigation For Text-video Retrieval-rerank Pipeline (2024)