Quality Over Quantity? LLM-Based Curation for a Data-Efficient Audio-Video Foundation Model
2025 · Ali Vosoughi, Dimitra Emmanouilidou, Hannes Gamper
Abstract
Integrating audio and visual data for training multimodal foundational models remains a challenge. The Audio-Video Vector Alignment (AVVA) framework addresses this by considering AV scene alignment beyond mere temporal synchronization, and leveraging Large Language Models (LLMs) for data curation. AVVA implements a scoring mechanism for selecting aligned training data segments. It integrates Whisper, a speech-based foundation model, for audio and DINOv2 for video analysis in a dual-encoder structure with contrastive learning on AV pairs. Evaluations on AudioCaps, VALOR, and VGGSound demonstrate the effectiveness of the proposed model architecture and data curation approach. AVVA achieves a significant improvement in top-k accuracies for video-to-audio retrieval on all datasets compared to DenseAV, while using only 192 hrs of curated training data. Furthermore, an ablation study indicates that the data curation process effectively trades data quality for data quantity, yielding increase
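The two ideas in the abstract that lend themselves to a small illustration are score-based selection of training segments and top-k video-to-audio retrieval accuracy. The following is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not give the score scale, the selection threshold, or the embedding details, so the `score` field, the `min_score` parameter, and the function names `curate` and `top_k_accuracy` are all hypothetical. Pair `i` of the video and audio embeddings is assumed to be the matching pair, and cosine similarity is assumed as the retrieval metric.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def curate(segments, min_score):
    """Keep segments whose (hypothetical) alignment score meets the threshold.

    Each segment is a dict with a numeric "score" key; the scale and the
    threshold are assumptions, not values taken from the paper.
    """
    return [s for s in segments if s["score"] >= min_score]


def top_k_accuracy(video_embs, audio_embs, k):
    """Fraction of videos whose paired audio (same index) ranks in the top k."""
    if not video_embs:
        raise ValueError("video_embs must not be empty")
    hits = 0
    for i, v in enumerate(video_embs):
        sims = [cosine(v, a) for a in audio_embs]
        # Rank audio candidates from most to least similar to this video.
        ranked = sorted(range(len(sims)), key=lambda j: sims[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(video_embs)


if __name__ == "__main__":
    segments = [{"id": "a", "score": 9}, {"id": "b", "score": 3}]
    print(curate(segments, min_score=5))

    video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    audio = [[1.0, 0.1], [0.2, 1.0], [1.0, 0.0]]
    for k in (1, 2, 3):
        print(k, top_k_accuracy(video, audio, k))
```

Raising `min_score` shrinks the training set while keeping only the better-aligned segments, which is the quality-versus-quantity trade-off the ablation study refers to.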
Related papers
- AccKV: Towards Efficient Audio-Video LLMs Inference via Adaptive-Focusing and Cross-Calibration KV Cache Optimization (2025)
- Fine-Grained Audio-Visual Joint Representations for Multimodal Large Language Models (2023)
- Large Language Models Are Strong Audio-Visual Speech Recognition Learners (2024)
- Audio-Enhanced Vision-Language Modeling with Latent Space Broadening for High Quality Data Expansion (2025)
- VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs (2024)
- Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models (2023)
- Diffusion Models as Masked Audio-Video Learners (2023)
- Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models (2025)