Audio-enhanced Vision-language Modeling With Latent Space Broadening For High Quality Data Expansion
2025 · Yu Sun, Yin Li, Ruixiao Sun, et al.
Abstract
Transformer-based multimodal models are widely used in industrial-scale recommendation, search, and advertising systems for content understanding and relevance ranking. Enhancing labeled training data quality and cross-modal fusion significantly improves model performance, influencing key metrics such as quality view rates and ad revenue. High-quality annotations are crucial for advancing content modeling, yet traditional statistical-based active learning (AL) methods face limitations: they struggle to detect overconfident misclassifications and are less effective in distinguishing semantically similar items in deep neural networks. Additionally, audio information plays an increasing role, especially in short-video platforms, yet most pre-trained multimodal architectures primarily focus on text and images. While training from scratch across all three modalities is possible, it sacrifices the benefits of leveraging existing pre-trained visual-language (VL) and audio models. To address t
Related papers
- Leveraging Unimodal Self-supervised Learning For Multimodal Audio-visual Speech Recognition (2022) 11.29
- Efficient Selective Audio Masked Multimodal Bottleneck Transformer For Audio-video Classification (2024) 0.00
- Unified Video-language Pre-training With Synchronized Audio (2024) 0.00
- Improving Multimodal Speech Recognition By Data Augmentation And Speech Representations (2022) 9.03
- From Alignment To Advancement: Bootstrapping Audio-language Alignment With Synthetic Data (2025) 2.26
- Large Language Models Are Strong Audio-visual Speech Recognition Learners (2024) 9.59
- Quality Over Quantity? LLM-based Curation For A Data-efficient Audio-video Foundation Model (2025) 0.00
- A Better Use Of Audio-visual Cues: Dense Video Captioning With Bi-modal Transformer (2020) 10.61