Connecting The Dots Between Audio And Text Without Parallel Data Through Visual Knowledge Transfer
2021 · Yanpeng Zhao, Jack Hessel, Youngjae Yu, et al.
Abstract
Machines that can represent and describe environmental soundscapes have practical potential, e.g., for audio tagging and captioning systems. Prevailing learning paradigms rely on parallel audio-text data, which is scarce on the web. We propose VIP-ANT, which induces Audio-Text alignment without using any parallel audio-text data. Our key idea is to share the image modality between bi-modal image-text representations and bi-modal image-audio representations; the image modality functions as a pivot and implicitly connects audio and text in a tri-modal embedding space. In a difficult zero-shot setting with no paired audio-text data, our model demonstrates state-of-the-art zero-shot performance on the ESC50 and US8K audio classification tasks, and even surpasses the supervised state of the art for Clotho caption retrieval (with audio queries) by 2.2% R@1. We further investigate cases of minimal audio-text supervision, finding that, e
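To make the shared-embedding idea concrete, here is a minimal sketch of zero-shot audio classification by cosine similarity between an audio embedding and text embeddings of candidate labels. It is an illustration only, not the authors' implementation: the function names (`cosine`, `zero_shot_classify`), the label strings, and the three-dimensional vectors are all hypothetical stand-ins for what audio and text encoders aligned through a shared image modality might produce.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(audio_emb, text_embs):
    """Return the label whose text embedding is closest to the audio embedding."""
    return max(text_embs, key=lambda label: cosine(audio_emb, text_embs[label]))


# Hand-made toy vectors (assumption: NOT outputs of any real encoder).
# In a tri-modal space, both kinds of vectors would come from encoders
# that were each aligned to the same image encoder.
text_embs = {
    "dog barking": [0.9, 0.1, 0.0],
    "rain": [0.1, 0.9, 0.1],
    "siren": [0.0, 0.2, 0.9],
}
audio_emb = [0.2, 0.8, 0.2]

predicted = zero_shot_classify(audio_emb, text_embs)
print(predicted)
```

No audio-text pair is needed at prediction time in this sketch; the comparison only assumes that both embeddings live in the same space.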
Related papers
- Cross-modal Audio-visual Co-learning For Text-independent Speaker Verification (2023) · 9.23
- Audio Visual Segmentation Through Text Embeddings (2025) · 1.81
- Clipsonic: Text-to-audio Synthesis With Unlabeled Videos And Pretrained Language-vision Models (2023) · 9.03
- Audio-to-image Bird Species Retrieval Without Audio-image Pairs Via Text Distillation (2026) · 0.00
- Learning Audio-video Modalities From Image Captions (2022) · 12.54
- Self-supervised Audio-and-text Pre-training With Extremely Low-resource Parallel Data (2022) · 3.81
- Taming Text-to-sounding Video Generation Via Advanced Modality Condition And Interaction (2025) · 0.00
- Interactive Audio-text Representation For Automated Audio Captioning With Contrastive Learning (2022) · 0.00