On Class Separability Pitfalls In Audio-text Contrastive Zero-shot Learning
2024 · Tiago Tavares, Fabio Ayres, Zhepei Wang, et al.
Abstract
Recent advances in audio-text cross-modal contrastive learning have shown its potential for zero-shot learning. One approach is to project item embeddings from pre-trained backbone neural networks into a cross-modal space in which item similarity can be computed in either domain. This process relies on strong unimodal pre-training of the backbone networks and on a data-intensive training task for the projectors. Both processes can be biased by unintentional data leakage, which can arise from using supervised learning in pre-training or from inadvertently training the cross-modal projection using labels from the zero-shot learning evaluation. In this study, we show that a significant part of the measured zero-shot learning accuracy is due to strengths inherited from the audio and text backbones; that is, these strengths are not learned in the cross-modal domain and are not transferred from one modality to another.
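The projection-and-similarity setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the linear projectors, the helper names (`cosine`, `project`, `zero_shot_predict`), and all numbers are hypothetical and chosen only to show how a label is picked by comparing a projected audio embedding with projected text embeddings of the class names.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def project(matrix, vec):
    """Apply a linear projector (given as a list of rows) to a backbone embedding."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def zero_shot_predict(audio_emb, label_embs, audio_proj, text_proj):
    """Pick the label whose projected text embedding is most similar to the projected audio."""
    a = project(audio_proj, audio_emb)
    scores = {
        label: cosine(a, project(text_proj, t))
        for label, t in label_embs.items()
    }
    return max(scores, key=scores.get), scores

# Toy values (assumptions for illustration, not data from the paper).
audio_proj = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # 3-d audio backbone -> 2-d shared space
text_proj = [[0.0, 1.0], [1.0, 0.0]]               # 2-d text backbone -> 2-d shared space
label_embs = {"dog": [0.1, 0.9], "rain": [0.9, 0.1]}  # text-backbone embeddings of class names
audio_emb = [0.8, 0.2, 0.5]                        # audio-backbone embedding of one clip

predicted, scores = zero_shot_predict(audio_emb, label_embs, audio_proj, text_proj)
print(predicted, scores)
```

In a sketch like this, the accuracy of the nearest-label rule depends both on the projectors and on how well the backbone embeddings already separate the classes, which is the distinction the abstract draws.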
Related papers
- AVGZSLNet: Audio-visual Generalized Zero-shot Learning By Reconstructing Label Features From Multi-modal Embeddings (2020) 12.10
- Zero-shot Audio Classification Using Image Embeddings (2022) 6.34
- Connecting The Dots Between Audio And Text Without Parallel Data Through Visual Knowledge Transfer (2021) 8.09
- HC\(^2\)L: Hybrid And Cooperative Contrastive Learning For Cross-lingual Spoken Language Understanding (2024) 4.52
- Zero-shot Multi-speaker Text-to-speech With State-of-the-art Neural Speaker Embeddings (2019) 15.67
- Unsupervised Voice-face Representation Learning By Cross-modal Prototype Contrast (2022) 10.35
- Contrastive Latent Space Reconstruction Learning For Audio-text Retrieval (2023) 3.58
- u-HuBERT: Unified Mixed-modal Speech Pretraining And Zero-shot Transfer To Unlabeled Modality (2022) 5.99