Audio-to-image Bird Species Retrieval Without Audio-image Pairs Via Text Distillation
2026 · Ilyass Moummad, Marius Miron, Lukas Rauch, et al.
Abstract
Audio-to-image retrieval offers an interpretable alternative to audio-only classification for bioacoustic species recognition, but learning aligned audio-image representations is challenging due to the scarcity of paired audio-image data. We propose a simple and data-efficient approach that enables audio-to-image retrieval without any audio-image supervision. Our proposed method uses text as a semantic intermediary: we distill the text embedding space of a pretrained image-text model (BioCLIP-2), which encodes rich visual and taxonomic structure, into a pretrained audio-text model (BioLingual) by fine-tuning its audio encoder with a contrastive objective. This distillation transfers visually grounded semantics into the audio representation, inducing emergent alignment between audio and image embeddings without using images during training. We evaluate the resulting model on multiple bioacoustic benchmarks. The distilled audio encoder preserves audio discriminative power while substanti
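As a rough illustration of the idea described in the abstract, the sketch below computes a contrastive loss that pulls each audio embedding toward a frozen teacher text embedding of the same species, then retrieves an image by cosine similarity in that shared space. It is not the authors' implementation: the symmetric InfoNCE form, the temperature value, the function names, and the toy 2-D vectors are all assumptions made for illustration, and real embeddings would come from the BioLingual audio encoder and the BioCLIP-2 text and image encoders.

```python
import math

def normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_logits(queries, keys, temperature=0.07):
    # Pairwise cosine similarities divided by a temperature (assumed value).
    q = [normalize(v) for v in queries]
    k = [normalize(v) for v in keys]
    return [[sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k] for qi in q]

def cross_entropy_diagonal(logits):
    # Mean cross-entropy where the correct match for row i is column i.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def contrastive_distillation_loss(audio_emb, teacher_text_emb, temperature=0.07):
    # Hypothetical symmetric InfoNCE: audio-to-text and text-to-audio directions.
    # teacher_text_emb stands in for frozen image-text-model text embeddings.
    logits = cosine_logits(audio_emb, teacher_text_emb, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))

def retrieve_image(audio_query, image_gallery):
    # Return the index of the gallery image most similar to the audio query.
    scores = cosine_logits([audio_query], image_gallery, temperature=1.0)[0]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy data: two species, 2-D embeddings (made up for illustration).
teacher_text = [[1.0, 0.0], [0.0, 1.0]]
aligned_audio = [[0.9, 0.1], [0.1, 0.9]]
misaligned_audio = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = contrastive_distillation_loss(aligned_audio, teacher_text)
loss_misaligned = contrastive_distillation_loss(misaligned_audio, teacher_text)

# Images are never used in the loss; they only appear at retrieval time.
image_gallery = [[0.0, 1.0], [1.0, 0.0]]
best_match = retrieve_image(aligned_audio[0], image_gallery)
```

In this toy setup the aligned audio embeddings yield a lower loss than the misaligned ones, and the first audio query retrieves the gallery image that lies in the same direction as its species' text embedding, which mirrors the emergent audio-image alignment the abstract describes.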
Related papers
- CLIPSonic: Text-to-audio Synthesis With Unlabeled Videos And Pretrained Language-vision Models (2023)
- Connecting The Dots Between Audio And Text Without Parallel Data Through Visual Knowledge Transfer (2021)
- Learning Audio-video Modalities From Image Captions (2022)
- Leveraging Pretrained Image-text Models For Improving Audio-visual Learning (2023)
- BrewCLIP: A Bifurcated Representation Learning Framework For Audio-visual Retrieval (2024)
- Text-based Audio Retrieval By Learning From Similarities Between Audio Captions (2024)
- AudioToken: Adaptation Of Text-conditioned Diffusion Models For Audio-to-image Generation (2023)
- Audio Representation Learning By Distilling Video As Privileged Information (2023)