CLIPSonic: Text-to-audio Synthesis With Unlabeled Videos And Pretrained Language-vision Models
2023 · Hao-Wen Dong, Xiaoyu Liu, Jordi Pons, et al.
Abstract
Recent work has studied text-to-audio synthesis using large amounts of paired text-audio data. However, audio recordings with high-quality text annotations can be difficult to acquire. In this work, we approach text-to-audio synthesis using unlabeled videos and pretrained language-vision models. We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge. We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining (CLIP) model. At test time, we first explore performing a zero-shot modality transfer and condition the diffusion model with a CLIP-encoded text query. However, we observe a noticeable performance drop with respect to image queries. To close this gap, we further adopt a pretrained diffusion prior model to generate a CLIP image embedding given a CLIP text embedding. Our results show the effectiveness of the proposed method, and that
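The three query modes named in the abstract (image query, zero-shot text query, and text query routed through a prior) differ only in how the conditioning vector is produced. The sketch below illustrates that switch under heavy assumptions: `toy_embed`, `toy_prior`, and `conditioning` are hypothetical stand-ins written for this example. They are not the authors' code, not a real CLIP encoder, and not a real diffusion prior.

```python
import math
import random

EMBED_DIM = 8  # toy size chosen for this example; real CLIP embeddings are larger


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def toy_embed(key, dim=EMBED_DIM):
    """Hypothetical stand-in for a CLIP encoder: a deterministic unit vector per key."""
    rng = random.Random(key)
    return l2_normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])


def toy_prior(text_emb, modality_shift):
    """Hypothetical stand-in for a prior that maps a text embedding toward image space.

    Here it is only a fixed offset followed by renormalization.
    """
    return l2_normalize([t + s for t, s in zip(text_emb, modality_shift)])


def conditioning(mode, *, image_key=None, text=None, modality_shift=None):
    """Return the vector that would condition the audio diffusion model."""
    if mode == "image":
        # Training-time setting: condition on an encoded video frame.
        return toy_embed("img:" + image_key)
    if mode == "text":
        # Zero-shot modality transfer: use the text embedding directly.
        return toy_embed("txt:" + text)
    if mode == "text+prior":
        # Convert the text embedding to an image-like embedding first.
        return toy_prior(toy_embed("txt:" + text), modality_shift)
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    shift = [0.1] * EMBED_DIM  # arbitrary illustrative offset
    for mode, kwargs in [
        ("image", {"image_key": "frame_0001"}),
        ("text", {"text": "a dog barking"}),
        ("text+prior", {"text": "a dog barking", "modality_shift": shift}),
    ]:
        cond = conditioning(mode, **kwargs)
        print(mode, len(cond))
```

In all three modes the downstream model receives a unit-length vector of the same size, which is what lets a model trained on image embeddings accept text-derived ones without retraining.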
Related papers
- Leveraging Pretrained Image-text Models For Improving Audio-visual Learning (2023) · 0.00
- Speechclip: Integrating Speech With Pre-trained Vision And Language Model (2022) · 9.92
- Audio-to-image Bird Species Retrieval Without Audio-image Pairs Via Text Distillation (2026) · 0.00
- Audiotoken: Adaptation Of Text-conditioned Diffusion Models For Audio-to-image Generation (2023) · 9.76
- Clapspeech: Learning Prosody From Text Context With Contrastive Language-audio Pre-training (2023) · 0.00
- Learning Audio-video Modalities From Image Captions (2022) · 12.54
- Interactive Audio-text Representation For Automated Audio Captioning With Contrastive Learning (2022) · 0.00
- Lipvoicer: Generating Speech From Silent Videos Guided By Lip Reading (2023) · 3.89