BLAT: Bootstrapping Language-audio Pre-training Based On AudioSet Tag-guided Synthetic Data
2023 · Xuenan Xu, Zhiling Zhang, Zelin Zhou, et al.
Abstract
Compared with ample visual-text pre-training research, few works explore audio-text pre-training, mostly due to the lack of sufficient parallel audio-text data. Most existing methods incorporate the visual modality as a pivot for audio-text pre-training, which inevitably induces data noise. In this paper, we propose to utilize audio captioning to generate text directly from audio, without the aid of the visual modality, so that potential noise from modality mismatch is eliminated. Furthermore, we propose caption generation under the guidance of AudioSet tags, leading to more accurate captions. With the above two improvements, we curate high-quality, large-scale parallel audio-text data, based on which we perform audio-text pre-training. We comprehensively demonstrate the performance of the pre-trained model on a series of downstream audio-related tasks, including single-modality tasks like audio classification and tagging, as well as cross-modal tasks consisting of audio-text retrieval
Related papers
- Audio Captioning Using Pre-trained Large-scale Language Model Guided By Audio-based Similar Caption Retrieval (2020)
- Advancing Natural-language Based Audio Retrieval With PaSST And Large Audio-caption Data Sets (2023)
- Learning Audio-video Modalities From Image Captions (2022)
- Large-scale Contrastive Language-audio Pretraining With Feature Fusion And Keyword-to-caption Augmentation (2022)
- AudioSetCaps: An Enriched Audio-caption Dataset Using Automated Generation Pipeline With Large Audio And Language Models (2024)
- Self-supervised Audio-and-text Pre-training With Extremely Low-resource Parallel Data (2022)
- From Alignment To Advancement: Bootstrapping Audio-language Alignment With Synthetic Data (2025)
- CLIPSonic: Text-to-audio Synthesis With Unlabeled Videos And Pretrained Language-vision Models (2023)