CoLLAP: Contrastive Long-form Language-Audio Pretraining with Musical Temporal Structure Augmentation
2024 · Junda Wu, Warren Li, Zachary Novack, et al.
Abstract
Modeling temporal characteristics plays a significant role in the representation learning of audio waveforms. We propose Contrastive Long-form Language-Audio Pretraining (CoLLAP) to significantly extend the perception window for both the input audio (up to 5 minutes) and the language descriptions (exceeding 250 words), while enabling contrastive learning across modalities and temporal dynamics. Leveraging recent Music-LLMs to generate long-form music captions for full-length songs, augmented with musical temporal structures, we collect 51.3K audio-text pairs derived from the large-scale AudioSet training dataset, where the average audio length reaches 288 seconds. We propose a novel contrastive learning architecture that fuses language representations with structured audio representations by segmenting each song into clips and extracting their embeddings. With an attention mechanism, we capture multimodal temporal correlations, allowing the model to automatically weigh and en
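The pipeline the abstract outlines (segment a song into clips, embed each clip, weigh the clips with attention against the text, then train contrastively) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the scaled dot-product attention form, the non-overlapping zero-padded segmentation, and the temperature of 0.07 are all assumptions made for illustration, and the clip and text embeddings are taken as given rather than produced by real encoders.

```python
import numpy as np

def segment_waveform(waveform, clip_len):
    """Split a 1-D waveform into non-overlapping clips, zero-padding the last one."""
    n_clips = -(-len(waveform) // clip_len)  # ceiling division
    padded = np.zeros(n_clips * clip_len, dtype=waveform.dtype)
    padded[: len(waveform)] = waveform
    return padded.reshape(n_clips, clip_len)

def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def attention_pool(clip_embs, text_emb):
    """Weigh clip embeddings (n_clips, d) by attention against a text query (d,).

    Assumed form: scaled dot-product attention with the text embedding as query.
    """
    d = clip_embs.shape[-1]
    scores = clip_embs @ text_emb / np.sqrt(d)
    scores = scores - scores.max()  # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ clip_embs, weights

def pairwise_similarity(songs, text_embs):
    """Cosine similarity between every song (pooled against each text) and that text."""
    sim = np.zeros((len(songs), len(text_embs)))
    for i, clip_embs in enumerate(songs):
        for j, text_emb in enumerate(text_embs):
            pooled, _ = attention_pool(clip_embs, text_emb)
            sim[i, j] = l2_normalize(pooled) @ l2_normalize(text_emb)
    return sim

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(sim, temperature=0.07):
    """InfoNCE-style loss over a square similarity matrix; matched pairs on the diagonal."""
    logits = sim / temperature
    idx = np.arange(len(logits))
    loss_a2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # audio -> text
    loss_t2a = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> audio
    return 0.5 * (loss_a2t + loss_t2a)
```

In this sketch each song may have a different number of clips, so `songs` is a list of `(n_clips, d)` arrays; pooling a song separately against every caption keeps the negatives in the batch from being pooled with the wrong query.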
Related papers
- Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation (2022) · 19.60
- AudioSetCaps: An Enriched Audio-Caption Dataset Using Automated Generation Pipeline with Large Audio and Language Models (2024) · 13.44
- CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-training (2023) · 0.00
- MATS: An Audio Language Model under Text-only Supervision (2025) · 0.00
- Human-CLAP: Human-perception-based Contrastive Language-Audio Pretraining (2025) · 4.52
- Do Audio-Language Models Understand Linguistic Variations? (2024) · 0.00
- M2D-CLAP: Masked Modeling Duo Meets CLAP for Learning General-purpose Audio-Language Representation (2024) · 7.81
- CALM: Contrastive Aligned Audio-Language Multirate and Multimodal Representations (2022) · 0.00