CALM: Contrastive Aligned Audio-Language Multirate and Multimodal Representations
2022 · Vin Sachidananda, Shao-Yen Tseng, Erik Marchi, et al.
Abstract
Deriving multimodal representations of audio and lexical inputs is a central problem in Natural Language Understanding (NLU). In this paper, we present Contrastive Aligned Audio-Language Multirate and Multimodal Representations (CALM), an approach for learning multimodal representations using contrastive and multirate information inherent in audio and lexical inputs. The proposed model aligns acoustic and lexical information in the input embedding space of a pretrained language-only contextual embedding model. By aligning audio representations to pretrained language representations and utilizing contrastive information between acoustic inputs, CALM is able to bootstrap audio embedding competitive with existing audio representation models in only a few hours of training time. Operationally, audio spectrograms are processed using linearized patches through a Spectral Transformer (SpecTran) which is trained using a Contrastive Audio-Language Pretraining objective to align audio and langua
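The two mechanisms named in the abstract, linearized spectrogram patches and a contrastive audio-language objective, can be illustrated with a minimal sketch. It assumes a generic symmetric InfoNCE loss of the kind used in CLIP-style training, which is not necessarily the exact objective used in CALM. The function names `linearize_patches` and `info_nce`, the patch sizes, and the temperature value are all hypothetical choices made for illustration.

```python
import math


def linearize_patches(spec, patch_f, patch_t):
    """Split a spectrogram (list of rows, freq x time) into flattened patches.

    Illustrative only: non-overlapping patches, row-major flattening.
    """
    n_f, n_t = len(spec), len(spec[0])
    patches = []
    for f0 in range(0, n_f - patch_f + 1, patch_f):
        for t0 in range(0, n_t - patch_t + 1, patch_t):
            patches.append([spec[f][t]
                            for f in range(f0, f0 + patch_f)
                            for t in range(t0, t0 + patch_t)])
    return patches


def _l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def _mean_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def info_nce(audio, text, temperature=0.1):
    """Symmetric InfoNCE loss: audio[i] and text[i] form the positive pair.

    Assumption: a generic CLIP-style loss, not the paper's exact objective.
    """
    a = [_l2_normalize(v) for v in audio]
    t = [_l2_normalize(v) for v in text]
    logits = [[sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
              for ai in a]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_mean_cross_entropy(logits) + _mean_cross_entropy(transposed))


if __name__ == "__main__":
    spec = [[0, 1, 2, 3], [4, 5, 6, 7]]
    print(linearize_patches(spec, patch_f=2, patch_t=2))
    print(info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]))
```

In this sketch the loss is small when each audio embedding is closest to its paired text embedding and grows when the pairs are mismatched, which is the behavior a contrastive alignment objective relies on.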
Related papers
- Continuous Audio Language Models (2025) 0.00
- CTAL: Pre-training Cross-modal Transformer For Audio-and-language Representations (2021) 7.50
- CACARA: Cross-modal Alignment Leveraging A Text-centric Approach For Cost-effective Multimodal And Multilingual Learning (2025) 0.00
- Do Audio-language Models Understand Linguistic Variations? (2024) 0.00
- MATS: An Audio Language Model Under Text-only Supervision (2025) 0.00
- Collap: Contrastive Long-form Language-audio Pretraining With Musical Temporal Structure Augmentation (2024) 3.58
- C3LLM: Conditional Multimodal Content Generation Using Large Language Models (2024) 0.00
- Cross-modal Contrastive Representation Learning For Audio-to-image Generation (2022) 0.00