CTAL: Pre-training Cross-modal Transformer For Audio-and-language Representations
2021 · Hang Li, Yu Kang, Tianqiao Liu, et al.
Abstract
Existing task-specific predictive approaches for audio-and-language tasks focus on building complicated late-fusion mechanisms. However, these models tend to overfit when labels are limited and generalize poorly. In this paper, we present a Cross-modal Transformer for Audio-and-Language, i.e., CTAL, which aims to learn the intra-modality and inter-modality connections between audio and language through two proxy tasks on a large amount of audio-and-language pairs: masked language modeling and masked cross-modal acoustic modeling. After fine-tuning our pre-trained model on multiple downstream audio-and-language tasks, we observe significant improvements across various tasks, such as emotion classification, sentiment analysis, and speaker verification. On this basis, we further propose a specially designed fusion mechanism that can be used in the fine-tuning phase, which allows our pre-trained model to achieve better performance. Lastly, we demonstrate detailed ablation
Related papers
- CALM: Contrastive Aligned Audio-language Multirate And Multimodal Representations (2022) · 0.00
- Cross-lingual Text-to-speech Using Multi-task Learning And Speaker Classifier Joint Training (2022) · 0.00
- Improving Non-autoregressive End-to-end Speech Recognition With Pre-trained Acoustic And Language Models (2022) · 10.07
- Audio-enhanced Vision-language Modeling With Latent Space Broadening For High Quality Data Expansion (2025) · 0.00
- Masked Pre-trained Encoder Base On Joint Ctc-transformer (2020) · 0.00
- Leveraging Unimodal Self-supervised Learning For Multimodal Audio-visual Speech Recognition (2022) · 11.29
- Efficient Selective Audio Masked Multimodal Bottleneck Transformer For Audio-video Classification (2024) · 0.00
- Learning Contextually Fused Audio-visual Representations For Audio-visual Speech Recognition (2022) · 6.77