Self-supervised Audio-and-text Pre-training With Extremely Low-resource Parallel Data
2022 · Yu Kang, Tianqiao Liu, Hang Li, et al.
Abstract
Multimodal pre-training for audio-and-text has recently proven effective and has significantly improved the performance of many downstream speech understanding tasks. However, these state-of-the-art audio-text pre-training models work well only when provided with large amounts of parallel audio-and-text data, which poses challenges for many languages that are rich in unimodal corpora but scarce in parallel cross-modal corpora. In this paper, we investigate whether it is possible to pre-train an audio-text multimodal model with extremely low-resource parallel data and extra non-parallel unimodal data. Our pre-training framework consists of the following components: (1) Intra-modal Denoising Auto-Encoding (IDAE), which reconstructs input text (audio) representations from a noisy version of themselves. (2) Cross-modal Denoising Auto-Encoding (CDAE), which is pre-trained to reconstruct the input text (audio), given both a noisy version of the input text (audio) and the corre
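As a rough illustration of the intra-modal denoising idea described above, the sketch below builds a (noisy input, clean target) training pair. The abstract does not say which noise function is used, so random masking, the `MASK` symbol, and the names `add_noise` and `make_idae_example` are all assumptions made for illustration, not the authors' implementation.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol, not taken from the paper


def add_noise(tokens, mask_prob=0.3, seed=0):
    """Return a corrupted copy of `tokens`; each item is masked with probability mask_prob.

    Random masking is an assumed noise function chosen for this sketch.
    """
    rng = random.Random(seed)  # local generator keeps the corruption reproducible
    return [MASK if rng.random() < mask_prob else t for t in tokens]


def make_idae_example(tokens, mask_prob=0.3, seed=0):
    """Build an intra-modal denoising pair: (noisy input, clean reconstruction target)."""
    return add_noise(tokens, mask_prob, seed), list(tokens)


if __name__ == "__main__":
    noisy, target = make_idae_example(["the", "cat", "sat", "down"], mask_prob=0.5)
    print(noisy)   # corrupted sequence fed to the model
    print(target)  # original sequence the model is trained to reconstruct
```

The same pairing would apply to audio by treating `tokens` as a sequence of frame-level features instead of words; a model is then trained to map the noisy sequence back to the clean one.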
Related papers
- BLAT: Bootstrapping Language-audio Pre-training Based On AudioSet Tag-guided Synthetic Data (2023)
- Self-supervised Learning Based Monaural Speech Enhancement With Multi-task Pre-training (2021)
- SpeechLM: Enhanced Speech Pre-training With Unpaired Textual Data (2022)
- Leveraging Unimodal Self-supervised Learning For Multimodal Audio-visual Speech Recognition (2022)
- Connecting The Dots Between Audio And Text Without Parallel Data Through Visual Knowledge Transfer (2021)
- Learning Audio-video Modalities From Image Captions (2022)
- Almost Unsupervised Text To Speech And Automatic Speech Recognition (2019)
- Joint Training Or Not: An Exploration Of Pre-trained Speech Models In Audio-visual Speaker Diarization (2023)