Learning Speech Representation From Contrastive Token-acoustic Pretraining
2023 · Chunyu Qiang, Hao Li, Yixin Tian, et al.
Abstract
For fine-grained generation and recognition tasks such as minimally-supervised text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), the intermediate representations extracted from speech should serve as a "bridge" between text and acoustic information, containing information from both modalities. The semantic content is emphasized, while the paralinguistic information such as speaker identity and acoustic details should be de-emphasized. However, existing methods for extracting fine-grained intermediate representations from speech suffer from issues of excessive redundancy and dimension explosion. Contrastive learning is a good method for modeling intermediate representations from two modalities. However, existing contrastive learning methods in the audio field focus on extracting global descriptive information for downstream audio classification tasks, making them unsuitable for TTS, VC, and ASR tasks. To address these issues, we propose a method named
Related papers
- CLAPSpeech: Learning Prosody From Text Context With Contrastive Language-audio Pre-training (2023) · 0.00
- Learning Disentangled Speech Representations With Contrastive Learning And Time-invariant Retrieval (2024) · 5.84
- Automatic Data Augmentation Selection And Parametrization In Contrastive Self-supervised Speech Representation Learning (2022) · 5.24
- Boosting Multi-speaker Expressive Speech Synthesis With Semi-supervised Contrastive Learning (2023) · 5.24
- Transfer The Linguistic Representations From TTS To Accent Conversion With Non-parallel Data (2024) · 6.77
- Towards Transfer Learning For End-to-end Speech Synthesis From Deep Pre-trained Language Models (2019) · 0.00
- On Scaling Contrastive Representations For Low-resource Speech Recognition (2021) · 3.58
- Towards Robust Few-shot Class Incremental Learning In Audio Classification Using Contrastive Representation (2024) · 4.52