MMSpeech: Multi-modal Multi-task Encoder-decoder Pre-training For Speech Recognition
2022 · Xiaohuan Zhou, Jiaming Wang, Zeyu Cui, et al.
Abstract
In this paper, we propose a novel multi-modal multi-task encoder-decoder pre-training framework (MMSpeech) for Mandarin automatic speech recognition (ASR), which employs both unlabeled speech and text data. The main difficulty in speech-text joint pre-training comes from the significant difference between speech and text modalities, especially for Mandarin speech and text. Unlike English and other languages with an alphabetic writing system, Mandarin uses an ideographic writing system where character and sound are not tightly mapped to one another. Therefore, we propose to introduce the phoneme modality into pre-training, which can help capture modality-invariant information between Mandarin speech and text. Specifically, we employ a multi-task learning framework including five self-supervised and supervised tasks with speech and text data. For end-to-end pre-training, we introduce self-supervised speech-to-pseudo-codes (S2C) and phoneme-to-text (P2T) tasks utilizing unlabeled speech a
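The loose mapping between Mandarin characters and sounds that motivates the phoneme modality can be illustrated with a minimal Python sketch. The lexicon below is a hand-written toy, not the pronunciation dictionary used by MMSpeech, and the names `TOY_LEXICON` and `readings_to_characters` are hypothetical. It only shows that one sound can correspond to several characters and one character to several sounds.

```python
from collections import defaultdict

# Toy pronunciation lexicon (an assumption for illustration only):
# each character maps to common pinyin readings, with tone numbers.
TOY_LEXICON = {
    "是": ["shi4"],
    "事": ["shi4"],
    "市": ["shi4"],
    "试": ["shi4"],
    "行": ["xing2", "hang2"],  # two common readings of one character
}


def readings_to_characters(lexicon):
    """Invert the lexicon: pinyin reading -> sorted list of characters."""
    inverted = defaultdict(list)
    for char, readings in lexicon.items():
        for reading in readings:
            inverted[reading].append(char)
    return {reading: sorted(chars) for reading, chars in inverted.items()}


INVERTED = readings_to_characters(TOY_LEXICON)

# One sound, several characters (homophones).
homophones = INVERTED["shi4"]

# One character, several sounds.
polyphone_readings = TOY_LEXICON["行"]

if __name__ == "__main__":
    print("characters pronounced shi4:", homophones)
    print("readings of 行:", polyphone_readings)
```

In this toy setting, recovering text from phonemes is ambiguous without context. A phoneme-to-text task has to resolve that kind of ambiguity.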
Related papers
- Discrete Multimodal Transformers With A Pretrained Large Language Model For Mixed-supervision Speech Processing (2024) · 0.00
- SpeechT5: Unified-modal Encoder-decoder Pre-training For Spoken Language Processing (2021) · 6.32
- Attention-based End-to-end Speech Recognition On Voice Search (2017) · 0.00
- Pre-training Transformer Decoder For End-to-end ASR Model With Unpaired Speech Data (2022) · 13.47
- Improving Audio-visual Speech Recognition By Lip-subword Correlation Based Visual Pre-training And Cross-modal Fusion Encoder (2023) · 6.34
- Multi-modal Data Augmentation For End-to-end ASR (2018) · 11.67
- E2E-based Multi-task Learning Approach To Joint Speech And Accent Recognition (2021) · 0.00
- Self-supervised Learning Based Monaural Speech Enhancement With Multi-task Pre-training (2021) · 0.00