Multi-modal Data Augmentation For End-to-end ASR
2018 · Adithya Renduchintala, Shuoyang Ding, Matthew Wiesner, et al.
Abstract
We present a new end-to-end architecture for automatic speech recognition (ASR) that can be trained using *symbolic* input in addition to the traditional acoustic input. This architecture utilizes two separate encoders: one for acoustic input and another for symbolic input, both sharing the attention and decoder parameters. We call this architecture a multi-modal data augmentation network (MMDA), as it can support multi-modal (acoustic and symbolic) input and enables seamless mixing of large text datasets with significantly smaller transcribed speech corpora during training. We study different ways of transforming large text corpora into a symbolic form suitable for training our MMDA network. Our best MMDA setup obtains small improvements on character error rate (CER), and as much as 7-10% relative word error rate (WER) improvement over a baseline both with and without an external language model.
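The parameter sharing described in the abstract can be illustrated with a toy sketch: two modality-specific encoders feed one shared attention-and-decoder component, so text-derived symbolic batches and acoustic batches update the same decoder. This is not the authors' implementation. The class names (`ToyMMDA`, `AcousticEncoder`, `SymbolicEncoder`, `SharedAttentionDecoder`), the hidden size `DIM`, the single-query attention, and the random untrained weights are all assumptions made for illustration.

```python
import math
import random

DIM = 4  # hypothetical hidden size; both encoders must emit states of this size


def make_matrix(rows, cols, rng):
    """Small random weight matrix stored as a list of rows."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class AcousticEncoder:
    """Toy stand-in: projects each acoustic feature frame to a DIM-sized state."""

    def __init__(self, feat_dim, rng):
        self.proj = make_matrix(DIM, feat_dim, rng)

    def __call__(self, frames):
        return [matvec(self.proj, frame) for frame in frames]


class SymbolicEncoder:
    """Toy stand-in: looks up a DIM-sized embedding for each symbol id."""

    def __init__(self, vocab_size, rng):
        self.embed = make_matrix(vocab_size, DIM, rng)

    def __call__(self, symbol_ids):
        return [self.embed[i] for i in symbol_ids]


class SharedAttentionDecoder:
    """One attention step plus an output projection, shared by both modalities."""

    def __init__(self, out_vocab, rng):
        self.query = [rng.uniform(-0.1, 0.1) for _ in range(DIM)]
        self.out = make_matrix(out_vocab, DIM, rng)

    def __call__(self, states):
        # Dot-product scores between a fixed query and every encoder state.
        scores = [sum(q * s for q, s in zip(self.query, st)) for st in states]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(exps)
        weights = [e / total for e in exps]
        # Context vector: attention-weighted sum of encoder states.
        context = [sum(w * st[d] for w, st in zip(weights, states))
                   for d in range(DIM)]
        return weights, matvec(self.out, context)


class ToyMMDA:
    """Routes each batch to its modality's encoder, then to the shared decoder."""

    def __init__(self, feat_dim, sym_vocab, out_vocab, seed=0):
        rng = random.Random(seed)
        self.encoders = {
            "acoustic": AcousticEncoder(feat_dim, rng),
            "symbolic": SymbolicEncoder(sym_vocab, rng),
        }
        self.decoder = SharedAttentionDecoder(out_vocab, rng)

    def forward(self, modality, inputs):
        states = self.encoders[modality](inputs)
        return self.decoder(states)


model = ToyMMDA(feat_dim=3, sym_vocab=5, out_vocab=6)
batches = [
    ("acoustic", [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]]),  # two feature frames
    ("symbolic", [1, 4, 2]),                            # three symbol ids
]
for modality, inputs in batches:
    weights, logits = model.forward(modality, inputs)
    print(modality, len(weights), len(logits))
```

Because `forward` only switches the encoder, any training signal from a symbolic batch would reach the same attention and decoder weights used for speech, which is the mechanism the abstract credits for letting large text corpora supplement smaller transcribed speech corpora.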
Related papers
- Improving Multimodal Speech Recognition By Data Augmentation And Speech Representations (2022) · 9.03
- Data Augmentation For End-to-end Code-switching Speech Recognition (2020) · 9.92
- On-the-fly Aligned Data Augmentation For Sequence-to-sequence ASR (2021) · 9.23
- Acoustic Data-driven Subword Modeling For End-to-end Speech Recognition (2021) · 6.77
- META-CAT: Speaker-informed Speech Embeddings Via Meta Information Concatenation For Multi-talker ASR (2024) · 3.58
- Back-translation-style Data Augmentation For End-to-end ASR (2018) · 13.11
- Improving Code-switching And Named Entity Recognition In ASR With Speech Editing Based Data Augmentation (2023) · 6.34
- Mixspeech: Data Augmentation For Low-resource Automatic Speech Recognition (2021) · 13.60