Two-stage Augmentation And Adaptive CTC Fusion For Improved Robustness Of Multi-stream End-to-end ASR
2021 · Ruizhi Li, Gregory Sell, Hynek Hermansky
Abstract
Performance degradation of an Automatic Speech Recognition (ASR) system is commonly observed when test acoustic conditions differ from those seen in training. Hence, it is essential to make ASR systems robust against various environmental distortions, such as background noise and reverberation. In a multi-stream paradigm, improving robustness involves handling both a variety of unseen single-stream conditions and inter-stream dynamics. Previously, a practical two-stage training strategy was proposed for multi-stream end-to-end ASR, in which Stage-2 formulates the multi-stream model using features from the Stage-1 Universal Feature Extractor (UFE). In this paper, as an extension, we introduce a two-stage augmentation scheme focusing on mismatch scenarios: Stage-1 Augmentation addresses single-stream input variety with data augmentation techniques, and Stage-2 Time Masking applies temporal masks to the UFE features of randomly selected streams to simulate diverse stream combinations. During i…
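The Stage-2 Time Masking idea (temporal masks on the UFE features of randomly selected streams) can be illustrated with a minimal sketch. Everything concrete here is a hypothetical choice for illustration, not the authors' implementation: the function name `stage2_time_mask`, selecting each stream independently with probability `select_prob`, the number and maximum width of masks, and zero-filling the masked frames. Features are plain nested lists (frames × feature dims) so the sketch needs only the standard library.

```python
import copy
import random
from typing import List, Optional

Frame = List[float]
Stream = List[Frame]  # one stream: T frames x D dims (e.g. UFE outputs)


def stage2_time_mask(
    streams: List[Stream],
    select_prob: float = 0.5,
    num_masks: int = 2,
    max_mask_len: int = 10,
    rng: Optional[random.Random] = None,
) -> List[Stream]:
    """Return a time-masked copy of `streams`; the input is not modified.

    Hypothetical sketch: each stream is independently selected with
    probability `select_prob`. Each selected stream receives `num_masks`
    temporal masks whose widths are drawn uniformly from
    [0, max_mask_len] (capped at the stream length). Masked frames are
    zero-filled.
    """
    rng = rng or random.Random()
    out = copy.deepcopy(streams)
    for stream in out:
        if rng.random() >= select_prob:
            continue  # this stream is not selected and stays intact
        num_frames = len(stream)
        for _ in range(num_masks):
            width = rng.randint(0, min(max_mask_len, num_frames))
            start = rng.randint(0, num_frames - width)
            for t in range(start, start + width):
                stream[t] = [0.0] * len(stream[t])
    return out


if __name__ == "__main__":
    # Three streams, 20 frames each, 2-dim features.
    feats = [[[1.0, 2.0] for _ in range(20)] for _ in range(3)]
    masked = stage2_time_mask(feats, select_prob=0.5, num_masks=1,
                              max_mask_len=5, rng=random.Random(0))
    for i, s in enumerate(masked):
        print(i, sum(1 for f in s if f == [0.0, 0.0]), "frames masked")
```

Because only some streams are masked in a given example, the downstream multi-stream model sees inputs in which one stream is degraded while the others stay clean. This is one plausible way to "simulate diverse stream combinations"; the paper may choose streams or mask values differently.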
Related papers
- An Investigation Of Enhancing CTC Model For Triggered Attention-based Streaming ASR (2021) · 0.00
- Improved Mask-ctc For Non-autoregressive End-to-end ASR (2020) · 11.76
- U2++: Unified Two-pass Bidirectional End-to-end Model For Speech Recognition (2021) · 0.00
- Mask-ctc-based Encoder Pre-training For Streaming End-to-end Speech Recognition (2023) · 0.00
- Audio Adversarial Examples For Robust Hybrid Ctc/attention Speech Recognition (2020) · 3.58
- Unified Streaming And Non-streaming Two-pass End-to-end Model For Speech Recognition (2020) · 0.00
- Joint Optimization Of Streaming And Non-streaming Automatic Speech Recognition With Multi-decoder And Knowledge Distillation (2024) · 0.00
- Speech Enhancement Using Multi-stage Self-attentive Temporal Convolutional Networks (2021) · 14.15