Qifusion-Net: Layer-adapted Stream/non-stream Model For End-to-end Multi-accent Speech Recognition
2024 · Jinming Chen, Jingyi Fang, Yuanzhong Zheng, et al.
Abstract
End-to-end (E2E) speech recognition methods have achieved promising performance. However, automatic speech recognition (ASR) models still struggle to recognize multi-accent speech accurately. We propose a layer-adapted fusion (LAF) model, called Qifusion-Net, which does not require any prior knowledge about the target accent. Based on a dynamic chunk strategy, our approach enables streaming decoding and can extract frame-level acoustic features, facilitating fine-grained information fusion. Experimental results demonstrate that our proposed method outperforms the baseline with relative reductions of 22.1% and 17.2% in character error rate (CER) across multi-accent test datasets on KeSpeech and MagicData-RMAC.
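The abstract does not spell out how the dynamic chunk strategy works. As a rough illustration only, the sketch below shows one common way chunk-based attention masks are built so that a single encoder can run in both streaming and non-streaming modes. The function names, the `num_left_chunks` parameter, and the 50/50 sampling rule are assumptions made for this example, not details taken from the paper.

```python
import random


def chunk_attention_mask(num_frames, chunk_size, num_left_chunks=-1):
    """Return mask[i][j] == True when frame i may attend to frame j.

    Hypothetical helper: frames see their own chunk plus past chunks,
    never future chunks. num_left_chunks < 0 keeps all past chunks.
    """
    mask = []
    for i in range(num_frames):
        chunk_idx = i // chunk_size
        # Visible range ends at the last frame of the current chunk.
        end = min((chunk_idx + 1) * chunk_size, num_frames)
        # Optionally limit how many past chunks remain visible.
        if num_left_chunks < 0:
            start = 0
        else:
            start = max((chunk_idx - num_left_chunks) * chunk_size, 0)
        mask.append([start <= j < end for j in range(num_frames)])
    return mask


def sample_chunk_size(num_frames, max_chunk=25, rng=random):
    """Pick a chunk size for one training batch (assumed sampling rule)."""
    # Half the time use the whole utterance (non-streaming behaviour),
    # otherwise a random small chunk (streaming behaviour).
    if rng.random() < 0.5:
        return num_frames
    return rng.randint(1, max_chunk)


if __name__ == "__main__":
    for row in chunk_attention_mask(num_frames=6, chunk_size=2):
        print("".join("x" if visible else "." for visible in row))
```

Setting `chunk_size` equal to the utterance length gives full attention, while a small `chunk_size` bounds how far ahead each frame can look, which is what makes chunk-by-chunk streaming decoding possible.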
Related papers
- Layer-wise Fast Adaptation For End-to-end Multi-accent Speech Recognition (2022)
- Multi-quartznet: Multi-resolution Convolution For Speech Recognition With Multi-layer Feature Fusion (2020)
- Unified Streaming And Non-streaming Two-pass End-to-end Model For Speech Recognition (2020)
- Improved Neural Language Model Fusion For Streaming Recurrent Neural Network Transducer (2020)
- Interactive Feature Fusion For End-to-end Noise-robust Speech Recognition (2021)
- Delayed Fusion: Integrating Large Language Models Into First-pass Decoding In End-to-end Speech Recognition (2025)
- A Multi-level Acoustic Feature Extraction Framework For Transformer Based End-to-end Speech Recognition (2021)
- AMFFCN: Attentional Multi-layer Feature Fusion Convolution Network For Audio-visual Speech Enhancement (2021)