Multi-QuartzNet: Multi-resolution Convolution For Speech Recognition With Multi-layer Feature Fusion
2020 · Jian Luo, Jianzong Wang, Ning Cheng, et al.
Abstract
In this paper, we propose an end-to-end speech recognition network based on Nvidia's earlier QuartzNet model. To improve the model's performance, we design three components: (1) a Multi-Resolution Convolution Module, which replaces the original 1D time-channel separable convolution with multi-stream convolutions, each stream using a different dilation in its convolutional operations; (2) a Channel-Wise Attention Module, which computes the attention weight of each convolutional stream by spatial channel-wise pooling; and (3) a Multi-Layer Feature Fusion Module, which reweights each convolutional block using global multi-layer feature maps. Our experiments show that the Multi-QuartzNet model achieves a character error rate (CER) of 6.77% on the AISHELL-1 dataset, outperforming the original QuartzNet and coming close to the state-of-the-art result.
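To make components (1) and (2) concrete, here is a minimal pure-Python sketch of one way multi-stream dilated convolutions could be fused with channel-wise attention. It is not the authors' implementation. The function names, the zero "same" padding, and the plain softmax over time-averaged activations with no learned projection are all assumptions made for illustration. The pointwise half of the time-channel separable convolution and the multi-layer feature fusion module are left out.

```python
import math


def dilated_depthwise_conv1d(x, kernels, dilation):
    """Depthwise 1D convolution with zero 'same' padding.

    x:        list of channels, each a list of T floats
    kernels:  one list of K taps per channel (K assumed odd)
    dilation: spacing between taps (hypothetical per-stream setting)
    """
    k = len(kernels[0])
    pad = dilation * (k - 1) // 2
    out = []
    for channel, taps in zip(x, kernels):
        n_frames = len(channel)
        row = []
        for t in range(n_frames):
            acc = 0.0
            for j, w in enumerate(taps):
                idx = t + j * dilation - pad
                if 0 <= idx < n_frames:  # positions outside the signal count as zero
                    acc += w * channel[idx]
            row.append(acc)
        out.append(row)
    return out


def multi_resolution_block(x, stream_kernels, dilations):
    """Run one stream per dilation, then fuse streams with channel-wise attention.

    Assumption: the attention weight of a stream for a given channel is a
    softmax (over streams) of that stream's time-averaged activation.
    Returns (fused, weights) where weights[c][i] is the weight of stream i
    for channel c.
    """
    streams = [
        dilated_depthwise_conv1d(x, kernels, d)
        for kernels, d in zip(stream_kernels, dilations)
    ]
    n_channels, n_frames = len(x), len(x[0])
    fused, weights = [], []
    for c in range(n_channels):
        # Global average pooling over time, one score per stream.
        pooled = [sum(s[c]) / n_frames for s in streams]
        top = max(pooled)  # subtract the max for numerical stability
        exps = [math.exp(p - top) for p in pooled]
        total = sum(exps)
        w = [e / total for e in exps]
        weights.append(w)
        fused.append([
            sum(w[i] * streams[i][c][t] for i in range(len(streams)))
            for t in range(n_frames)
        ])
    return fused, weights
```

With the same kernel length in every stream, a larger dilation widens that stream's receptive field without adding parameters, which is the usual motivation for mixing resolutions this way.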
Related papers
- Qifusion-net: Layer-adapted Stream/non-stream Model For End-to-end Multi-accent Speech Recognition (2024) 3.58
- AMFFCN: Attentional Multi-layer Feature Fusion Convolution Network For Audio-visual Speech Enhancement (2021) 0.00
- Spatialnet: Extensively Learning Spatial Information For Multichannel Joint Speech Separation, Denoising And Dereverberation (2023) 13.88
- Multi-scale Feature Fusion Transformer Network For End-to-end Single Channel Speech Separation (2022) 0.00
- Speech Enhancement Using Multi-stage Self-attentive Temporal Convolutional Networks (2021) 14.15
- 3M: Multi-loss, Multi-path And Multi-level Neural Networks For Speech Recognition (2022) 8.67
- Overlapped Speech Recognition From A Jointly Learned Multi-channel Neural Speech Extraction And Representation (2019) 0.00
- Multi-encoder Multi-resolution Framework For End-to-end Speech Recognition (2018) 0.00