A Multi-level Acoustic Feature Extraction Framework For Transformer Based End-to-end Speech Recognition
2021 · Jin Li, Rongfeng Su, Xurong Xie, et al.
Abstract
Transformer-based end-to-end modelling approaches with multiple stream inputs have achieved great success in various automatic speech recognition (ASR) tasks. An important issue with such approaches is that the intermediate features derived from each stream might have similar representations and thus lack feature diversity, such as descriptions related to speaker characteristics. To address this issue, this paper proposes a novel multi-level acoustic feature extraction framework that can be easily combined with Transformer-based ASR models. The framework consists of two input streams: a shallow stream with high-resolution spectrograms and a deep stream with low-resolution spectrograms. The shallow stream is used to acquire traditional shallow features that are beneficial for the classification of phones or words, while the deep stream is used to obtain utterance-level speaker-invariant deep features for improving the feature diversity. A feature correlatio
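As a rough illustration of the two input streams described in the abstract, the sketch below computes a higher-resolution and a lower-resolution magnitude spectrogram from the same waveform. It is not the authors' implementation. The abstract does not say how resolution is defined, so this sketch assumes it refers to the number of frequency bins. The function name `magnitude_spectrogram`, the FFT sizes, the hop length, and the synthetic test tone are all hypothetical choices made for the example.

```python
import numpy as np

def magnitude_spectrogram(signal, n_fft, hop):
    """Return a (frames, n_fft // 2 + 1) magnitude spectrogram from a Hann-windowed STFT."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))

# One second of a synthetic 440 Hz tone at 16 kHz, standing in for real speech.
sr = 16000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440.0 * t)

# Assumed settings: a larger FFT gives more frequency bins (higher resolution).
high_res = magnitude_spectrogram(audio, n_fft=512, hop=160)  # shallow-stream input
low_res = magnitude_spectrogram(audio, n_fft=128, hop=160)   # deep-stream input

print(high_res.shape, low_res.shape)
```

In a full system, each spectrogram would feed its own encoder branch before the features are combined. That part is omitted here because the abstract is cut off before it describes the combination step.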
Related papers
- Conv-transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-end Speech Recognition (2020) · 11.08
- Transformer Transducer: One Model Unifying Streaming And Non-streaming Speech Recognition (2020) · 0.00
- Streaming Transformer-based Acoustic Models Using Self-attention With Augmented Memory (2020) · 0.00
- Multi-scale Feature Fusion Transformer Network For End-to-end Single Channel Speech Separation (2022) · 0.00
- Transformer-based Acoustic Modeling For Hybrid Speech Recognition (2019) · 16.30
- Echotune: A Modular Extractor Leveraging The Variable-length Nature Of Speech In ASR Tasks (2023) · 0.00
- Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers (2021) · 10.07
- End-to-end Multichannel Speaker-attributed ASR: Speaker Guided Decoder And Input Feature Analysis (2023) · 0.00