Echotune: A Modular Extractor Leveraging The Variable-length Nature Of Speech In ASR Tasks
2023 · Sizhou Chen, Songyang Gao, Sen Fang
Abstract
The Transformer architecture has proven highly effective for Automatic Speech Recognition (ASR) and has become a foundational component of much research in the field. Many earlier approaches rely on fixed-length attention windows, which are problematic for speech samples that vary in duration and complexity: they over-smooth the data and neglect essential long-term connectivity. To address this limitation, we introduce Echo-MSA, a nimble module equipped with a variable-length attention mechanism that accommodates a range of speech sample complexities and durations. The module can extract speech features at several granularities, from frames and phonemes to words and discourse. The proposed design captures the variable-length nature of speech and addresses the limitations of fixed-length attention. Our evaluation leverages a parallel attention architecture complemented by a dynamic gating mechanism that amalgam
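The abstract does not spell out how the module is built, so the following is only an illustrative sketch of the general idea: several attention branches with different window lengths run in parallel and are mixed by a per-frame gate. Everything here is an assumption made for illustration and is not the authors' Echo-MSA implementation: the function names, the window sizes, the softmax gate, and the simplification of using the input itself as query, key, and value with no learned projections.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; entries set to -inf receive weight 0.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def windowed_self_attention(x, window):
    """Single-head self-attention where frame i attends only to frames j
    with |i - j| <= window. Hypothetical helper: x serves as query, key,
    and value (no learned projections)."""
    t, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    idx = np.arange(t)
    mask = np.abs(idx[:, None] - idx[None, :]) <= window
    scores = np.where(mask, scores, -np.inf)
    return softmax(scores, axis=-1) @ x


def gated_multi_window_attention(x, windows, gate_w):
    """Run one attention branch per window size and mix the branches with a
    per-frame softmax gate (an assumed gating form, not taken from the paper)."""
    branches = np.stack([windowed_self_attention(x, w) for w in windows])  # (B, T, D)
    gates = softmax(x @ gate_w, axis=-1)                                   # (T, B)
    return np.einsum("tb,btd->td", gates, branches)


rng = np.random.default_rng(0)
x = rng.standard_normal((50, 16))            # 50 frames, 16-dim features
windows = [2, 8, 49]                         # short, medium, full-sequence context
gate_w = rng.standard_normal((16, len(windows))) * 0.1
out = gated_multi_window_attention(x, windows, gate_w)
print(out.shape)  # (50, 16)
```

In this sketch a small window behaves like a frame-level view, while a window as long as the sequence reduces to ordinary full self-attention; the gate lets each frame weight these views differently.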
Related papers
- A Multi-level Acoustic Feature Extraction Framework For Transformer Based End-to-end Speech Recognition (2021) 0.00
- Transformer-based Online CTC/attention End-to-end Speech Recognition Architecture (2020) 14.06
- S-transformer: Segment-transformer For Robust Neural Speech Synthesis (2020) 0.00
- Transformer-based Online Speech Recognition With Decoder-end Adaptive Computation Steps (2020) 7.81
- Improving Transformer-based Conversational ASR By Inter-sentential Attention Mechanism (2022) 7.50
- Attention-based ASR With Lightweight And Dynamic Convolutions (2019) 9.03
- Attentron: Few-shot Text-to-speech Utilizing Attention-based Variable-length Embedding (2020) 12.02
- Transformer-transducers For Code-switched Speech Recognition (2020) 10.97