End-to-end Target Speaker Speech Recognition Using Context-aware Attention Mechanisms For Challenging Enrollment Scenario
2025 · Mohsen Ghane, Mohammad Sadegh Safari
Abstract
This paper presents a novel streaming end-to-end target-speaker speech recognition system that addresses two critical limitations of existing systems: the handling of noisy enrollment utterances and the requirement for specific enrollment phrases. It proposes a robust Target-Speaker Recurrent Neural Network Transducer (TS-RNNT) with dual attention mechanisms for contextual biasing and for processing overlapping enrollment audio. The model incorporates a text decoder and an attention mechanism specifically designed to extract relevant speaker characteristics from noisy, overlapping enrollment audio. Experimental results on a synthesized dataset demonstrate the model's resilience: it maintains a Word Error Rate (WER) of 16.44% even with overlapping enrollment at a 5 dB Signal-to-Interference Ratio (SIR), whereas conventional approaches degrade to WERs above 75% under similar conditions. This significant performance improvement, coupled with the model's semi-text-dependent enrollment capabilities, represents a s
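To make the "overlapping enrollment at 5 dB SIR" condition concrete, the following is a minimal sketch of how an interfering signal can be scaled to reach a chosen Signal-to-Interference Ratio before being added to the target speaker's enrollment audio. It uses the standard definition SIR = 10·log10(P_target / P_interferer); the function names `mean_power`, `mix_at_sir`, and `measured_sir_db` are hypothetical, and nothing here is taken from the paper's actual data-synthesis pipeline.

```python
import math


def mean_power(signal):
    """Average power of a sequence of samples."""
    return sum(s * s for s in signal) / len(signal)


def mix_at_sir(target, interferer, sir_db):
    """Scale `interferer` so the target-to-interferer power ratio is `sir_db`,
    then add it to `target`.

    Returns (mixture, scaled_interferer). Illustrative helper only; the
    paper's own mixing procedure is not described in this abstract.
    """
    if len(target) != len(interferer):
        raise ValueError("signals must have the same length")
    p_target = mean_power(target)
    p_interf = mean_power(interferer)
    if p_target == 0 or p_interf == 0:
        raise ValueError("signals must have non-zero power")
    # Solve 10*log10(p_target / (gain^2 * p_interf)) = sir_db for gain.
    gain = math.sqrt(p_target / (p_interf * 10 ** (sir_db / 10)))
    scaled = [gain * s for s in interferer]
    mixture = [t + i for t, i in zip(target, scaled)]
    return mixture, scaled


def measured_sir_db(target, interferer):
    """SIR in dB between a target signal and an interfering signal."""
    return 10 * math.log10(mean_power(target) / mean_power(interferer))


# Toy signals standing in for target and interfering speech.
target = [math.sin(0.1 * n) for n in range(1000)]
interferer = [0.3 * math.cos(0.37 * n) for n in range(1000)]
mixture, scaled = mix_at_sir(target, interferer, 5.0)
```

At 5 dB SIR the target speaker carries roughly three times the power of the interferer (10^0.5 ≈ 3.16), so the enrollment audio remains dominated by the target but is clearly contaminated.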
Related papers
- Target Speaker Extraction By Directly Exploiting Contextual Information In The Time-frequency Domain (2024)
- Towards A Competitive End-to-end Speech Recognition For CHiME-6 Dinner Party Transcription (2020)
- Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers (2021)
- Dualstream Contextual Fusion Network: Efficient Target Speaker Extraction By Leveraging Mixture And Enrollment Interactions (2025)
- Lightweight Speech Enhancement Guided Target Speech Extraction In Noisy Multi-speaker Scenarios (2025)
- Contextualized Streaming End-to-end Speech Recognition With Trie-based Deep Biasing And Shallow Fusion (2021)
- Joint CTC-attention Based End-to-end Speech Recognition Using Multi-task Learning (2016)
- Exploring Architectures, Data And Units For Streaming End-to-end Speech Recognition With RNN-Transducer (2018)