Cross-attention Is All You Need: Real-time Streaming Transformers For Personalised Speech Enhancement
2022 · Shucong Zhang, Malcolm Chadwick, Alberto Gil C. P. Ramos, et al.
Abstract
Personalised speech enhancement (PSE), which extracts only the speech of a target user and removes everything else from a recorded audio clip, can potentially improve users' experiences of audio AI modules deployed in the wild. To support a large variety of downstream audio tasks, such as real-time ASR and audio-call enhancement, a PSE solution should operate in a streaming mode, i.e., input audio cleaning should happen in real-time with a small latency and real-time factor. Personalisation is typically achieved by extracting a target speaker's voice profile from an enrolment audio, in the form of a static embedding vector, and then using it to condition the output of a PSE model. However, a fixed target speaker embedding may not be optimal under all conditions. In this work, we present a streaming Transformer-based PSE model and propose a novel cross-attention approach that gives adaptive target speaker representations. We present extensive experiments and show that our proposed cross
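The contrast the abstract draws between a static speaker embedding and an adaptive, cross-attention-based one can be illustrated with a small sketch. Everything below is an assumption made for illustration, not the authors' architecture: the abstract does not say which signals act as queries, keys, and values, so this sketch takes queries from the noisy input frames and keys and values from frame-level enrolment features. The function names, the dimensions, and the random projection matrices are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def static_speaker_embedding(enrol):
    # Baseline (assumed): one fixed vector, here the mean over enrolment frames.
    return enrol.mean(axis=0)

def cross_attention_speaker_repr(frames, enrol, w_q, w_k, w_v):
    # Assumed layout: queries from input frames, keys/values from enrolment frames.
    q = frames @ w_q                          # (T, d)
    k = enrol @ w_k                           # (N, d)
    v = enrol @ w_v                           # (N, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # scaled dot-product scores, (T, N)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights               # one speaker vector per input frame

rng = np.random.default_rng(0)
d = 8                                         # hypothetical feature size
frames = rng.standard_normal((5, d))          # 5 noisy input frames
enrol = rng.standard_normal((12, d))          # 12 enrolment frames
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

static = static_speaker_embedding(enrol)
adaptive, weights = cross_attention_speaker_repr(frames, enrol, w_q, w_k, w_v)

# Row t depends only on frame t and the enrolment audio, so the same values
# come out when only a prefix of the input is available.
prefix, _ = cross_attention_speaker_repr(frames[:3], enrol, w_q, w_k, w_v)
```

In this sketch the static baseline yields a single vector shared by all frames, while the cross-attention variant yields a different speaker vector for each input frame. Because each output row uses only the current frame and the enrolment features, the computation needs no future input, which is compatible with the streaming requirement described above.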
Related papers
- Personalized Speech Enhancement Without A Separate Speaker Embedding Model (2024)
- Real-time Joint Personalized Speech Enhancement And Acoustic Echo Cancellation (2022)
- A Lightweight Dual-stage Framework For Personalized Speech Enhancement Based On DeepFilterNet2 (2024)
- Conv-transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-end Speech Recognition (2020)
- The Potential Of Neural Speech Synthesis-based Data Augmentation For Personalized Speech Enhancement (2022)
- Personalized PercepNet: Real-time, Low-complexity Target Voice Separation And Enhancement (2021)
- SEF-PNet: Speaker Encoder-free Personalized Speech Enhancement With Local And Global Contexts Aggregation (2025)
- Continuous Target Speech Extraction: Enhancing Personalized Diarization And Extraction On Complex Recordings (2024)