Speaker-conditioned Target Speaker Extraction Based On Customized LSTM Cells
2021 · Ragini Sinha, Marvin Tammen, Christian Rollwage, et al.
Abstract
Speaker-conditioned target speaker extraction systems rely on auxiliary information about the target speaker to extract the target speaker signal from a mixture of multiple speakers. Typically, a deep neural network is applied to isolate the relevant target speaker characteristics. In this paper, we focus on a single-channel target speaker extraction system based on a CNN-LSTM separator network and a speaker embedder network requiring reference speech of the target speaker. In the LSTM layer of the separator network, we propose to customize the LSTM cells in order to only remember the specific voice patterns corresponding to the target speaker by modifying the information processing in the forget gate. Experimental results for two-speaker mixtures using the LibriSpeech dataset show that this customization significantly improves the target speaker extraction performance compared to using standard LSTM cells.
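The abstract does not spell out the modified gate equations, so the following is only a rough illustration of what "modifying the information processing in the forget gate" could look like. It assumes one plausible variant in which the forget gate is computed from the target speaker embedding alone, while the other gates stay as in a standard LSTM. The function names, parameter names, and dimensions are hypothetical and are not taken from the paper.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, v, b):
    """Compute W @ v + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]


def init_params(n_in, n_emb, n_hid, seed=0):
    """Random toy weights; every name here is hypothetical."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    p = {}
    for gate in ("i", "g", "o"):
        # Input gate, candidate, and output gate see [x_t, h_prev] as usual.
        p["W_" + gate] = mat(n_hid, n_in + n_hid)
        p["b_" + gate] = [0.0] * n_hid
    # ASSUMPTION: the forget gate sees only the speaker embedding.
    p["W_f"] = mat(n_hid, n_emb)
    p["b_f"] = [0.0] * n_hid
    return p


def customized_lstm_step(x_t, h_prev, c_prev, spk_emb, p):
    """One time step of the assumed speaker-conditioned LSTM cell."""
    z = x_t + h_prev  # list concatenation: [x_t, h_prev]
    i = [sigmoid(a) for a in affine(p["W_i"], z, p["b_i"])]
    g = [math.tanh(a) for a in affine(p["W_g"], z, p["b_g"])]
    o = [sigmoid(a) for a in affine(p["W_o"], z, p["b_o"])]
    # Forget gate depends on the speaker embedding, not on x_t or h_prev.
    f = [sigmoid(a) for a in affine(p["W_f"], spk_emb, p["b_f"])]
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c, f


# Toy usage: two frames of a 3-dimensional input, a 2-dimensional embedding.
params = init_params(n_in=3, n_emb=2, n_hid=4)
spk_emb = [0.3, -0.7]
h, c = [0.0] * 4, [0.0] * 4
for x_t in ([0.1, 0.2, 0.3], [-0.4, 0.0, 0.5]):
    h, c, f = customized_lstm_step(x_t, h, c, spk_emb, params)
print(len(h), len(c), len(f))
```

Under this assumption the forget gate is constant over time for a given speaker, so what the cell state retains is decided by the speaker identity rather than by the mixture frames. The paper's actual formulation may differ.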
Related papers
- Speaker-conditioning Single-channel Target Speaker Extraction Using Conformer-based Architectures (2022)
- New Insights On Target Speaker Extraction (2022)
- Spectron: Target Speaker Extraction Using Conditional Transformer With Adversarial Refinement (2024)
- VoiceFilter: Targeted Voice Separation By Speaker-conditioned Spectrogram Masking (2018)
- Lightweight Dual-channel Target Speaker Separation For Mobile Voice Communication (2021)
- Memory Time Span In LSTMs For Multi-speaker Source Separation (2018)
- Single-channel Speech Separation With Auxiliary Speaker Embeddings (2019)
- Individualized Conditioning And Negative Distances For Speaker Separation (2022)