Recursive Attentive Pooling For Extracting Speaker Embeddings From Multi-speaker Recordings
2024 · Shota Horiguchi, Atsushi Ando, Takafumi Moriya, et al.
Abstract
This paper proposes a method for extracting a speaker embedding for each speaker from a variable-length recording containing multiple speakers. Speaker embeddings are crucial not only for speaker recognition but also for various multi-speaker speech applications such as speaker diarization and target-speaker speech processing. Despite the challenges of obtaining a single speaker's speech without pre-registration in multi-speaker scenarios, most studies on speaker embedding extraction focus on extracting embeddings only from single-speaker recordings. Some methods have been proposed for extracting speaker embeddings directly from multi-speaker recordings, but they typically require preparing a model for each possible number of speakers or involve complicated training procedures. The proposed method computes the embeddings of multiple speakers by focusing on different parts of the frame-wise embeddings extracted from the input multi-speaker audio. This is achieved by recursively computing
Related papers
- Attentive Statistics Pooling For Deep Speaker Embedding (2018)
- Training Speaker Embedding Extractors Using Multi-speaker Audio With Unknown Speaker Boundaries (2022)
- Deep Speaker Embedding Learning With Multi-level Pooling For Text-independent Speaker Verification (2019)
- Double Multi-head Attention For Speaker Verification (2020)
- Self Multi-head Attention For Speaker Recognition (2019)
- Multi-stage Speaker Extraction With Utterance And Frame-level Reference Signals (2020)
- Removing Speaker Information From Speech Representation Using Variable-length Soft Pooling (2024)
- End-to-end Multi-microphone Speaker Extraction Using Relative Transfer Functions (2025)