USEV: Universal Speaker Extraction With Visual Cue
2021 · Zexu Pan, Meng Ge, Haizhou Li
Abstract
A speaker extraction algorithm seeks to extract the target speaker's speech from a multi-talker speech mixture. Prior studies focus mostly on speaker extraction from highly overlapped multi-talker speech mixtures. However, in natural speech communication the overlap ratio between the target and interfering speakers can vary anywhere from 0% to 100%, and the target speaker may be absent from the mixture altogether. Speech mixtures in such universal multi-talker scenarios are described as general speech mixtures. A speaker extraction algorithm requires an auxiliary reference, such as a video recording or pre-recorded speech, to form top-down auditory attention on the target speaker. We advocate that a visual cue, i.e., lip movement, is more informative than an audio cue, i.e., pre-recorded speech, as the auxiliary reference for disentangling the target speaker from a general speech mixture. In this paper, we propose a universal speaker ext
Related papers
- USEF-TSE: Universal Speaker Embedding Free Target Speaker Extraction (2024) · 11.88
- MuSE: Multi-modal Target Speaker Extraction With Visual Cues (2020) · 11.85
- Audio-visual Active Speaker Extraction For Sparsely Overlapped Multi-talker Speech (2023) · 7.50
- ImagiNet: Target Speaker Extraction With Intermittent Visual Cue Through Embedding Inpainting (2022) · 7.16
- USED: Universal Speaker Extraction And Diarization (2023) · 7.50
- Selective Listening By Synchronizing Speech With Lips (2021) · 11.85
- Single Microphone Speaker Extraction Using Unified Time-frequency Siamese-Unet (2022) · 3.58
- New Insights On Target Speaker Extraction (2022) · 0.00