The Sound Of My Voice: Speaker Representation Loss For Target Voice Separation
2019 · Seongkyu Mun, Soyeon Choe, Jaesung Huh, et al.
Abstract
Content and style representations have been widely studied in the field of style transfer. In this paper, we propose a new loss function using speaker content representation for audio source separation, which we call speaker representation loss. The objective is to extract the target speaker's voice from the noisy input and also remove it from the residual components. Compared with conventional spectral reconstruction, our proposed framework maximizes the use of target speaker information by minimizing the distance between the speaker representations of the reference and the source separation output. We also propose a triplet speaker representation loss as an additional criterion to remove the target speaker information from the residual spectrogram output. The VoiceFilter framework is adopted to evaluate source separation performance on the VCTK database, and we achieve improved performance compared to the baseline loss function without any additional network parameters.
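The two criteria described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the distance metric, the margin, or the exact triplet arrangement, so the cosine distance, the `margin` value, and the choice of the reference embedding as anchor (separated output as positive, residual as negative) are all assumptions made here for illustration. The embeddings are toy vectors standing in for the output of a speaker encoder.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def speaker_representation_loss(ref_emb, out_emb):
    """Distance between the reference embedding and the separated-output embedding."""
    return cosine_distance(ref_emb, out_emb)

def triplet_speaker_representation_loss(ref_emb, out_emb, res_emb, margin=0.5):
    """Hinge-style triplet loss (hypothetical arrangement and margin).

    Pulls the separated output toward the reference speaker and pushes
    the residual away from it.
    """
    d_pos = cosine_distance(ref_emb, out_emb)  # reference vs. separated output
    d_neg = cosine_distance(ref_emb, res_emb)  # reference vs. residual
    return max(0.0, d_pos - d_neg + margin)

# Toy embeddings; a real system would obtain these from a speaker encoder.
ref_emb = [1.0, 0.0, 0.0]
out_emb = [0.9, 0.1, 0.0]
res_emb = [0.0, 1.0, 0.0]

print(speaker_representation_loss(ref_emb, out_emb))
print(triplet_speaker_representation_loss(ref_emb, out_emb, res_emb))
```

In this sketch the triplet term falls to zero once the residual is at least `margin` farther from the reference than the separated output is, so it only penalizes cases where target speaker information still leaks into the residual.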
Related papers
- Improving Zero-shot Voice Style Transfer Via Disentangled Representation Learning (2021) 0.00
- VoiceFilter: Targeted Voice Separation By Speaker-conditioned Spectrogram Masking (2018) 17.48
- Individualized Conditioning And Negative Distances For Speaker Separation (2022) 2.26
- Residual Speaker Representation For One-shot Voice Conversion (2023) 0.00
- Disentangling Voice And Content With Self-supervision For Speaker Recognition (2023) 2.26
- Enriching Source Style Transfer In Recognition-synthesis Based Non-parallel Voice Conversion (2021) 9.23
- Optimizing Voice Conversion Network With Cycle Consistency Loss Of Speaker Identity (2020) 9.59
- VoiceID Loss: Speech Enhancement For Speaker Verification (2019) 13.39