UNSSOR: Unsupervised Neural Speech Separation By Leveraging Over-determined Training Mixtures

2023

Abstract

In reverberant conditions with multiple concurrent speakers, each microphone acquires a mixture signal of multiple speakers at a different location. In over-determined conditions where the microphones out-number speakers, we can narrow down the solutions to speaker images and realize unsupervised speech separation by leveraging each mixture signal as a constraint (i.e., the estimated speaker images at a microphone should add up to the mixture). Equipped with this insight, we propose UNSSOR, an algorithm for \(\textbf{u}\)nsupervised \(\textbf{n}\)eural \(\textbf{s}\)peech \(\textbf{s}\)eparation by leveraging \(\textbf{o}\)ver-determined training mixtu\(\textbf{r}\)es. At each training step, we feed an input mixture to a deep neural network (DNN) to produce an intermediate estimate for each speaker, linearly filter the estimates, and optimize a loss so that, at each microphone, the filtered estimates of all the speakers can add up to the mixture to satisfy the above constra
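
The mixture constraint described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes, purely for illustration, that signals are complex time-frequency arrays, that each linear filter is a single complex tap per frequency, and that the filters are obtained by a joint least-squares fit. The function name `mixture_constraint_loss` and the array layouts are hypothetical.

```python
import numpy as np


def mixture_constraint_loss(mix, est):
    """Mean absolute residual of the mixture constraint (illustrative only).

    mix: complex array of shape (P, T, F), mixtures at P microphones.
    est: complex array of shape (C, T, F), intermediate estimates of C speakers.

    Assumption: one complex filter tap per (speaker, microphone, frequency),
    fitted by least squares so the filtered estimates sum to each mixture.
    """
    P, T, F = mix.shape
    residual = np.zeros((P, T, F), dtype=complex)
    for f in range(F):
        A = est[:, :, f].T  # (T, C): speaker estimates at frequency f
        B = mix[:, :, f].T  # (T, P): mixtures at frequency f
        # Least-squares filters G of shape (C, P) such that A @ G approximates B.
        G, *_ = np.linalg.lstsq(A, B, rcond=None)
        residual[:, :, f] = (B - A @ G).T
    return float(np.mean(np.abs(residual)))


# Toy over-determined setup: 2 speakers, 3 microphones.
rng = np.random.default_rng(0)
C, P, T, F = 2, 3, 50, 4
sources = rng.standard_normal((C, T, F)) + 1j * rng.standard_normal((C, T, F))
gains = rng.standard_normal((P, C, F)) + 1j * rng.standard_normal((P, C, F))
mix = np.einsum("pcf,ctf->ptf", gains, sources)  # each mic sums the filtered speakers

wrong = rng.standard_normal((C, T, F)) + 1j * rng.standard_normal((C, T, F))
loss_true = mixture_constraint_loss(mix, sources)
loss_wrong = mixture_constraint_loss(mix, wrong)
```

In this toy setup the loss is near zero when the estimates match the true sources and clearly positive for unrelated estimates. In actual training the loss would be differentiated through the DNN that produces the estimates, and reverberant mixtures would call for multi-tap filters rather than the single tap assumed here.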
