Improved Speech Separation With Time-and-frequency Cross-domain Joint Embedding And Clustering
2019 · Gene-Ping Yang, Chao-I Tuan, Hung-Yi Lee, et al.
Abstract
Speech separation has been very successful with deep learning techniques. Substantial effort has been reported on approaches over the spectrogram, which is well known as the standard time-and-frequency cross-domain representation for speech signals. It is highly correlated with the phonetic structure of speech, or "how the speech sounds" when perceived by humans, but consists primarily of frequency-domain features carrying temporal behaviour. Very impressive work achieving speech separation in the time domain was reported recently, probably because waveforms in the time domain may describe the different realizations of speech more precisely than the spectrogram. In this paper, we propose a framework properly integrating the above two directions, hoping to achieve both purposes. We construct a time-and-frequency feature map by concatenating the 1-dim convolution encoded feature map (for time domain) and the spectrogram (for frequency domain), which is then processed by an embedding network and cluster
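The concatenation step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the window and hop sizes, the number of filters, the ReLU, and the random (untrained) filter bank standing in for a learned 1-D convolutional encoder are all assumptions made for illustration, and the function names are hypothetical. The sketch only shows how a time-domain feature map and a magnitude spectrogram computed over the same frames can be joined along the feature axis.

```python
import numpy as np


def frame_signal(x, win, hop):
    """Slice a 1-D waveform into overlapping frames of length `win`."""
    n_frames = 1 + (len(x) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]  # shape: (n_frames, win)


def cross_domain_features(x, win=256, hop=64, n_filters=128, seed=0):
    """Concatenate a 1-D-conv-style feature map with a magnitude spectrogram.

    All hyperparameters here are assumed values, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    frames = frame_signal(x, win, hop)

    # Time-domain branch: a 1-D convolution with kernel size `win` and
    # stride `hop` equals a matrix product of the frames with a filter bank.
    # Random filters stand in for learned ones (assumption).
    filters = rng.standard_normal((win, n_filters)) / np.sqrt(win)
    time_feat = np.maximum(frames @ filters, 0.0)  # ReLU (assumption)

    # Frequency-domain branch: magnitude spectrogram over the same frames,
    # so both branches share one time axis.
    spec = np.abs(np.fft.rfft(frames * np.hanning(win), axis=1))

    # Cross-domain feature map: (n_frames, n_filters + win // 2 + 1)
    return np.concatenate([time_feat, spec], axis=1)


x = np.random.default_rng(1).standard_normal(16000)  # 1 s of noise at 16 kHz
feat = cross_domain_features(x)
print(feat.shape)
```

Using the same framing for both branches keeps the two feature maps aligned frame by frame, which is what makes a plain concatenation along the feature axis possible.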
Related papers
- Orthonormal Embedding-based Deep Clustering For Single-channel Speech Separation (2019) · 0.00
- Single-channel Multi-speaker Separation Using Deep Clustering (2016) · 0.00
- Spatial And Spectral Deep Attention Fusion For Multi-channel Speech Separation Using Deep Embedding Features (2020) · 0.00
- Efficient Integration Of Multi-channel Information For Speaker-independent Speech Separation (2020) · 0.00
- Discriminative Learning For Monaural Speech Separation Using Deep Embedding Features (2019) · 8.60
- Separate And Reconstruct: Asymmetric Encoder-decoder For Speech Separation (2024) · 0.00
- Deep Clustering And Conventional Networks For Music Separation: Stronger Together (2016) · 14.76
- Multi-channel Speech Separation Using Deep Embedding Model With Multilayer Bootstrap Networks (2019) · 0.00