Separate And Reconstruct: Asymmetric Encoder-decoder For Speech Separation
2024 · Ui-Hyeop Shin, Sangyoun Lee, Taehan Kim, et al.
Abstract
In speech separation, time-domain approaches have successfully replaced the time-frequency domain with latent sequence features from a learnable encoder. Conventionally, the features are separated into speaker-specific ones at the final stage of the network. Instead, we propose a more intuitive strategy that separates features earlier by expanding the feature sequence to the number of speakers as an extra dimension. To achieve this, an asymmetric strategy is presented in which the encoder and decoder are partitioned to perform distinct processing in separation tasks. The encoder analyzes features, and its output is split into as many sequences as there are speakers to be separated. The separated sequences are then reconstructed by the weight-shared decoder, which also performs cross-speaker processing. Without relying on speaker information, the weight-shared network in the decoder directly learns to discriminate features using a separation objective. In addition, to improve performance, tr
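The split-then-reconstruct idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: every name here (`split_by_speaker`, `shared_decoder`, the plain linear projections, the toy sizes) is a hypothetical stand-in, and the cross-speaker processing the abstract mentions is left out. It only shows the encoder output being expanded along a new speaker dimension and each stream passing through one decoder with shared weights.

```python
import random


def linear(frames, weight):
    """Apply a (out x in) weight matrix to every frame of a (T x in) sequence."""
    return [[sum(w * x for w, x in zip(row, frame)) for row in weight]
            for frame in frames]


def random_matrix(rows, cols, rng):
    """Random weights; a stand-in for learned parameters."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def split_by_speaker(features, split_weights):
    """Expand a (T x C) sequence to (J x T x C), one projection per speaker."""
    return [linear(features, w) for w in split_weights]


def shared_decoder(stream, decoder_weight):
    """Reconstruct one speaker stream; the same weights serve every speaker."""
    return linear(stream, decoder_weight)


rng = random.Random(0)
T, C, J = 5, 4, 2  # frames, channels, speakers (toy sizes, assumed)

# Stand-in for the encoder output: a single (T x C) feature sequence.
encoded = [[rng.uniform(-1.0, 1.0) for _ in range(C)] for _ in range(T)]

split_weights = [random_matrix(C, C, rng) for _ in range(J)]
decoder_weight = random_matrix(C, C, rng)  # shared across all speakers

streams = split_by_speaker(encoded, split_weights)               # J x T x C
outputs = [shared_decoder(s, decoder_weight) for s in streams]   # J x T x C
```

Because `decoder_weight` is reused for every stream, any difference between the outputs comes only from the split step, which is the asymmetry the abstract describes between an analyzing encoder and a reconstructing decoder.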
Related papers
- Single-channel Speech Separation With Auxiliary Speaker Embeddings (2019) 0.00
- Improved Speech Separation With Time-and-frequency Cross-domain Joint Embedding And Clustering (2019) 10.74
- End-to-end Speech Separation With Unfolded Iterative Phase Reconstruction (2018) 15.00
- Boosting Unknown-number Speaker Separation With Transformer Decoder-based Attractor (2024) 0.00
- Dualsep: A Light-weight Dual-encoder Convolutional Recurrent Network For Real-time In-car Speech Separation (2024) 0.00
- Directed Speech Separation For Automatic Speech Recognition Of Long Form Conversational Speech (2021) 2.26
- Multi-dimensional And Multi-scale Modeling For Speech Separation Optimized By Discriminative Learning (2023) 0.00
- Conv-TasNet: Surpassing Ideal Time-frequency Magnitude Masking For Speech Separation (2018) 24.08