Single-channel Multi-talker Speech Recognition With Permutation Invariant Training
2017 · Yanmin Qian, Xuankai Chang, Dong Yu
Abstract
Although great progress has been made in automatic speech recognition (ASR), significant performance degradation is still observed when recognizing multi-talker mixed speech. In this paper, we propose and evaluate several architectures to address this problem under the assumption that only a single channel of mixed signal is available. Our technique extends permutation invariant training (PIT) by introducing a front-end feature separation module trained with the minimum mean square error (MSE) criterion and a back-end recognition module trained with the minimum cross entropy (CE) criterion. More specifically, during training we compute the average MSE or CE over the whole utterance for each possible utterance-level output-target assignment, pick the one with the minimum MSE or CE, and optimize for that assignment. This strategy elegantly solves the label permutation problem observed in deep-learning-based multi-talker mixed speech separation and recognition systems. The proposed architectur
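The utterance-level assignment search described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `pit_mse`, the array layout (streams, frames, feature dimension), and the toy data are assumptions made for illustration, and a real system would compute this loss on neural network outputs inside a training loop. The CE variant would follow the same pattern with a cross-entropy term in place of the squared error.

```python
from itertools import permutations

import numpy as np


def pit_mse(outputs, targets):
    """Utterance-level permutation invariant MSE (illustrative sketch).

    outputs, targets: arrays of shape (S, T, D), where S is the number of
    output streams / talkers, T the number of frames, and D the feature
    dimension. This layout is an assumption, not taken from the paper.

    Returns (minimum average MSE, best permutation), where best_perm[s]
    is the index of the target assigned to output stream s.
    """
    outputs = np.asarray(outputs, dtype=float)
    targets = np.asarray(targets, dtype=float)
    num_streams = outputs.shape[0]

    best_loss, best_perm = np.inf, None
    # Try every output-target assignment; each one is scored over the
    # whole utterance, so the assignment is fixed across all frames.
    for perm in permutations(range(num_streams)):
        loss = np.mean((outputs - targets[list(perm)]) ** 2)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy example: two talkers whose streams come out in swapped order.
rng = np.random.default_rng(0)
toy_targets = rng.normal(size=(2, 50, 40))
toy_outputs = toy_targets[::-1] + 0.01 * rng.normal(size=(2, 50, 40))
toy_loss, toy_perm = pit_mse(toy_outputs, toy_targets)
```

In the toy example the selected assignment pairs output stream 0 with target 1 and output stream 1 with target 0, so the loss stays small even though the streams are swapped. The exhaustive search grows factorially with the number of talkers, which is manageable for the two- or three-talker case but would need a different strategy for many streams.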
Related papers
- Recognizing Multi-talker Speech With Permutation Invariant Training (2017) 12.81
- Permutation Invariant Training Of Deep Models For Speaker-independent Multi-talker Speech Separation (2016) 0.00
- Single-channel Speech Separation Using Soft-minimum Permutation Invariant Training (2021) 2.26
- Multi-talker Speech Separation With Utterance-level Permutation Invariant Training Of Deep Recurrent Neural Networks (2017) 20.90
- Interrupted And Cascaded Permutation Invariant Training For Speech Separation (2019) 4.52
- Probabilistic Permutation Invariant Training For Speech Separation (2019) 7.81
- Separating Long-form Speech With Group-wise Permutation Invariant Training (2021) 4.52
- Progressive Joint Modeling In Unsupervised Single-channel Overlapped Speech Recognition (2017) 11.67