Recognizing Multi-talker Speech With Permutation Invariant Training
2017 · Dong Yu, Xuankai Chang, Yanmin Qian
Abstract
In this paper, we propose a novel technique for direct recognition of multiple speech streams given the single channel of mixed speech, without first separating them. Our technique is based on permutation invariant training (PIT) for automatic speech recognition (ASR). In PIT-ASR, we compute the average cross entropy (CE) over all frames in the whole utterance for each possible output-target assignment, pick the one with the minimum CE, and optimize for that assignment. PIT-ASR forces all the frames of the same speaker to be aligned with the same output layer. This strategy elegantly solves the label permutation problem and speaker tracing problem in one shot. Our experiments on artificially mixed AMI data showed that the proposed approach is very promising.
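The assignment search described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`average_cross_entropy`, `pit_loss`), the array shapes, and the choice to average the per-stream CE across output layers are assumptions made for illustration. For each permutation of speakers over output layers, the sketch computes the utterance-level average CE and keeps the permutation with the minimum value, which is the one a training step would then optimize.

```python
import itertools

import numpy as np


def average_cross_entropy(log_probs, labels):
    """Mean frame-level cross entropy of one output layer against one label stream.

    log_probs: (T, C) array of log-posteriors over C classes for T frames.
    labels:    (T,) array of integer class ids.
    """
    frames = np.arange(labels.shape[0])
    return -log_probs[frames, labels].mean()


def pit_loss(log_probs, labels):
    """Utterance-level permutation invariant loss (illustrative sketch).

    log_probs: (S, T, C) log-posteriors from S output layers.
    labels:    (S, T) frame labels for S speakers.
    Returns (min_loss, best_perm), where best_perm[i] is the index of the
    speaker whose labels are assigned to output layer i.
    """
    num_streams = log_probs.shape[0]
    best_loss, best_perm = np.inf, None
    # Enumerate every output-target assignment (S! permutations).
    for perm in itertools.permutations(range(num_streams)):
        # Assumption: average the per-stream CE across output layers.
        loss = np.mean([
            average_cross_entropy(log_probs[i], labels[perm[i]])
            for i in range(num_streams)
        ])
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return float(best_loss), best_perm
```

Because the assignment is chosen once per utterance rather than once per frame, every frame of a given speaker is tied to the same output layer. The exhaustive search grows factorially with the number of streams, so this sketch is only practical for a small number of simultaneous talkers.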
Related papers
- Single-channel Multi-talker Speech Recognition With Permutation Invariant Training (2017)
- Permutation Invariant Training Of Deep Models For Speaker-independent Multi-talker Speech Separation (2016)
- Multi-talker Speech Separation With Utterance-level Permutation Invariant Training Of Deep Recurrent Neural Networks (2017)
- Separating Long-form Speech With Group-wise Permutation Invariant Training (2021)
- Single-channel Speech Separation Using Soft-minimum Permutation Invariant Training (2021)
- Interrupted And Cascaded Permutation Invariant Training For Speech Separation (2019)
- Probabilistic Permutation Invariant Training For Speech Separation (2019)
- Transcription-free Fine-tuning Of Speech Separation Models For Noisy And Reverberant Multi-speaker Automatic Speech Recognition (2024)