A Purely End-to-end System For Multi-speaker Speech Recognition
2018 · Hiroshi Seki, Takaaki Hori, Shinji Watanabe, et al.
Abstract
Recently, there has been growing interest in multi-speaker speech recognition, where the utterances of multiple speakers are recognized from their mixture. Promising techniques have been proposed for this task, but earlier works have required additional training data such as isolated source signals or senone alignments for effective learning. In this paper, we propose a new sequence-to-sequence framework to directly decode multiple label sequences from a single speech sequence by unifying source separation and speech recognition functions in an end-to-end manner. We further propose a new objective function to improve the contrast between the hidden vectors to avoid generating similar hypotheses. Experimental results show that the model is directly able to learn a mapping from a speech mixture to multiple label sequences, achieving 83.1% relative improvement compared to a model trained without the proposed objective. Interestingly, the results are comparable to those produced by previo
Related papers
- End-to-end Monaural Multi-speaker ASR System Without Pretraining (2018)
- MIMO-SPEECH: End-to-end Multi-channel Multi-speaker Speech Recognition (2019)
- End-to-end Multi-speaker Speech Recognition Using Speaker Embeddings And Transfer Learning (2019)
- Single-channel Multi-talker Speech Recognition With Permutation Invariant Training (2017)
- Hypothesis Clustering And Merging: Novel Multitalker Speech Recognition With Speaker Tokens (2024)
- Multi-label Training For Text-independent Speaker Identification (2022)
- Joint Speaker Encoder And Neural Back-end Model For Fully End-to-end Automatic Speaker Verification With Multiple Enrollment Utterances (2022)
- Time-domain Speech Extraction With Spatial Information And Multi Speaker Conditioning Mechanism (2021)