Analyzing The Impact Of Speaker Localization Errors On Speech Separation For Automatic Speech Recognition

Abstract

We investigate the effect of speaker localization on the performance of speech recognition systems in a multispeaker, multichannel environment. Given the speaker location information, speech separation is performed in three stages. In the first stage, a simple delay-and-sum (DS) beamformer enhances the signal impinging from the speaker location. In the second stage, a neural network uses this enhanced signal to estimate a time-frequency mask corresponding to the localized speaker. In the third stage, this mask is used to compute the second-order statistics and derive an adaptive beamformer. We generated a multichannel, multispeaker, reverberated, noisy dataset inspired by the well-studied WSJ0-2mix and studied the performance of the proposed pipeline in terms of the word error rate (WER). An average WER of \(29.4\)% was achieved using the ground-truth localization information and \(42.4\)% using the localization information estimated via GCC-PHAT. The signal-to-interference ratio (SIR) between th
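
To make the two classical building blocks named in the abstract concrete, the following is a minimal sketch of GCC-PHAT delay estimation followed by delay-and-sum alignment. It is not the authors' implementation: it assumes two microphones, integer-sample delays, and a short noise signal, and it uses a naive DFT from the standard library so that the block is self-contained (a real system would use an FFT library and fractional delays). The function names `dft`, `gcc_phat_delay`, and `delay_and_sum` are hypothetical.

```python
import cmath
import random


def dft(x, inverse=False):
    """Naive O(n^2) DFT; adequate for a short illustrative signal."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def gcc_phat_delay(ref, sig, max_lag):
    """Estimate the delay (in samples) of `sig` relative to `ref`."""
    n = len(ref) + len(sig)  # zero-pad to avoid circular wrap-around
    ref_f = dft(list(ref) + [0.0] * (n - len(ref)))
    sig_f = dft(list(sig) + [0.0] * (n - len(sig)))
    # Cross-power spectrum of the two channels.
    cross = [s * r.conjugate() for s, r in zip(sig_f, ref_f)]
    # PHAT weighting: discard the magnitude and keep only the phase.
    cross = [c / (abs(c) + 1e-12) for c in cross]
    cc = [v.real for v in dft(cross, inverse=True)]
    # Negative lags live at the end of the circular correlation.
    lags = range(-max_lag, max_lag + 1)
    return max(lags, key=lambda lag: cc[lag % n])


def delay_and_sum(channels, delays):
    """Advance each channel by its integer delay, then average."""
    n = len(channels[0])
    out = [0.0] * n
    for ch, d in zip(channels, delays):
        for t in range(n):
            if 0 <= t + d < n:
                out[t] += ch[t + d]
    return [v / len(channels) for v in out]


# Assumed toy setup: microphone 2 hears the source 3 samples later.
random.seed(0)
mic1 = [random.gauss(0.0, 1.0) for _ in range(64)]
mic2 = [0.0] * 3 + mic1[:-3]

estimated = gcc_phat_delay(mic1, mic2, max_lag=8)
enhanced = delay_and_sum([mic1, mic2], [0, estimated])
print("estimated delay (samples):", estimated)
```

In this sketch a wrong delay estimate would misalign the channels before averaging, which is the kind of localization error whose downstream effect on separation and recognition the abstract sets out to measure.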
