Speaker-targeted Audio-visual Models For Speech Recognition In Cocktail-party Environments
2019 · Guan-Lin Chao, William Chan, Ian Lane
Abstract
Speech recognition in cocktail-party environments remains a significant challenge for state-of-the-art speech recognition systems, as it is extremely difficult to extract the acoustic signal of an individual speaker from a background of overlapping speech with similar frequency and temporal characteristics. We propose the use of speaker-targeted acoustic and audio-visual models for this task. We complement the acoustic features in a hybrid DNN-HMM model with information about the target speaker's identity as well as visual features from the mouth region of the target speaker. Experiments were performed using simulated cocktail-party data generated from the GRID audio-visual corpus by overlapping two speakers' speech on a single acoustic channel. Our audio-only baseline achieved a WER of 26.3%. The audio-visual model reduced the WER to 4.4%. Introducing speaker identity information had an even more pronounced effect, reducing the WER to 3.6%. Combining both approaches, however, did no
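The WER (word error rate) figures quoted in the abstract can be read with the help of a small sketch. The code below is not the authors' evaluation script. It is a minimal illustration that assumes the standard definition of WER (word-level edit distance divided by the number of reference words). The function names `word_error_rate` and `relative_reduction` and the GRID-style example sentence are hypothetical and chosen only for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumes the standard WER definition; not taken from the paper itself.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction of the baseline error removed by the improved system."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Hypothetical GRID-style sentence with one substituted word.
    print(word_error_rate("bin blue at f two now", "bin blue at s two now"))

    # WER values as reported in the abstract, in percent.
    print(relative_reduction(26.3, 4.4))  # audio-visual vs. audio-only
    print(relative_reduction(26.3, 3.6))  # speaker identity vs. audio-only
```

With the reported numbers, the audio-visual model removes roughly 83% of the baseline errors, and the speaker identity information removes roughly 86%.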
Related papers
- Audio-visual Multi-channel Speech Separation, Dereverberation And Recognition (2022)
- Face Landmark-based Speaker-independent Audio-visual Speech Enhancement In Multi-talker Environments (2018)
- Audio-visual Target Speaker Enhancement On Multi-talker Environment Using Event-driven Cameras (2019)
- An Analysis Of Speech Enhancement And Recognition Losses In Limited Resources Multi-talker Single Channel Audio-visual ASR (2019)
- Audio-visual Approach For Multimodal Concurrent Speaker Detection (2024)
- Late Audio-visual Fusion For In-the-wild Speaker Diarization (2022)
- End-to-end Multi-talker Audio-visual ASR Using An Active Speaker Attention Module (2022)
- Joint Training Or Not: An Exploration Of Pre-trained Speech Models In Audio-visual Speaker Diarization (2023)