The Multimodal Information Based Speech Processing (MISP) 2023 Challenge: Audio-visual Target Speaker Extraction
2023 · Shilong Wu, Chenxi Wang, Hang Chen, et al.
Abstract
Previous Multimodal Information based Speech Processing (MISP) challenges mainly focused on audio-visual speech recognition (AVSR) with commendable success. However, the most advanced back-end recognition systems often hit performance limits due to the complex acoustic environments. This has prompted a shift in focus towards the Audio-Visual Target Speaker Extraction (AVTSE) task for the MISP 2023 challenge in ICASSP 2024 Signal Processing Grand Challenges. Unlike existing audio-visual speech enhancement challenges primarily focused on simulation data, the MISP 2023 challenge uniquely explores how front-end speech processing, combined with visual clues, impacts back-end tasks in real-world scenarios. This pioneering effort aims to set the first benchmark for the AVTSE task, offering fresh insights into enhancing the accuracy of back-end speech recognition systems through AVTSE in challenging and real acoustic environments. This paper delivers a thorough overview of the task setting,
Related papers
- An Audio-quality-based Multi-strategy Approach For Target Speaker Extraction In The MISP 2023 Challenge (2024)
- The NPU-ASLP System For Audio-visual Speech Recognition In MISP 2022 Challenge (2023)
- The Flyspeech Audio-visual Speaker Diarization System For MISP Challenge 2022 (2023)
- Challenges And Insights: Exploring 3D Spatial Features And Complex Networks On The MISP Dataset (2023)
- MLCA-AVSR: Multi-layer Cross Attention Fusion Based Audio-visual Speech Recognition (2024)
- Multi-input Multi-output Target-speaker Voice Activity Detection For Unified, Flexible, And Robust Audio-visual Speaker Diarization (2024)
- Enhancing Real-world Active Speaker Detection With Multi-modal Extraction Pre-training (2024)
- Robust End-to-end Deep Audiovisual Speech Recognition (2016)