STHG: Spatial-temporal Heterogeneous Graph Learning For Advanced Audio-visual Diarization
2023 · Kyle Min
Abstract
This report introduces STHG, our method for the Audio-Visual Diarization task of the Ego4D Challenge 2023. Our key innovation is to model all the speakers in a video with a single, unified heterogeneous graph learning framework. Unlike previous approaches, which require a separate component solely for the camera wearer, STHG jointly detects the speech activities of all people, including the camera wearer. Our final method obtains a diarization error rate (DER, lower is better) of 61.1% on the Ego4D test set, significantly outperforming all the baselines as well as last year's winner. Our submission achieved 1st place in the Ego4D Challenge 2023. We additionally demonstrate that applying an off-the-shelf speech recognition system to the speech segments diarized by STHG yields competitive performance on the Speech Transcription task of this challenge.
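For readers unfamiliar with the metric, DER is commonly defined as the sum of missed speech, false-alarm speech, and speaker confusion, divided by the total reference speech time. The following is a minimal frame-level sketch of that definition, not the challenge's official scorer: it assumes at most one speaker per frame, assumes hypothesis labels have already been mapped to reference labels, and applies no forgiveness collar. The function name `frame_level_der` and the toy label sequences are illustrative only.

```python
from typing import Optional, Sequence


def frame_level_der(
    reference: Sequence[Optional[str]],
    hypothesis: Sequence[Optional[str]],
) -> float:
    """Simplified frame-level DER; None marks a non-speech frame.

    Assumption: labels in `hypothesis` are already mapped to the
    speaker names used in `reference`.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have equal length")

    missed = false_alarm = confusion = speech = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref is not None:
            speech += 1              # frame counts toward reference speech
            if hyp is None:
                missed += 1          # speech present, nothing detected
            elif hyp != ref:
                confusion += 1       # speech detected, wrong speaker
        elif hyp is not None:
            false_alarm += 1         # speech detected where there is none

    if speech == 0:
        raise ValueError("reference contains no speech frames")
    return (missed + false_alarm + confusion) / speech


# Toy example: 8 frames, two speakers "A" and "B".
reference = ["A", "A", "A", None, "B", "B", None, None]
hypothesis = ["A", "A", None, "A", "B", "A", None, None]
der = frame_level_der(reference, hypothesis)
print(f"DER = {der:.1%}")
```

In the toy example there is one missed frame, one false-alarm frame, and one confused frame against five reference speech frames. Because false alarms are counted in the numerator but not the denominator, DER can exceed 100%.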
Related papers
- Exploring Detection-based Method For Speaker Diarization @ Ego4D Audio-only Diarization Challenge 2022 (2022)
- Audio-visual Speaker Diarization Based On Spatiotemporal Bayesian Fusion (2016)
- Late Audio-visual Fusion For In-the-wild Speaker Diarization (2022)
- Supervised Hierarchical Clustering Using Graph Neural Networks For Speaker Diarization (2023)
- The HUAWEI Speaker Diarisation System For The Voxceleb Speaker Diarisation Challenge (2020)
- Geodesic Interpolation Of Frame-wise Speaker Embeddings For The Diarization Of Meeting Scenarios (2024)
- Joint Training Or Not: An Exploration Of Pre-trained Speech Models In Audio-visual Speaker Diarization (2023)
- Low-latency Online Speaker Diarization With Graph-based Label Generation (2021)