End-to-end Supervised Hierarchical Graph Clustering For Speaker Diarization
2024 · Prachi Singh, Sriram Ganapathy
Abstract
Speaker diarization, the task of segmenting an audio recording based on speaker identity, constitutes an important speech pre-processing step for several downstream applications. The conventional approach to diarization involves multiple steps of embedding extraction and clustering, which are often optimized in an isolated fashion. While end-to-end diarization systems attempt to learn a single model for the task, they are often cumbersome to train and require large supervised datasets. In this paper, we propose an end-to-end supervised hierarchical clustering algorithm based on graph neural networks (GNN), called End-to-end Supervised HierARchical Clustering (E-SHARC). The embedding extractor is initialized using a pre-trained x-vector model while the GNN model is trained initially using the x-vector embeddings from the pre-trained model. Finally, the E-SHARC model uses the front-end mel-filterbank features as input and jointly optimizes the embedding extractor and the GNN clustering mo
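For orientation, the clustering stage of the conventional two-step pipeline mentioned in the abstract can be sketched as plain agglomerative clustering over segment embeddings. This is not the E-SHARC model: it uses no GNN and no learned components. The function names, the average-linkage rule, the `stop_threshold` value, and the toy 3-dimensional "embeddings" are all assumptions made for illustration.

```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between row vectors."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    return unit @ unit.T


def agglomerative_cluster(embeddings, stop_threshold=0.5):
    """Average-linkage agglomerative clustering (illustrative only).

    Repeatedly merges the two most similar clusters until the best
    available similarity falls below stop_threshold (an assumed value).
    """
    sim = cosine_similarity_matrix(embeddings)
    clusters = [[i] for i in range(len(embeddings))]
    while len(clusters) > 1:
        best_score, best_pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average similarity over all cross-cluster segment pairs.
                score = sim[np.ix_(clusters[a], clusters[b])].mean()
                if score > best_score:
                    best_score, best_pair = score, (a, b)
        if best_score < stop_threshold:
            break
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = np.empty(len(embeddings), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels


# Toy stand-ins for segment embeddings from two speakers (hypothetical data).
rng = np.random.default_rng(0)
speaker_a = np.array([1.0, 0.0, 0.0]) + 0.05 * rng.standard_normal((4, 3))
speaker_b = np.array([0.0, 1.0, 0.0]) + 0.05 * rng.standard_normal((4, 3))
segment_embeddings = np.vstack([speaker_a, speaker_b])

labels = agglomerative_cluster(segment_embeddings)
print(labels)
```

In this sketch the embeddings are fixed before clustering begins, so the two stages are optimized in isolation, which is the limitation the abstract attributes to the conventional approach.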
Related papers
- Supervised Hierarchical Clustering Using Graph Neural Networks For Speaker Diarization (2023) 0.00
- Deep Self-supervised Hierarchical Clustering For Speaker Diarization (2020) 5.24
- Integrating End-to-end Neural And Clustering-based Diarization: Getting The Best Of Both Worlds (2020) 13.74
- Advances In Integration Of End-to-end Neural And Clustering-based Diarization For Real Conversational Speech (2021) 16.48
- End-to-end Speaker Diarization As Post-processing (2020) 11.08
- End-to-end Neural Diarization: Reformulating Speaker Diarization As Simple Multi-label Classification (2020) 0.00
- Speaker Diarization Using Two-pass Leave-one-out Gaussian PLDA Clustering Of DNN Embeddings (2021) 2.26
- Tight Integration Of Neural- And Clustering-based Diarization Through Deep Unfolding Of Infinite Gaussian Mixture Model (2022) 8.60