Weakly Supervised Training Of Hierarchical Attention Networks For Speaker Identification
2020 · Yanpei Shi, Qiang Huang, Thomas Hain
Abstract
Identifying multiple speakers without knowing where a speaker's voice is in a recording is a challenging task. In this paper, a hierarchical attention network is proposed to solve a weakly labelled speaker identification problem. The hierarchical structure, consisting of a frame-level encoder and a segment-level encoder, aims to learn speaker-related information both locally and globally. Speech streams are segmented into fragments. The frame-level encoder with attention learns features, highlights the target-related frames locally, and outputs a fragment-based embedding. The segment-level encoder works with a second attention layer to emphasize the fragments that are likely related to the target speakers. The global information is finally collected from the segment-level module to predict speakers via a classifier. To evaluate the effectiveness of the proposed approach, artificial datasets based on Switchboard Cellular part1 (SWBC) and VoxCeleb1 are constructed in two conditions, where speaker
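The two-level pooling described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the learned frame-level and segment-level encoders are replaced by identity mappings, the attention is a plain dot product against a fixed query vector, and the names `attention_pool`, `split_into_fragments`, and `hierarchical_embed` are hypothetical, chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, query):
    """Weight each vector by softmax(dot(vector, query)) and sum them.

    Assumption: a fixed query stands in for the learned attention layer.
    """
    scores = [sum(v_i * q_i for v_i, q_i in zip(v, query)) for v in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
    return pooled, weights


def split_into_fragments(frames, fragment_len):
    """Cut the frame sequence into consecutive fixed-length fragments."""
    return [frames[i:i + fragment_len] for i in range(0, len(frames), fragment_len)]


def hierarchical_embed(frames, fragment_len, frame_query, segment_query):
    """Frame-level pooling per fragment, then segment-level pooling over fragments."""
    fragments = split_into_fragments(frames, fragment_len)
    fragment_embeddings = [attention_pool(f, frame_query)[0] for f in fragments]
    return attention_pool(fragment_embeddings, segment_query)


# Toy input: the first two frames resemble the "target" direction [1, 0].
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
query = [4.0, 0.0]  # hypothetical query pointing at the target direction
embedding, fragment_weights = hierarchical_embed(frames, 2, query, query)
```

In this toy setup the segment-level weights favour the first fragment, which mirrors the idea of emphasizing fragments related to a target speaker. A classifier over `embedding` would follow in a full model; it is left out here because the abstract does not specify its form.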
Related papers
- T-vectors: Weakly Supervised Speaker Identification Using Hierarchical Transformer Model (2020)
- H-VECTORS: Utterance-level Speaker Embedding Using A Hierarchical Attention Model (2019)
- Self Multi-head Attention For Speaker Recognition (2019)
- Weakly Supervised Training Of Speaker Identification Models (2018)
- Staircase Network: Structural Language Identification Via Hierarchical Attentive Units (2018)
- FDN: Finite Difference Network With Hierarchical Convolutional Features For Text-independent Speaker Verification (2021)
- Frequency And Temporal Convolutional Attention For Text-independent Speaker Recognition (2019)
- Graph Attention Networks For Speaker Verification (2020)