Improving Multi-scale Aggregation Using Feature Pyramid Module For Robust Speaker Verification Of Variable-duration Utterances
2020 · Youngmoon Jung, Seong Min Kye, Yeunju Choi, et al.
Abstract
Currently, the most widely used approach to speaker verification is deep speaker embedding learning. In this approach, a speaker embedding vector is obtained by pooling single-scale features extracted from the last layer of a speaker feature extractor. Multi-scale aggregation (MSA), which uses multi-scale features from different layers of the feature extractor, has recently been introduced and shows superior performance on variable-duration utterances. To increase robustness when dealing with utterances of arbitrary duration, this paper improves MSA by using a feature pyramid module. The module enhances the speaker-discriminative information of features from multiple layers via a top-down pathway and lateral connections. We extract speaker embeddings from the enhanced features, which contain rich speaker information at different time scales. Experiments on the VoxCeleb dataset show that the proposed module improves previous MSA methods with a smaller number of paramete
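The top-down pathway and lateral connections described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes each layer's output is a 1-D sequence of frames (lists of channel values), uses a per-frame linear map as the lateral connection, nearest-neighbour upsampling for the top-down pathway, and plain mean pooling followed by concatenation as the aggregation step. All function names and shapes here are hypothetical.

```python
def project(frames, weight):
    """Lateral connection (assumed): per-frame linear map, like a 1x1 convolution."""
    return [[sum(w * x for w, x in zip(row, frame)) for row in weight]
            for frame in frames]


def upsample(frames, target_len):
    """Top-down step (assumed): nearest-neighbour upsampling along time."""
    src_len = len(frames)
    return [frames[min(t * src_len // target_len, src_len - 1)]
            for t in range(target_len)]


def add(a, b):
    """Element-wise sum of two equally shaped frame sequences."""
    return [[x + y for x, y in zip(fa, fb)] for fa, fb in zip(a, b)]


def mean_pool(frames):
    """Average over time, giving one fixed-size vector per scale."""
    return [sum(channel) / len(frames) for channel in zip(*frames)]


def feature_pyramid(features, lateral_weights):
    """Enhance features ordered from shallow (long) to deep (short).

    The deepest level is passed through its lateral projection only; every
    shallower level is its lateral projection plus the upsampled level above.
    """
    laterals = [project(f, w) for f, w in zip(features, lateral_weights)]
    enhanced = [laterals[-1]]
    for lateral in reversed(laterals[:-1]):
        top_down = upsample(enhanced[0], len(lateral))
        enhanced.insert(0, add(lateral, top_down))
    return enhanced


def embed(features, lateral_weights):
    """Pool every enhanced scale and concatenate into one embedding."""
    embedding = []
    for level in feature_pyramid(features, lateral_weights):
        embedding.extend(mean_pool(level))
    return embedding


# Toy input: a shallow layer with 4 frames and a deep layer with 2 frames,
# one channel each, with identity lateral weights.
toy_features = [[[1.0]] * 4, [[10.0]] * 2]
toy_weights = [[[1.0]], [[1.0]]]
print(embed(toy_features, toy_weights))  # prints [11.0, 10.0]
```

In the toy run, the shallow scale picks up the deep scale's value through the top-down pathway (1 + 10 = 11), while the deepest scale is unchanged. Because each scale is pooled over time before concatenation, the embedding size does not depend on utterance length.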
Related papers
- A Unified Deep Learning Framework For Short-duration Speaker Verification In Adverse Environments (2020)
- Self-attentive Multi-layer Aggregation With Feature Recalibration And Normalization For End-to-end Speaker Verification System (2020)
- Improving Speaker Representations Using Contrastive Losses On Multi-scale Features (2024)
- MFA-Conformer: Multi-scale Feature Aggregation Conformer For Automatic Speaker Verification (2022)
- RawNeXt: Speaker Verification System For Variable-duration Utterances With Deep Layer Aggregation And Extended Dynamic Scaling Policies (2021)
- Double Multi-head Attention For Speaker Verification (2020)
- Deep Speaker Embedding Learning With Multi-level Pooling For Text-independent Speaker Verification (2019)
- Speaker Verification In Multi-speaker Environments Using Temporal Feature Fusion (2022)