Metric Learning With Progressive Self-distillation For Audio-visual Embedding Learning
2025 · Donghuo Zeng, Kazushi Ikeda
Abstract
Metric learning projects samples into an embedded space, where similarities and dissimilarities are quantified based on their learned representations. However, existing methods often rely on label-guided representation learning, where representations of different modalities, such as audio and visual data, are aligned based on annotated labels. This approach tends to underutilize latent complex features and potential relationships inherent in the distributions of audio and visual data that are not directly tied to the labels, resulting in suboptimal performance in audio-visual embedding learning. To address this issue, we propose a novel architecture that integrates cross-modal triplet loss with progressive self-distillation. Our method enhances representation learning by leveraging inherent distributions and dynamically refining soft audio-visual alignments -- probabilistic alignments between audio and visual data that capture the inherent relationships beyond explicit labels. Specific
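As a rough illustration of the two ingredients the abstract names, the sketch below pairs a cross-modal triplet loss (audio anchor, visual positive and negative) with soft audio-visual alignments that are blended into the hard label targets. It is not the authors' implementation: the Euclidean distance, the `margin` and `temperature` values, the blending weight `alpha`, and every function name are assumptions made for illustration only.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cross_modal_triplet_loss(audio_anchor, visual_pos, visual_neg, margin=1.0):
    """Hinge-style triplet loss with an audio anchor and visual positive/negative.

    The loss is zero once the positive is closer to the anchor than the
    negative by at least `margin` (an assumed hyperparameter).
    """
    d_pos = euclidean(audio_anchor, visual_pos)
    d_neg = euclidean(audio_anchor, visual_neg)
    return max(0.0, d_pos - d_neg + margin)


def soft_alignment(audio, visuals, temperature=1.0):
    """Softmax over negative distances: a probabilistic audio-to-visual alignment."""
    scores = [-euclidean(audio, v) / temperature for v in visuals]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def blend_targets(hard, soft, alpha):
    """Mix one-hot label targets with soft alignments.

    `alpha` in [0, 1] is a hypothetical schedule value; raising it over
    training is one way to make the distillation "progressive".
    """
    return [(1 - alpha) * h + alpha * s for h, s in zip(hard, soft)]


if __name__ == "__main__":
    audio = [0.0, 0.0]
    visuals = [[0.0, 1.0], [3.0, 4.0]]
    print(cross_modal_triplet_loss(audio, visuals[0], visuals[1]))
    print(blend_targets([1.0, 0.0], soft_alignment(audio, visuals), alpha=0.5))
```

With `alpha = 0` the targets reduce to the annotated labels alone; larger values give more weight to the alignments inferred from the embeddings themselves.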
Related papers
- Self-supervised Audio-visual Speech Representations Learning By Multimodal Self-distillation (2022) 0.00
- Multi-task Metric Learning For Text-independent Speaker Verification (2020) 0.00
- Improving The Gap In Visual Speech Recognition Between Normal And Silent Speech Based On Metric Learning (2023) 0.00
- Audio Representation Learning By Distilling Video As Privileged Information (2023) 0.00
- Centroid-based Deep Metric Learning For Speaker Recognition (2019) 13.79
- Complete Cross-triplet Loss In Label Space For Audio-visual Cross-modal Retrieval (2022) 5.84
- Two-stage Triplet Loss Training With Curriculum Augmentation For Audio-visual Retrieval (2023) 0.00
- Learning Self-supervised Audio-visual Representations For Sound Recommendations (2024) 2.26