Two-stage Triplet Loss Training With Curriculum Augmentation For Audio-visual Retrieval
2023 · Donghuo Zeng, Kazushi Ikeda
Abstract
Cross-modal retrieval models leverage triplet loss optimization to learn robust embedding spaces. However, existing methods often train these models in a single pass and overlook the distinction between semi-hard and hard triplets during optimization, which leads to suboptimal model performance. In this paper, we introduce a novel approach rooted in curriculum learning to address this problem. We propose a two-stage training paradigm that guides the model's learning process from semi-hard to hard triplets. In the first stage, the model is trained with a set of semi-hard triplets, starting from a low-loss base. Subsequently, in the second stage, we augment the embeddings using an interpolation technique. This process identifies potential hard negatives, alleviating issues arising from high-loss functions due to a scarcity of hard triplets. Our approach then applies hard triplet mining in the aug
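The two stages described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the distance, the mining thresholds, or the interpolation scheme, so the sketch assumes squared Euclidean distance, the common semi-hard condition d(a,p) < d(a,n) < d(a,p) + margin, a hard condition d(a,n) < d(a,p), and linear interpolation between same-label embeddings. The function names `sq_dists`, `mine_negative`, and `interpolate_within_class` are hypothetical.

```python
import numpy as np


def sq_dists(a, b):
    """Squared Euclidean distances between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sum(diff ** 2, axis=-1)


def mine_negative(anchor, positive, labels, cands, cand_labels, margin, mode):
    """Pick one negative index per anchor, or -1 when no candidate qualifies.

    mode="semi-hard": d(a,p) < d(a,n) < d(a,p) + margin   (assumed definition)
    mode="hard":      d(a,n) < d(a,p)                     (assumed definition)
    """
    d_ap = np.sum((anchor - positive) ** 2, axis=1)[:, None]
    d_an = sq_dists(anchor, cands)
    # A candidate is a negative only if its label differs from the anchor's.
    ok = labels[:, None] != cand_labels[None, :]
    if mode == "semi-hard":
        ok &= (d_an > d_ap) & (d_an < d_ap + margin)
    elif mode == "hard":
        ok &= d_an < d_ap
    else:
        raise ValueError(f"unknown mode: {mode}")
    # Among qualifying candidates, take the one closest to the anchor.
    idx = np.argmin(np.where(ok, d_an, np.inf), axis=1)
    idx[~ok.any(axis=1)] = -1
    return idx


def interpolate_within_class(emb, labels, rng):
    """Assumed augmentation: mix each embedding with a random same-label one."""
    new = np.empty_like(emb, dtype=float)
    for i, y in enumerate(labels):
        j = rng.choice(np.flatnonzero(labels == y))
        lam = rng.uniform(0.0, 1.0)
        new[i] = lam * emb[i] + (1.0 - lam) * emb[j]
    return new, labels.copy()
```

Under these assumptions, stage one would call `mine_negative(..., mode="semi-hard")` on the original audio and visual embeddings, and stage two would pass the output of `interpolate_within_class` as the candidate pool to `mine_negative(..., mode="hard")`, so that interpolated points can serve as hard negatives when the original batch contains few of them.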
Related papers
- Complete Cross-triplet Loss In Label Space For Audio-visual Cross-modal Retrieval (2022) 5.84
- Triplet Entropy Loss: Improving The Generalisation Of Short Speech Language Identification Systems (2020) 0.00
- Metric Learning With Progressive Self-distillation For Audio-visual Embedding Learning (2025) 3.58
- Learning Efficient Representations For Keyword Spotting With Triplet Loss (2021) 11.76
- Generative Data Augmentation Guided By Triplet Loss For Speech Emotion Recognition (2022) 3.58
- Estimated Audio-caption Correspondences Improve Language-based Audio Retrieval (2024) 0.00
- Semi Supervised Learning For Few-shot Audio Classification By Episodic Triplet Mining (2021) 0.00
- Scenario Aware Speech Recognition: Advancements For Apollo Fearless Steps & Chime-4 Corpora (2021) 5.84