Seeking The Shape Of Sound: An Adaptive Framework For Learning Voice-face Association
2021 · Peisong Wen, Qianqian Xu, Yangbangyan Jiang, et al.
Abstract
Recent years have seen early progress in automatically learning the association between voices and faces, which has brought a new wave of studies to the computer vision community. However, most prior work along this line (a) adopts only local information to perform modality alignment and (b) ignores the diversity of learning difficulty across different subjects. In this paper, we propose a novel framework to jointly address these issues. To address (a), we propose a two-level modality alignment loss in which both global and local information are considered. Compared with existing methods, we introduce a global loss into the modality alignment process. The global component of the loss is driven by identity classification. Theoretically, we show that minimizing the loss could maximize the distance between embeddings across different identities while minimizing the distance between embeddings belonging to the same identity, in a global sense (instead of a min
Related papers
- Fuse After Align: Improving Face-voice Association Learning Via Multimodal Encoder (2024) 0.00
- Voice-face Cross-modal Matching And Retrieval: A Benchmark (2019) 0.00
- Learnable Pins: Cross-modal Embeddings For Person Identity (2018) 15.22
- See, Hear, And Read: Deep Aligned Representations (2017) 0.00
- Modest-align: Data-efficient Alignment For Vision-language Models (2025) 0.00
- Perfect Match: Improved Cross-modal Embeddings For Audio-visual Synchronisation (2018) 14.19
- A Multi-level Alignment Training Scheme For Video-and-language Grounding (2022) 3.58
- Tokenflow: Rethinking Fine-grained Cross-modal Alignment In Vision-language Retrieval (2022) 0.00