Learnable PINs: Cross-modal Embeddings For Person Identity
2018 · Arsha Nagrani, Samuel Albanie, Andrew Zisserman
Abstract
We propose and investigate an identity-sensitive joint embedding of face and voice. Such an embedding enables cross-modal retrieval from voice to face and from face to voice. We make the following four contributions: first, we show that the embedding can be learnt from videos of talking faces, without requiring any identity labels, using a form of cross-modal self-supervision; second, we develop a curriculum learning schedule for hard negative mining targeted to this task, which is essential for learning to proceed successfully; third, we demonstrate and evaluate cross-modal retrieval for identities unseen and unheard during training over a number of scenarios, and establish a benchmark for this novel task; finally, we show an application of the joint embedding to automatically retrieving and labelling characters in TV dramas.
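As a rough illustration of the two ideas the abstract names, the sketch below ranks faces by distance to a voice query in a shared embedding space, and picks a negative whose difficulty is set by a curriculum parameter. It is a minimal toy under stated assumptions, not the authors' implementation: the 2-D vectors stand in for the outputs of trained face and voice networks, and the function names (`retrieve`, `pick_negative`) and the `hardness` parameter are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so distances are comparable."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def distance(a, b):
    """Euclidean distance between two embeddings of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve(query, gallery):
    """Return gallery indices ordered from nearest to farthest from the query."""
    return sorted(range(len(gallery)), key=lambda i: distance(query, gallery[i]))


def pick_negative(anchor, candidates, hardness):
    """Pick a negative by difficulty (assumed scheme, for illustration only).

    hardness = 0.0 selects the easiest negative (farthest from the anchor),
    hardness = 1.0 selects the hardest one (nearest to the anchor).
    A curriculum would raise hardness gradually during training.
    """
    easiest_first = list(reversed(retrieve(anchor, candidates)))
    position = round(hardness * (len(easiest_first) - 1))
    return easiest_first[position]


# Toy embeddings (assumed values): three faces and one voice query.
faces = [l2_normalize(v) for v in ([1.0, 0.1], [0.0, 1.0], [-1.0, 0.2])]
voice_query = l2_normalize([0.9, 0.2])

# Voice-to-face retrieval: nearest face first.
ranking = retrieve(voice_query, faces)

# Negatives are the faces that do not belong to the query's clip.
negatives = faces[1:]
easy_negative = pick_negative(voice_query, negatives, hardness=0.0)
hard_negative = pick_negative(voice_query, negatives, hardness=1.0)
```

Face-to-voice retrieval works the same way with the roles swapped, since both modalities live in one space. Starting with easy negatives and moving toward hard ones is the general shape of a curriculum for negative mining; the specific schedule here is left open.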
Related papers
- Towards Identity-aware Cross-modal Retrieval: A Dataset And A Baseline (2024)
- Seeking The Shape Of Sound: An Adaptive Framework For Learning Voice-face Association (2021)
- Fuse After Align: Improving Face-voice Association Learning Via Multimodal Encoder (2024)
- Perfect Match: Improved Cross-modal Embeddings For Audio-visual Synchronisation (2018)
- Cross-modal Embeddings For Video And Audio Retrieval (2018)
- Transcription-enriched Joint Embeddings For Spoken Descriptions Of Images And Videos (2020)
- Voice-face Cross-modal Matching And Retrieval: A Benchmark (2019)
- Deep Latent Space Learning For Cross-modal Mapping Of Audio And Visual Signals (2019)