Perfect Match: Improved Cross-modal Embeddings For Audio-visual Synchronisation
2018 · Soo-Whan Chung, Joon Son Chung, Hong-Goo Kang
Abstract
This paper proposes a new strategy for learning powerful cross-modal embeddings for audio-to-video synchronization. Here, we set up the problem as one of cross-modal retrieval, where the objective is to find the most relevant audio segment given a short video clip. The method builds on the recent advances in learning representations from cross-modal self-supervision. The main contributions of this paper are as follows: (1) we propose a new learning strategy where the embeddings are learnt via a multi-way matching problem, as opposed to a binary classification (matching or non-matching) problem as proposed by recent papers; (2) we demonstrate that performance of this method far exceeds the existing baselines on the synchronization task; (3) we use the learnt embeddings for visual speech recognition in self-supervision, and show that the performance matches the representations learnt end-to-end in a fully-supervised manner.
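To make the contrast in contribution (1) concrete, here is a minimal sketch of a multi-way matching objective: one video clip is scored against N candidate audio segments, and a single cross-entropy loss is taken over all N candidates instead of N separate match/non-match decisions. This is an illustration, not the authors' implementation. The function name `multiway_matching_loss`, the use of negative Euclidean distance as the score, the embedding size, and the random toy data are all assumptions not stated in the abstract.

```python
import numpy as np

def multiway_matching_loss(video_emb, audio_embs, target_idx):
    """Cross-entropy over N candidate audio segments for one video clip.

    Hypothetical helper for illustration; the scoring rule is an assumption.
    """
    # Euclidean distance from the video embedding to each audio candidate
    dists = np.linalg.norm(audio_embs - video_emb, axis=1)
    # Assumed scoring rule: a smaller distance gives a larger score
    logits = -dists
    # Numerically stable log-softmax over the N candidates
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    # Return the loss for the true candidate and the retrieved index
    return -log_probs[target_idx], int(np.argmax(logits))

# Toy data: 5 candidate audio embeddings, one of which nearly matches the video
rng = np.random.default_rng(0)
video = rng.normal(size=16)
audio = rng.normal(size=(5, 16))
audio[2] = video + 0.01 * rng.normal(size=16)  # candidate 2 is the true match

loss, predicted = multiway_matching_loss(video, audio, target_idx=2)
print(f"loss={loss:.4f}, predicted index={predicted}")
```

Under this formulation, retrieval is an argmax over the same scores used for training, which is one way to read the abstract's framing of synchronisation as cross-modal retrieval.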
Related papers
- Cross-modal Embeddings For Video And Audio Retrieval (2018)
- Video And Audio Are Images: A Cross-modal Mixer For Original Data On Video-audio Retrieval (2023)
- Audio-visual Embedding For Cross-modal Musicvideo Retrieval Through Supervised Deep CCA (2019)
- Deep Latent Space Learning For Cross-modal Mapping Of Audio And Visual Signals (2019)
- Fuse After Align: Improving Face-voice Association Learning Via Multimodal Encoder (2024)
- Multiscale Matching Driven By Cross-modal Similarity Consistency For Audio-text Retrieval (2024)
- VMCML: Video And Music Matching Via Cross-modality Lifting (2023)
- Maximal Matching Matters: Preventing Representation Collapse For Robust Cross-modal Retrieval (2025)