Perfect Match: Improved Cross-modal Embeddings For Audio-visual Synchronisation
2018 · Soo-Whan Chung, Joon Son Chung, Hong-Goo Kang
Abstract
This paper proposes a new strategy for learning powerful cross-modal embeddings for audio-to-video synchronization. Here, we set up the problem as one of cross-modal retrieval, where the objective is to find the most relevant audio segment given a short video clip. The method builds on the recent advances in learning representations from cross-modal self-supervision. The main contributions of this paper are as follows: (1) we propose a new learning strategy where the embeddings are learnt via a multi-way matching problem, as opposed to a binary classification (matching or non-matching) problem as proposed by recent papers; (2) we demonstrate that performance of this method far exceeds the existing baselines on the synchronization task; (3) we use the learnt embeddings for visual speech recognition in self-supervision, and show that the performance matches the representations learnt end-to-end in a fully-supervised manner.
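The difference between multi-way matching and binary classification in contribution (1) can be illustrated with a small sketch. It assumes one video embedding is compared against N candidate audio embeddings, that the negative Euclidean distance serves as the logit, and that the loss is a softmax cross-entropy over the candidates. The function names, the choice of logit, and the toy data are illustrative assumptions, not details taken from the abstract.

```python
import numpy as np

def multiway_matching_loss(video_emb, audio_embs, target):
    """Cross-entropy over N candidate audio segments for one video clip."""
    # Euclidean distance from the video embedding to each audio candidate.
    dists = np.linalg.norm(audio_embs - video_emb, axis=1)
    # Assumption: negative distance is the logit (closer means more likely).
    logits = -dists
    logits = logits - logits.max()  # shift for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[target])

def retrieve(video_emb, audio_embs):
    """Index of the audio candidate nearest to the video embedding."""
    return int(np.argmin(np.linalg.norm(audio_embs - video_emb, axis=1)))

# Toy data: 5 audio candidates in an 8-dimensional embedding space.
rng = np.random.default_rng(0)
video = rng.normal(size=8)
audio = rng.normal(size=(5, 8))
audio[2] = video + 0.01 * rng.normal(size=8)  # candidate 2 is the in-sync one

loss_match = multiway_matching_loss(video, audio, target=2)
loss_wrong = multiway_matching_loss(video, audio, target=0)
best = retrieve(video, audio)
```

In this toy setup the loss is lower when the target is the nearby candidate, and retrieval reduces to a nearest-neighbour lookup. A binary formulation would instead score each (video, audio) pair in isolation, without comparing the candidates against one another.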
Related papers
- Cross-modal Embeddings For Video And Audio Retrieval (2018)
- Cross-modal Audio-visual Co-learning For Text-independent Speaker Verification (2023)
- Looking Into Your Speech: Learning Cross-modal Affinity For Audio-visual Speech Separation (2021)
- Improving Audio-visual Speech Recognition By Lip-subword Correlation Based Visual Pre-training And Cross-modal Fusion Encoder (2023)
- SyncVSR: Data-efficient Visual Speech Recognition With End-to-end Crossmodal Audio Token Synchronization (2024)
- Unified Video-language Pre-training With Synchronized Audio (2024)
- MMAudio: Taming Multimodal Joint Training For High-quality Video-to-audio Synthesis (2024)
- Multi-modal Multi-correlation Learning For Audio-visual Speech Separation (2022)