Learning Self-supervised Audio-visual Representations For Sound Recommendations
2024 · Sudha Krishnamurthy
Abstract
We propose a novel self-supervised approach for learning audio and visual representations from unlabeled videos, based on their correspondence. The approach uses an attention mechanism to learn the relative importance of convolutional features extracted at different resolutions from the audio and visual streams, and uses the attention features to encode the audio and visual input based on their correspondence. We evaluated the representations learned by the model on two tasks: classifying audio-visual correlation and recommending sound effects for visual scenes. Our results show that the representations generated by the attention model improve the correlation accuracy by 18% and the recommendation accuracy by 10% compared to the baseline for VGG-Sound, a public video dataset. Additionally, audio-visual representations learned by training the attention model with cross-modal contrastive learning further improve the recommendation performance, based on our evaluation using VGG-Sound.
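The abstract does not specify the form of the contrastive objective or the retrieval step, so the following is only a minimal sketch under stated assumptions: a symmetric InfoNCE-style loss over cosine similarities, followed by nearest-neighbor ranking of sound embeddings for a visual query. The function names, the temperature value of 0.1, and the toy embeddings are all hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine_similarity_matrix(audio, visual):
    """Entry [i][j] is the cosine similarity of audio[i] and visual[j]."""
    a = [l2_normalize(v) for v in audio]
    b = [l2_normalize(v) for v in visual]
    return [[sum(x * y for x, y in zip(ai, bj)) for bj in b] for ai in a]


def _cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def cross_modal_contrastive_loss(audio, visual, temperature=0.1):
    """Symmetric InfoNCE-style loss (an assumed form, not the paper's).

    audio[i] and visual[i] are treated as a matching pair; every other
    pairing in the batch is treated as a negative.
    """
    sim = cosine_similarity_matrix(audio, visual)
    logits = [[s / temperature for s in row] for row in sim]
    transposed = [list(col) for col in zip(*logits)]
    audio_to_visual = _cross_entropy_diagonal(logits)
    visual_to_audio = _cross_entropy_diagonal(transposed)
    return 0.5 * (audio_to_visual + visual_to_audio)


def recommend_sounds(visual_embedding, audio_bank, k=3):
    """Return indices of the k audio embeddings closest to the visual query."""
    q = l2_normalize(visual_embedding)
    scores = [
        sum(x * y for x, y in zip(q, l2_normalize(a))) for a in audio_bank
    ]
    ranked = sorted(range(len(audio_bank)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy 2-D embeddings: aligned pairs should give a lower loss than swapped pairs.
aligned_loss = cross_modal_contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
swapped_loss = cross_modal_contrastive_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
top_sounds = recommend_sounds([1.0, 0.1], [[0, 1], [1, 0], [0.7, 0.7]], k=2)
```

In this sketch, a lower loss means that matching audio and visual embeddings are closer to each other than to the other items in the batch, which is the property a similarity-based sound recommender relies on.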
Related papers
- Learning Speech Representations From Raw Audio By Joint Audiovisual Self-supervision (2020) · 0.00
- Learning Contextually Fused Audio-visual Representations For Audio-visual Speech Recognition (2022) · 6.77
- Investigating Self-supervised Representations For Audio-visual Deepfake Detection (2025) · 0.00
- Conformer-based Self-supervised Learning For Non-speech Audio Tasks (2021) · 7.50
- Recent Advances And Challenges In Deep Audio-visual Correlation Learning (2022) · 5.24
- AV-data2vec: Self-supervised Learning Of Audio-visual Speech Representations With Contextualized Target Representations (2023) · 0.00
- Cross-modal Audio-visual Co-learning For Text-independent Speaker Verification (2023) · 9.23
- Perfect Match: Improved Cross-modal Embeddings For Audio-visual Synchronisation (2018) · 14.19