Cross-modal Audio-visual Co-learning For Text-independent Speaker Verification
2023 · Meng Liu, Kong Aik Lee, Longbiao Wang, et al.
Abstract
Visual speech (i.e., lip motion) is highly related to auditory speech due to the co-occurrence and synchronization in speech production. This paper investigates this correlation and proposes a cross-modal speech co-learning paradigm. The primary motivation of our cross-modal co-learning method is modeling one modality aided by exploiting knowledge from another modality. Specifically, two cross-modal boosters are introduced based on an audio-visual pseudo-siamese structure to learn the modality-transformed correlation. Inside each booster, a max-feature-map embedded Transformer variant is proposed for modality alignment and enhanced feature generation. The network is co-learned both from scratch and with pretrained models. Experimental results on the LRSLip3, GridLip, LomGridLip, and VoxLip datasets demonstrate that our proposed method achieves 60% and 20% average relative performance improvement over independently trained audio-only/visual-only and baseline fusion systems, respectively.
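The abstract mentions a max-feature-map (MFM) operation embedded in a Transformer variant but does not spell it out. As a rough illustration only, the sketch below shows the usual MFM idea: split a feature vector into two halves and keep the element-wise maximum, which halves the dimension. The function name `max_feature_map` and the sample values are assumptions made for this example; where and how the operation sits inside the paper's booster is not described here.

```python
from typing import List


def max_feature_map(features: List[float]) -> List[float]:
    """Element-wise max over the two halves of a feature vector.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    if len(features) % 2 != 0:
        raise ValueError("feature dimension must be even")
    half = len(features) // 2
    # Pair the i-th element of the first half with the i-th of the second half.
    return [max(a, b) for a, b in zip(features[:half], features[half:])]


# One 4-dimensional frame is reduced to 2 dimensions.
frame = [1.0, -2.0, 0.5, 3.0]
reduced = max_feature_map(frame)
print(reduced)
```

Because each output keeps the larger of two competing activations, the operation acts as a simple learned feature selector rather than a fixed threshold such as ReLU.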
Related papers
- Improving Audio-visual Speech Recognition By Lip-subword Correlation Based Visual Pre-training And Cross-modal Fusion Encoder (2023) 6.34
- Attention-based Audio-visual Fusion For Robust Automatic Speech Recognition (2018) 16.67
- AlignVSR: Audio-visual Cross-modal Alignment For Visual Speech Recognition (2024) 0.00
- Multi-modal Multi-correlation Learning For Audio-visual Speech Separation (2022) 5.84
- Learning Contextually Fused Audio-visual Representations For Audio-visual Speech Recognition (2022) 6.77
- Cross-modal Speaker Verification And Recognition: A Multilingual Perspective (2020) 0.00
- LipSound2: Self-supervised Pre-training For Lip-to-speech Reconstruction And Lip Reading (2021) 11.39
- Audio-visual Speech Separation Based On Joint Feature Representation With Cross-modal Attention (2022) 0.00