PAEFF: Precise Alignment And Enhanced Gated Feature Fusion For Face-voice Association
2025 · Abdul Hannan, Muhammad Arslan Manzoor, Shah Nawaz, et al.
Abstract
We study the task of learning associations between faces and voices, which has recently been gaining interest in the multimodal community. Existing methods suffer from the deliberate crafting of negative mining procedures as well as reliance on the distant margin parameter. These issues have been addressed by learning a joint embedding space in which orthogonality constraints are applied to the fused embeddings of faces and voices. However, the embedding spaces of faces and voices possess different characteristics and must be aligned before they are fused. To this end, we propose a method that accurately aligns the embedding spaces and fuses them with an enhanced gated fusion, thereby improving the performance of face-voice association. Extensive experiments on the VoxCeleb dataset reveal the merits of the proposed approach.
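The abstract names three ingredients (alignment of the two embedding spaces, gated fusion, and orthogonality constraints on the fused embeddings) without giving architectural details. The sketch below is only a generic illustration of how such pieces can fit together, not the authors' implementation. Everything in it is an assumption made for illustration: the projection matrices `W_f` and `W_v` stand in for the alignment step, the sigmoid gate is a common textbook form of gated fusion, and `orthogonality_penalty` is one simple way to push fused embeddings of different identities toward orthogonality.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gated_fusion(face, voice, W_f, W_v, W_g, b_g):
    """Project both modalities into a shared space, then mix them with a gate.

    face:  (batch, d_face) face embeddings
    voice: (batch, d_voice) voice embeddings
    W_f, W_v: hypothetical projections into a shared d-dimensional space
    W_g, b_g: hypothetical gate parameters, shapes (2d, d) and (d,)
    """
    # Stand-in for alignment: map both modalities to the same unit sphere.
    f = l2_normalize(np.tanh(face @ W_f))
    v = l2_normalize(np.tanh(voice @ W_v))
    # Per-dimension gate in (0, 1) computed from both modalities.
    g = sigmoid(np.concatenate([f, v], axis=-1) @ W_g + b_g)
    # Convex combination of the two aligned embeddings.
    return l2_normalize(g * f + (1.0 - g) * v)


def orthogonality_penalty(fused, labels):
    """Pull same-identity pairs together, push different identities toward
    zero cosine similarity. Assumes rows of `fused` are unit-norm."""
    sim = fused @ fused.T
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    pos = same & off_diag
    neg = ~same
    pos_term = (1.0 - sim[pos]).mean() if pos.any() else 0.0
    neg_term = np.abs(sim[neg]).mean() if neg.any() else 0.0
    return pos_term + neg_term
```

In this form no negative mining or margin is needed: the penalty is computed over all pairs in a batch, and the target for different identities is simply zero similarity. The dimensions and parameter shapes are placeholders to be chosen by the reader.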
Related papers
- Contrastive Learning-based Chaining-cluster For Multilingual Voice-face Association (2024) 4.78
- Cross-modal Speaker Verification And Recognition: A Multilingual Perspective (2020) 0.00
- Robust Audio-visual Target Speaker Extraction With Emotion-aware Multiple Enrollment Fusion (2025) 0.00
- Attention-based Audio-visual Fusion For Robust Automatic Speech Recognition (2018) 16.67
- AMFFCN: Attentional Multi-layer Feature Fusion Convolution Network For Audio-visual Speech Enhancement (2021) 0.00
- Enhancing Modal Fusion By Alignment And Label Matching For Multimodal Emotion Recognition (2024) 6.34
- Attentive Fusion Enhanced Audio-visual Encoding For Transformer Based Robust Speech Recognition (2020) 0.00
- Comparative Analysis Of Modality Fusion Approaches For Audio-visual Person Identification And Verification (2024) 0.00