Voice-face Cross-modal Matching And Retrieval: A Benchmark
2019 · Chuyuan Xiong, Deyuan Zhang, Tao Liu, et al.
Abstract
Cross-modal associations between a person's voice and face can be learnt algorithmically, which benefits many applications. The problem can be formulated as voice-face matching and retrieval tasks. These tasks have recently attracted considerable research attention, but the research is still at an early stage: test schemes based on random tuple mining tend to have low test confidence, the generalization ability of models cannot be evaluated on small-scale datasets, and performance metrics for the various tasks are scarce. A benchmark for this problem needs to be established. In this paper, a framework based on comprehensive studies is first proposed for voice-face matching and retrieval. It achieves state-of-the-art performance under various performance metrics on different tasks, with high test confidence on large-scale datasets, and can serve as a baseline for follow-up research. In this framework, a voice-anchored L2-norm constrained metric space is proposed, and cross-modal
Related papers
- Seeking The Shape Of Sound: An Adaptive Framework For Learning Voice-face Association (2021) 11.39
- Fuse After Align: Improving Face-voice Association Learning Via Multimodal Encoder (2024) 0.00
- Towards Identity-aware Cross-modal Retrieval: A Dataset And A Baseline (2024) 1.56
- Audio Retrieval With Natural Language Queries: A Benchmark Study (2021) 16.29
- Rethinking Benchmarks For Cross-modal Image-text Retrieval (2023) 13.11
- Learnt Quasi-transitive Similarity For Retrieval From Large Collections Of Faces (2016) 5.24
- Learnable Pins: Cross-modal Embeddings For Person Identity (2018) 15.22
- Decoupled Cross-modal Alignment Network For Text-rgbt Person Retrieval And A High-quality Benchmark (2025) 0.00