Human-CLAP: Human-perception-based Contrastive Language-audio Pretraining
2025 · Taisei Takano, Yuki Okamoto, Yusuke Kanamori, et al.
Abstract
Contrastive language-audio pretraining (CLAP) is widely used for audio generation and recognition tasks. For example, CLAPScore, which uses the similarity of CLAP embeddings, has become a major metric for evaluating the relevance between audio and text in text-to-audio tasks. However, the relationship between CLAPScore and human subjective evaluation scores remains unclear. We show that CLAPScore has a low correlation with human subjective evaluation scores. Additionally, we propose a human-perception-based CLAP called Human-CLAP, obtained by training a contrastive language-audio model using subjective evaluation scores. Our experiments indicate that Human-CLAP improves the Spearman's rank correlation coefficient (SRCC) between CLAPScore and the subjective evaluation scores by more than 0.25 compared with the conventional CLAP.
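The two quantities named in the abstract can be illustrated with a minimal sketch. It assumes CLAPScore is the plain cosine similarity between an audio embedding and a text embedding (some implementations may scale or clip it), and that SRCC is the Pearson correlation of average ranks. The embeddings and human ratings below are made-up toy values, not data from the paper, and all function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy (hypothetical) embeddings standing in for CLAP encoder outputs.
audio_embs = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.5, 0.5, 0.5], [0.0, 0.1, 0.9]]
text_embs = [[1.0, 0.0, 0.0], [0.7, 0.3, 0.0], [0.1, 0.2, 0.9], [0.9, 0.1, 0.1]]
# Toy (hypothetical) human relevance ratings for the same audio-text pairs.
human_scores = [9.0, 4.0, 6.5, 1.0]

clap_scores = [cosine_similarity(a, t) for a, t in zip(audio_embs, text_embs)]
srcc = spearman(clap_scores, human_scores)
print("CLAPScore per pair:", [round(s, 3) for s in clap_scores])
print("SRCC vs. human scores:", round(srcc, 3))
```

An SRCC near 1 would mean the metric ranks audio-text pairs the same way listeners do; the abstract's claim is that the conventional CLAPScore falls well short of this and that Human-CLAP narrows the gap.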
Related papers
- M2D-CLAP: Masked Modeling Duo Meets CLAP For Learning General-purpose Audio-language Representation (2024) 7.81
- Clapspeech: Learning Prosody From Text Context With Contrastive Language-audio Pre-training (2023) 0.00
- CLASP: Contrastive Language-speech Pretraining For Multilingual Multimodal Information Retrieval (2024) 0.00
- Do Audio-language Models Understand Linguistic Variations? (2024) 0.00
- Aligning Audio Captions With Human Preferences (2025) 0.00
- Spo-clapscore: Enhancing Clap-based Alignment Prediction System With Standardize Preference Optimization, For The First XACLE Challenge (2026) 0.78
- Gemo-clap: Gender-attribute-enhanced Contrastive Language-audio Pretraining For Accurate Speech Emotion Recognition (2023) 0.00
- Collap: Contrastive Long-form Language-audio Pretraining With Musical Temporal Structure Augmentation (2024) 3.58