SPO-CLAPScore: Enhancing CLAP-based Alignment Prediction System With Standardize Preference Optimization, For The First XACLE Challenge
2026 · Taisei Takano, Ryoya Yoshida
Abstract
The first XACLE Challenge (x-to-audio alignment challenge) addresses the critical need for automatic evaluation metrics that correlate with human perception of audio-text semantic alignment. In this paper, we describe the "Takano_UTokyo_03" system submitted to the XACLE Challenge. Our approach leverages a CLAPScore-based architecture integrated with a novel training method called Standardized Preference Optimization (SPO). SPO standardizes the raw alignment scores provided by each listener, enabling the model to learn relative preferences and mitigate the impact of individual scoring biases. Additionally, we employ listener screening to exclude listeners with inconsistent ratings. Experimental evaluations demonstrate that both SPO and listener screening effectively improve the correlation with human judgment. Our system achieved 6th place in the challenge with a Spearman's rank correlation coefficient (SRCC) of 0.6142, demonstrating competitive performance within a marginal gap from the to
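The per-listener standardization described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so this assumes a plain z-score computed within each listener's own ratings; the function name `standardize_per_listener`, the `eps` term, and the toy ratings are hypothetical and not taken from the paper.

```python
import numpy as np

def standardize_per_listener(scores, listener_ids, eps=1e-8):
    """Z-score each raw rating with the mean and std of the listener who gave it.

    Assumption: a simple per-listener z-score; the paper's exact SPO
    formulation is not shown in this excerpt.
    """
    scores = np.asarray(scores, dtype=float)
    listener_ids = np.asarray(listener_ids)
    out = np.empty_like(scores)
    for lid in np.unique(listener_ids):
        mask = listener_ids == lid
        mu = scores[mask].mean()
        sigma = scores[mask].std()
        # eps avoids division by zero when a listener gives constant ratings
        out[mask] = (scores[mask] - mu) / (sigma + eps)
    return out

# Toy ratings (hypothetical): listener "a" scores high, listener "b" scores low,
# but both rank their three clips in the same order.
raw = [8.0, 9.0, 10.0, 1.0, 2.0, 3.0]
who = ["a", "a", "a", "b", "b", "b"]
z = standardize_per_listener(raw, who)
```

In this toy case both listeners end up with identical standardized scores, so a model trained on `z` would see only their shared relative preference and not the offset between their personal scales.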
Related papers
- Human-CLAP: Human-perception-based Contrastive Language-audio Pretraining (2025)
- HCLAS-X: Hierarchical And Cascaded Lyrics Alignment System Using Multimodal Cross-correlation (2023)
- PAT: Parameter-free Audio-text Aligner To Boost Zero-shot Audio Classification (2024)
- Putting HUMANS First: Efficient LAM Evaluation With Human Preference Alignment (2026)
- The T12 System For AudioMOS Challenge 2025: Audio Aesthetics Score Prediction System Using KAN- And VERSA-based Models (2025)
- M2D-CLAP: Masked Modeling Duo Meets CLAP For Learning General-purpose Audio-language Representation (2024)
- SpeechColab Leaderboard: An Open-source Platform For Automatic Speech Recognition Evaluation (2024)
- CLAIR-A: Leveraging Large Language Models To Judge Audio Captions (2024)