A Study On Zero-shot Non-intrusive Speech Assessment Using Large Language Models
2024 · Ryandhimas E. Zezario, Sabato M. Siniscalchi, Hsin-Min Wang, et al.
Abstract
This work investigates two strategies for zero-shot non-intrusive speech assessment leveraging large language models. First, we explore the audio analysis capabilities of GPT-4o. Second, we propose GPT-Whisper, which uses Whisper as an audio-to-text module and evaluates the naturalness of text via targeted prompt engineering. We evaluate the assessment metrics predicted by GPT-4o and GPT-Whisper, examining their correlation with human-based quality and intelligibility assessments and the character error rate (CER) of automatic speech recognition. Experimental results show that GPT-4o alone is less effective for audio analysis, while GPT-Whisper achieves higher prediction accuracy, has moderate correlation with speech quality and intelligibility, and has higher correlation with CER. Compared to SpeechLMScore and DNSMOS, GPT-Whisper excels in intelligibility metrics, but performs slightly worse than SpeechLMScore in quality estimation. Furthermore, GPT-Whisper outperforms supervised non-
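The evaluation described in the abstract compares predicted scores with human ratings and with the character error rate (CER) of a speech recognizer. The following is a minimal sketch of those two computations, not the authors' code. The function names `edit_distance`, `cer`, and `pearson` are hypothetical, and the use of Pearson linear correlation is an assumption, since the abstract does not say which correlation coefficient is reported.

```python
from math import sqrt


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference characters reaches an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by the reference length."""
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson linear correlation (assumed here; the abstract names no coefficient)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

In use, `cer` would be applied to each reference transcript and its ASR output, and `pearson` would then be called on the list of predicted scores against the list of human ratings or per-utterance CER values. Since a higher CER indicates worse recognition, a useful score would be expected to correlate negatively with it.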
Related papers
- Prompting The Hidden Talent Of Web-scale Speech Models For Zero-shot Task Generalization (2023) 16.38
- Probing The Hidden Talent Of ASR Foundation Models For L2 English Oral Assessment (2025) 0.00
- The Zero Resource Speech Benchmark 2021: Metrics And Baselines For Unsupervised Spoken Language Modeling (2020) 0.00
- Whisper-LM: Improving ASR Models With Language Models For Low-resource Languages (2025) 3.29
- Multilingual DistilWhisper: Efficient Distillation Of Multi-task Speech Models Via Language-specific Experts (2023) 8.09
- Fine-tuning Whisper On Low-resource Languages For Real-world Applications (2024) 0.00
- Putting GPT-4o To The Sword: A Comprehensive Evaluation Of Language, Vision, Speech, And Multimodal Proficiency (2024) 0.00
- Investigating The Emergent Audio Classification Ability Of ASR Foundation Models (2023) 5.84