Abstract

Automatic speech recognition (ASR) systems are commonly evaluated with reference-based metrics such as the Word Error Rate (WER), which are computed against manual ground-truth transcriptions that are time-consuming and expensive to obtain. This work proposes a multilingual referenceless quality metric, which allows comparing the performance of different ASR models on a speech dataset without ground-truth transcriptions. To estimate the quality of ASR hypotheses, a pre-trained language model (LM) is fine-tuned with contrastive learning in a self-supervised manner. In experiments conducted on several unseen test datasets consisting of outputs from top commercial ASR engines in various languages, the proposed referenceless metric obtains a much higher correlation with WER scores and their ranks than the perplexity metric from the state-of-the-art multilingual LM in all experiments, and also reduces WER by more than \(7\%\) when used for ensembling hypotheses. The fine-tun
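For readers unfamiliar with the reference-based baseline, the following is a minimal sketch of how WER is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) between a reference and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; this is not the authors' code, and it does not implement the proposed referenceless metric.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative sketch: tokens are obtained by whitespace splitting,
    with no text normalization (casing, punctuation) applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because this computation needs a human transcription for every utterance, it is exactly the cost that a referenceless metric is meant to avoid.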

Authors

(none)

Tags

  • Speech Recognition
  • Speech Translation

Stats

  • citations: 0
  • S2 citations: n/a
  • github stars: 17
  • HF likes: 0
  • heat score: 2.51
  • arxiv key: yuksel2023a

Related papers