On Word Error Rate Definitions And Their Efficient Computation For Multi-speaker Speech Recognition Systems
2022 · Thilo von Neumann, Christoph Boeddeker, Keisuke Kinoshita, et al.
Abstract
We propose a general framework for computing the word error rate (WER) of automatic speech recognition (ASR) systems that take recordings containing multiple speakers as input and produce multiple output word sequences, i.e., multiple-input multiple-output (MIMO) systems. Such ASR systems are typically required for applications such as meeting transcription. We provide an efficient implementation based on a dynamic programming search in a multi-dimensional Levenshtein distance tensor, under the constraint that each reference utterance must be matched consistently with a single hypothesis output. This also yields an efficient implementation of the optimal reference combination (ORC) WER, whose computation previously had exponential complexity. We give an overview of commonly used WER definitions for multi-speaker scenarios and show that they are specializations of the above MIMO WER, each tuned to a particular application scenario. We conclude with a discussion of the pros and cons of the various WER definitions and recommendations on when to use each.
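To make the quantities in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It shows the standard word-level Levenshtein distance, the single-speaker WER built on it, and a naive ORC-style search that tries every assignment of reference utterances to output channels. The function names and the toy data are hypothetical, and the brute-force search is included only to illustrate the exponential cost that the abstract says the proposed dynamic programming approach avoids.

```python
from itertools import product


def levenshtein(ref, hyp):
    """Word-level edit distance between two token lists (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Single-speaker WER: edit distance divided by reference length."""
    return levenshtein(ref, hyp) / len(ref)


def orc_wer_bruteforce(ref_utterances, hyp_channels):
    """Naive ORC-style WER (assumed definition, for illustration only).

    Each reference utterance is assigned to exactly one output channel,
    utterance order is kept within a channel, and the assignment with the
    fewest total errors wins. The loop visits C**U assignments for C
    channels and U utterances, i.e. exponentially many.
    """
    num_channels = len(hyp_channels)
    best = None
    for assignment in product(range(num_channels), repeat=len(ref_utterances)):
        streams = [[] for _ in range(num_channels)]
        for utt, ch in zip(ref_utterances, assignment):
            streams[ch].extend(utt)
        errors = sum(levenshtein(s, h) for s, h in zip(streams, hyp_channels))
        if best is None or errors < best:
            best = errors
    total_ref_words = sum(len(u) for u in ref_utterances)
    return best / total_ref_words


# Toy example (hypothetical data): three reference utterances, two outputs.
refs = [["hello", "world"], ["how", "are", "you"], ["fine", "thanks"]]
hyps = [["hello", "world", "fine", "thanks"], ["how", "is", "you"]]
score = orc_wer_bruteforce(refs, hyps)  # 1 error over 7 reference words
```

In this toy case the best assignment places the first and third utterances on the first channel and the second utterance on the second channel, leaving a single substitution. With 3 utterances and 2 channels the search visits only 8 assignments, but the count grows as the number of channels raised to the number of utterances.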
Related papers
- Beyond Levenshtein: Leveraging Multiple Algorithms For Robust Word Error Rate Computations And Granular Error Classifications (2024)
- Automatic Speech Recognition System-independent Word Error Rate Estimation (2024)
- Fast Word Error Rate Estimation Using Self-supervised Representations For Speech And Text (2023)
- MIMO-SPEECH: End-to-end Multi-channel Multi-speaker Speech Recognition (2019)
- Speech Emotion Recognition With ASR Transcripts: A Comprehensive Study On Word Error Rate And Fusion Techniques (2024)
- Semantic-WER: A Unified Metric For The Evaluation Of ASR Transcript For End Usability (2021)
- On The Impact Of Word Error Rate On Acoustic-linguistic Speech Emotion Recognition: An Update For The Deep Learning Era (2021)
- Predicting Word Error Rate For Reverberant Speech (2019)