It's Never Too Late: Fusing Acoustic Information Into Large Language Models For Automatic Speech Recognition
2024 · Chen Chen, Ruizhe Li, Yuchen Hu, et al.
Abstract
Recent studies have shown that large language models (LLMs) can be successfully used for generative error correction (GER) on top of the automatic speech recognition (ASR) output. Specifically, an LLM is used to map the N-best hypotheses list generated by an ASR system directly to the predicted output transcription. However, despite its effectiveness, GER introduces extra data uncertainty, since the LLM is trained without taking into account the acoustic information available in the speech signal. In this work, we aim to overcome this limitation by infusing acoustic information before generating the predicted transcription through a novel late fusion solution termed Uncertainty-Aware Dynamic Fusion (UADF). UADF is a multimodal fusion approach implemented in an auto-regressive decoding process and works in two stages: (i) it first analyzes and calibrates the token-level LLM decision, and (ii) it then dynamically assimilates the information from the aco
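The two stages named in the abstract can be illustrated with a minimal sketch of one decoding step. Everything specific here is an assumption made for illustration and is not taken from the paper: calibration is done by temperature scaling, uncertainty is measured as normalized entropy of the calibrated LLM distribution, and the fusion weight on the acoustic model is set equal to that uncertainty. The function names (`log_softmax`, `normalized_entropy`, `fuse_step`) and the `temperature` value are hypothetical, and both models are assumed to score the same vocabulary of at least two tokens.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Return log-probabilities for a list of logits, after temperature scaling."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_norm for x in scaled]


def normalized_entropy(log_probs):
    """Entropy of the distribution divided by its maximum, log(vocab size); lies in [0, 1]."""
    h = -sum(math.exp(lp) * lp for lp in log_probs)
    return h / math.log(len(log_probs))


def fuse_step(llm_logits, asr_logits, temperature=1.5):
    """One late-fusion decoding step (illustrative assumption, not the paper's rule).

    Stage (i): calibrate the token-level LLM distribution (here: temperature scaling).
    Stage (ii): add the acoustic model's log-probabilities, weighted by how
    uncertain the calibrated LLM is about this token.
    """
    llm_lp = log_softmax(llm_logits, temperature)  # stage (i), assumed calibration
    asr_lp = log_softmax(asr_logits)
    weight = normalized_entropy(llm_lp)            # stage (ii), assumed weighting
    fused = [l + weight * a for l, a in zip(llm_lp, asr_lp)]
    fused_lp = log_softmax(fused)                  # renormalize the fused scores
    return [math.exp(lp) for lp in fused_lp], weight


# A confident LLM keeps its own choice; an uncertain one defers to the acoustic model.
confident_probs, confident_w = fuse_step([10.0, 0.0, 0.0], [0.0, 5.0, 0.0])
uncertain_probs, uncertain_w = fuse_step([0.1, 0.0, 0.0], [0.0, 5.0, 0.0])
```

In this sketch the acoustic scores barely move the result when the LLM is confident, and dominate when the LLM distribution is close to uniform. The weighting rule is one simple choice among many; the paper's actual formulation may differ.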
Related papers
- Delayed Fusion: Integrating Large Language Models Into First-pass Decoding In End-to-end Speech Recognition (2025)
- Multi-stage Large Language Model Correction For Speech Recognition (2023)
- Multilingual And Fully Non-autoregressive ASR With Large Language Model Fusion: A Comprehensive Study (2024)
- Let's Fuse Step By Step: A Generative Fusion Decoding Algorithm With LLMs For Robust And Instruction-aware ASR And OCR (2024)
- Exploring The Integration Of Large Language Models Into Automatic Speech Recognition Systems: An Empirical Study (2023)
- Multimodal Large Language Models With Fusion Low Rank Adaptation For Device Directed Speech Detection (2024)
- Large Language Model Based Generative Error Correction: A Challenge And Baselines For Speech Recognition, Speaker Tagging, And Emotion Recognition (2024)
- Generative Error Correction For Code-switching Speech Recognition Using Large Language Models (2023)