Large Language Model Guided Decoding For Self-supervised Speech Recognition
2025 Β· Eyal Cohen, Bhiksha Raj, Joseph Keshet
Abstract
Self-supervised automatic speech recognition (SSL-ASR) is an ASR approach that uses speech encoders pretrained on large amounts of unlabeled audio (e.g., wav2vec2.0 or HuBERT) and then fine-tunes them with limited labeled data to perform transcription. Decoding is usually performed with a CTC decoder, whose hypotheses are scored and refined using an external language model (LM), typically an n-gram or neural LM, which guides beam search to produce the final transcription. Using Large Language Models (LLMs) as external LMs remains a challenge, as their word probabilities are overly confident. The proposed method integrates an LLM with an SSL acoustic model by using the LLM's decoding mechanism to generate a set of candidate next tokens. For each candidate, the SSL model provides an acoustic score by aligning it to the input acoustics of the SSL model. A combined acoustic and LLM score is then calculated based on decomposing the MAP estimator of words given the acoustic signal. The token
Authors
(none)
Tags
Stats
Related papers
- Discreteslu: A Large Language Model With Self-supervised Discrete Speech Units For Spoken Language Understanding (2024)5.84
- From Hype To Insight: Rethinking Large Language Model Integration In Visual Speech Recognition (2025)0.00
- Deploying Self-supervised Learning In The Wild For Hybrid Automatic Speech Recognition (2022)0.00
- Towards Supervised Performance On Speaker Verification With Self-supervised Learning By Leveraging Large-scale ASR Models (2024)7.50
- Automatic Pronunciation Assessment Using Self-supervised Speech Representation Learning (2022)0.00
- Efficient Infusion Of Self-supervised Representations In Automatic Speech Recognition (2024)0.00
- Wavlm: Large-scale Self-supervised Pre-training For Full Stack Speech Processing (2021)24.00
- Multi-stage Large Language Model Correction For Speech Recognition (2023)0.00