Tiny-Align: Bridging Automatic Speech Recognition And Large Language Model On The Edge
2024 · Ruiyang Qin, Dancheng Liu, Gelei Xu, et al.
Abstract
The combination of Large Language Models (LLM) and Automatic Speech Recognition (ASR), when deployed on edge devices (called edge ASR-LLM), can serve as a powerful personalized assistant to enable audio-based interaction for users. Compared to text-based interaction, edge ASR-LLM allows accessible and natural audio interactions. Unfortunately, existing ASR-LLM models are mainly trained in high-performance computing environments and produce substantial model weights, making them difficult to deploy on edge devices. More importantly, to better serve users' personalized needs, the ASR-LLM must be able to learn from each distinct user, given that audio input often contains highly personalized characteristics that necessitate personalized on-device training. Since individually fine-tuning the ASR or LLM often leads to suboptimal results due to modality-specific limitations, end-to-end training ensures seamless integration of audio features and language understanding (cross-modal alignment),
Related papers
- Integrating Pre-trained Speech And Language Models For End-to-end Speech Recognition (2023)
- A Comprehensive Solution To Connect Speech Encoder And Large Language Model For ASR (2024)
- Exploring The Integration Of Large Language Models Into Automatic Speech Recognition Systems: An Empirical Study (2023)
- Recent Advances In Speech Language Models: A Survey (2024)
- A Multimodal Approach To Device-directed Speech Detection With Large Language Models (2024)
- Seed-ASR: Understanding Diverse Speech And Contexts With LLM-based Speech Recognition (2024)
- Omni-AVSR: Towards Unified Multimodal Speech Recognition With Large Language Models (2025)
- Multi-stage Large Language Model Correction For Speech Recognition (2023)