LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT
2023 · Zhihao Du, Jiaming Wang, Qian Chen, et al.
Abstract
Generative Pre-trained Transformer (GPT) models have achieved remarkable performance on various natural language processing tasks, and have shown great potential as backbones for audio-and-text large language models (LLMs). Previous mainstream audio-and-text LLMs use discrete audio tokens to represent both input and output audio; however, they suffer from performance degradation on tasks such as automatic speech recognition, speech-to-text translation, and speech enhancement compared with models that use continuous speech features. In this paper, we propose LauraGPT, a novel unified audio-and-text GPT-based LLM for audio recognition, understanding, and generation. LauraGPT is a versatile LLM that can process both audio and text inputs and generate outputs in either modality. We propose a novel data representation that combines continuous and discrete features for audio: LauraGPT encodes input audio into continuous representations using an audio encoder and generates output audio from discrete co
Related papers
- Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer (2024) 5.96
- Paralinguistics-enhanced Large Language Modeling of Spoken Dialogue (2023) 0.00
- UniAudio: An Audio Foundation Model Toward Universal Audio Generation (2023) 5.56
- Audio-Agent: Leveraging LLMs for Audio Generation, Editing and Composition (2024) 0.00
- AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining (2023) 0.00
- AudioLM: A Language Modeling Approach to Audio Generation (2022) 18.91
- Pengi: An Audio Language Model for Audio Tasks (2023) 10.35
- Text-to-audio Generation Using Instruction-tuned LLM and Latent Diffusion Model (2023) 0.00