UniAudio 1.5: Large Language Model-driven Audio Codec Is A Few-shot Audio Task Learner
2024 · Dongchao Yang, Haohan Guo, Yuanyuan Wang, et al.
Abstract
Large language models (LLMs) have demonstrated strong capabilities in text understanding and generation, but they cannot be directly applied to cross-modal tasks without fine-tuning. This paper proposes a cross-modal in-context learning approach that enables frozen LLMs to perform multiple audio tasks in a few-shot manner without any parameter update. Specifically, we propose a novel LLM-driven audio codec model, LLM-Codec, that transfers the audio modality into the textual space, i.e., represents audio tokens as words or sub-words in the LLM's vocabulary, while maintaining high audio reconstruction quality. The key idea is to reduce the modality heterogeneity between text and audio by compressing the audio modality into the token space of a well-trained LLM. The audio representation can thus be viewed as a new foreign language, which LLMs can learn from several demonstrations. In experiments, we investigate the performance of the
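As a rough illustration of the idea described in the abstract, the sketch below maps audio frame embeddings to the nearest word in a frozen vocabulary and then formats a few-shot prompt from the resulting "words". Everything here is an assumption made for illustration: the toy `VOCAB` table, the 2-D embeddings, the nearest-neighbour lookup, and the prompt layout are hypothetical and are not LLM-Codec's actual encoder, quantizer, or prompt format.

```python
import math

# Hypothetical toy "LLM vocabulary": each word has a frozen 2-D embedding.
# A real LLM vocabulary has tens of thousands of high-dimensional entries.
VOCAB = {
    "cat": [1.0, 0.0],
    "dog": [0.0, 1.0],
    "sun": [-1.0, 0.0],
    "sea": [0.0, -1.0],
}


def nearest_word(frame):
    """Return the vocabulary word whose embedding is closest (L2) to frame."""
    return min(VOCAB, key=lambda word: math.dist(VOCAB[word], frame))


def audio_to_words(frames):
    """Quantize a sequence of frame embeddings into vocabulary words."""
    return [nearest_word(frame) for frame in frames]


def build_few_shot_prompt(demos, query_frames):
    """Format (frames, label) demonstrations followed by one unlabeled query."""
    lines = []
    for frames, label in demos:
        words = " ".join(audio_to_words(frames))
        lines.append(f"audio: {words} -> label: {label}")
    query_words = " ".join(audio_to_words(query_frames))
    lines.append(f"audio: {query_words} -> label:")
    return "\n".join(lines)
```

In this toy setup the vocabulary stays fixed and only the audio side is quantized onto it, so a frozen text-only model could read the resulting prompt as ordinary tokens and be asked to complete the final label.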
Related papers
- UniAudio: An Audio Foundation Model Toward Universal Audio Generation (2023) · 5.56
- Codec Does Matter: Exploring The Semantic Shortcoming Of Codec For Audio Language Model (2024) · 15.02
- Large Language Models Are Strong Audio-visual Speech Recognition Learners (2024) · 9.59
- C3LLM: Conditional Multimodal Content Generation Using Large Language Models (2024) · 0.00
- AudioLM: A Language Modeling Approach To Audio Generation (2022) · 18.91
- Improving Audio Codec-based Zero-shot Text-to-speech Synthesis With Multi-modal Context And Large Language Model (2024) · 2.26
- MoWE-Audio: Multitask AudioLLMs With Mixture Of Weak Encoders (2024) · 3.58
- Enhancing Automated Audio Captioning Via Large Language Models With Optimized Audio Encoding (2024) · 5.24