From Alignment To Advancement: Bootstrapping Audio-language Alignment With Synthetic Data
2025 · Chun-Yi Kuan, Hung-Yi Lee
Abstract
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs. These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks. This adaptation process presents two major limitations. First, ALLMs often suffer from catastrophic forgetting, where crucial textual capabilities like instruction-following are lost after training on audio data. In some cases, models may even hallucinate sounds that are not present in the input audio, raising concerns about reliability. Second, achieving cross-modal alignment between audio and language typically relies on large collections of task-specific question-answer pairs for instruction tuning, making it resource-intensive. To address these issues, previous works have leveraged the backbone LLMs to synthesize general-purpose, caption-style alignment data. In this paper, we propose a data generation framework that produces contras
Related papers
- AudioLM: A Language Modeling Approach To Audio Generation (2022)
- A Framework For Synthetic Audio Conversations Generation Using Large Language Models (2024)
- MATS: An Audio Language Model Under Text-only Supervision (2025)
- AudioToolAgent: An Agentic Framework For Audio-language Models (2025)
- AudioSetCaps: An Enriched Audio-caption Dataset Using Automated Generation Pipeline With Large Audio And Language Models (2024)
- Measuring Audio's Impact On Correctness: Audio-contribution-aware Post-training Of Large Audio Language Models (2025)
- BLSP: Bootstrapping Language-speech Pre-training Via Behavior Alignment Of Continuation Writing (2023)
- C3LLM: Conditional Multimodal Content Generation Using Large Language Models (2024)