Abstract

Currently, high-quality, synchronized audio is synthesized from video and optional text inputs using various multi-modal joint learning frameworks. However, the precise alignment between the visual and generated audio domains remains far from satisfactory. One key factor is the lack of sufficient temporal and semantic alignment annotations in open-source video-audio and text-audio benchmarks. Therefore, we propose a framework for audio generation from videos, leveraging the internal chain-of-thought (CoT) of a multi-modal large language model (MLLM) to enable step-by-step reasoning without requiring additional annotations. Additionally, a corresponding multi-modal reasoning dataset is constructed to facilitate the learning of initial reasoning in audio generation. In the experiments, we demonstrate the effectiveness of the proposed framework in reducing misalignment (voice-over) in generated audio and achieving competitive performance compared to various state-of-the-art models. The ev

Authors

(none)

Tags

  • Audio Generation
  • Music Generation

Stats

  • citations0
  • S2 citationsβ€”
  • github stars0
  • HF likes0
  • heat score0.00
  • arxiv keyliang2025deepsound

Related papers