Enhancing Multimodal LLM For Detailed And Accurate Video Captioning Using Multi-round Preference Optimization
2024 · Changli Tang, Yixuan Li, Yudong Yang, et al.
Abstract
Videos contain a wealth of information, and generating detailed and accurate descriptions in natural language is a key aspect of video understanding. In this paper, we present video-SALMONN 2, an advanced audio-visual large language model (LLM) with low-rank adaptation (LoRA) designed for enhanced video (with paired audio) captioning through direct preference optimization (DPO). We propose new metrics to evaluate the completeness and accuracy of video descriptions, which are optimized using DPO. To further improve training, we introduce a novel multi-round DPO (mrDPO) approach, which involves periodically updating the DPO reference model, merging and re-initializing the LoRA module as a proxy for parameter updates after each training round (1,000 steps), and incorporating guidance from ground-truth video captions to stabilize the process. To address potential catastrophic forgetting of non-captioning abilities due to mrDPO, we propose rebirth tuning, which finetunes the pre-DPO LLM b
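The round structure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses the standard DPO loss for a single preference pair and toy weight matrices stored as nested lists. The names `dpo_loss`, `merge_and_reinit`, `multi_round`, `train_round`, and `scale`, the value `beta=0.1`, and the initialization (small Gaussian `lora_a`, zero `lora_b`, a common LoRA convention) are all assumptions made for illustration. The guidance from ground-truth captions is left out because the abstract does not specify its form.

```python
import copy
import math
import random


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair: -log(sigmoid(beta * margin))."""
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    x = beta * margin
    # Numerically stable form of log(1 + exp(-x)).
    if x > 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]


def merge_and_reinit(weight, lora_a, lora_b, scale=1.0, rng=random):
    """Fold the LoRA update into the base weight, then start a fresh LoRA module.

    Shapes: weight is (out x in), lora_b is (out x rank), lora_a is (rank x in).
    """
    delta = matmul(lora_b, lora_a)
    merged = [[w + scale * d for w, d in zip(w_row, d_row)]
              for w_row, d_row in zip(weight, delta)]
    rank, n_in, n_out = len(lora_a), len(lora_a[0]), len(lora_b)
    # Assumed init: small Gaussian A, zero B, so the fresh module is a no-op.
    new_a = [[rng.gauss(0.0, 0.01) for _ in range(n_in)] for _ in range(rank)]
    new_b = [[0.0] * rank for _ in range(n_out)]
    return merged, new_a, new_b


def multi_round(weight, lora_a, lora_b, train_round, num_rounds=3):
    """Run several DPO rounds, refreshing the reference model after each one.

    train_round is a hypothetical callback that trains only the LoRA matrices
    for one round (1,000 steps in the abstract) against the frozen reference.
    """
    reference = copy.deepcopy(weight)
    for _ in range(num_rounds):
        lora_a, lora_b = train_round(weight, lora_a, lora_b, reference)
        weight, lora_a, lora_b = merge_and_reinit(weight, lora_a, lora_b)
        reference = copy.deepcopy(weight)  # the merged model becomes the new reference
    return weight, reference
```

Because `lora_b` is reset to zeros, the model at the start of each round is identical to the refreshed reference, which is the usual starting condition for DPO.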