MM-Narrator: Narrating Long-form Videos with Multimodal In-context Learning
2023 · Chaoyi Zhang, Kevin Lin, Zhengyuan Yang, et al.
Abstract
We present MM-Narrator, a novel system leveraging GPT-4 with multimodal in-context learning for the generation of audio descriptions (AD). Unlike previous methods that primarily focused on downstream fine-tuning with short video clips, MM-Narrator excels in generating precise audio descriptions for videos of extensive lengths, even beyond hours, in an autoregressive manner. This capability is made possible by the proposed memory-augmented generation process, which effectively utilizes both the short-term textual context and long-term visual memory through an efficient register-and-recall mechanism. These contextual memories compile pertinent past information, including storylines and character identities, ensuring that the generated audio descriptions stay story-coherent and character-centric. Maintaining the training-free design of MM-Narrator, we further propose a complexity-based demonstration selection strategy to largely enhance its multi-step reasoning capability vi
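The memory-augmented, autoregressive process described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. Every name here (`VisualMemory`, `narrate`, `select_demonstrations`, the `generate` callback standing in for a GPT-4 call, the clip dictionary layout, and the 0.8 similarity threshold) is a hypothetical choice made for illustration. The sketch also assumes that demonstration "complexity" is measured by the number of reasoning steps, which the abstract does not specify.

```python
import math
from collections import deque
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class VisualMemory:
    """Long-term visual memory: (embedding, character name) pairs."""
    entries: list = field(default_factory=list)

    def register(self, embedding, name):
        # Store a newly identified character for later recall.
        self.entries.append((embedding, name))

    def recall(self, query, threshold=0.8):
        # Return the best-matching name, or None if nothing is similar enough.
        # The threshold value is an arbitrary assumption.
        best_name, best_sim = None, threshold
        for embedding, name in self.entries:
            sim = cosine(query, embedding)
            if sim >= best_sim:
                best_name, best_sim = name, sim
        return best_name


def select_demonstrations(pool, k):
    """Pick the k demonstrations with the most reasoning steps.

    Assumption: complexity == number of steps in the demonstration.
    """
    return sorted(pool, key=lambda d: len(d["steps"]), reverse=True)[:k]


def narrate(clips, generate, context_size=3):
    """Generate one description per clip, autoregressively.

    `generate(clip, recent_descriptions, character_name)` is a placeholder
    for the language-model call; it is not a real API.
    """
    memory = VisualMemory()
    recent = deque(maxlen=context_size)  # short-term textual context
    descriptions = []
    for clip in clips:
        name = memory.recall(clip["embedding"])
        if name is None and clip.get("name"):
            # First sighting with a known identity: register it.
            memory.register(clip["embedding"], clip["name"])
            name = clip["name"]
        description = generate(clip, list(recent), name)
        recent.append(description)
        descriptions.append(description)
    return descriptions
```

In this sketch the deque bounds the short-term textual context to the last few descriptions, while the visual memory persists for the whole video. This mirrors the short-term/long-term split the abstract describes. A real system would replace `generate` with a model call and the toy embeddings with features from a visual encoder.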
Related papers
- End-to-end Generative Pretraining for Multimodal Video Captioning (2022) · 15.85
- Audio-Agent: Leveraging LLMs for Audio Generation, Editing and Composition (2024) · 0.00
- Classifier-guided Captioning Across Modalities (2025) · 0.00
- NExT-GPT: Any-to-any Multimodal LLM (2023) · 0.00
- Semantically Consistent Video-to-audio Generation Using Multimodal Language Large Model (2024) · 0.00
- DeepSound-V1: Start to Think Step-by-step in the Audio Generation from Videos (2025) · 0.00
- M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance (2025) · 0.00
- SpeechGPT: Empowering Large Language Models with Intrinsic Cross-modal Conversational Abilities (2023) · 16.59