Realizing Video Summarization From The Path Of Language-based Semantic Understanding
2024 · Kuan-Chen Mu, Zhi-Yi Chin, Wei-Chen Chiu
Abstract
The recent development of Video-based Large Language Models (VideoLLMs) has significantly advanced video summarization by aligning video features and, in some cases, audio features with Large Language Models (LLMs). Each of these VideoLLMs possesses unique strengths and weaknesses. Many recent methods rely on extensive, resource-intensive fine-tuning to overcome the limitations of these models. In this work, we observe that the strengths of one VideoLLM can complement the weaknesses of another. Leveraging this insight, we propose a novel video summarization framework inspired by the Mixture of Experts (MoE) paradigm, which operates as an inference-time algorithm without requiring any form of fine-tuning. Our approach integrates multiple VideoLLMs to generate comprehensive and coherent textual summaries. It effectively combines visual and audio content, provides detailed background descriptions, and excels at identifying keyframes, which enables more semantically mea