Prior Knowledge Integration Via LLM Encoding And Pseudo Event Regulation For Video Moment Retrieval
2024 · Yiyang Jiang, Wengyu Zhang, Xulu Zhang, et al.
Abstract
In this paper, we investigate the feasibility of leveraging large language models (LLMs) for integrating general knowledge and incorporating pseudo-events as priors for temporal content distribution in video moment retrieval (VMR) models. The motivation behind this study arises from the limitations of using LLMs as decoders for generating discrete textual descriptions, which hinders their direct application to continuous outputs like salience scores and inter-frame embeddings that capture inter-frame relations. To overcome these limitations, we propose utilizing LLM encoders instead of decoders. Through a feasibility study, we demonstrate that LLM encoders effectively refine inter-concept relations in multimodal embeddings, even without being trained on textual embeddings. We also show that the refinement capability of LLM encoders can be transferred to other embeddings, such as BLIP and T5, as long as these embeddings exhibit similar inter-concept similarity patterns to CLIP embedding
Related papers
- Context-enhanced Video Moment Retrieval With Large Language Models (2024)
- Unified Interactive Multimodal Moment Retrieval Via Cascaded Embedding-reranking And Temporal-aware Score Fusion (2025)
- Vidvec: Unlocking Video MLLM Embeddings For Video-text Retrieval (2026)
- Vill-e: Video LLM Embeddings For Retrieval (2026)
- Hybrid-learning Video Moment Retrieval Across Multi-domain Labels (2024)
- Bidirectional Likelihood Estimation With Multi-modal Large Language Models For Text-video Retrieval (2025)
- Verve: Versatile Retrieval For Videos Via Unified Embeddings (2026)
- MERLIN: Multimodal Embedding Refinement Via LLM-based Iterative Navigation For Text-video Retrieval-rerank Pipeline (2024)