ViLL-E: Video LLM Embeddings for Retrieval
2026 · Rohit Gupta, Jayakrishnan Unnikrishnan, Fan Fei, et al.
Abstract
Video Large Language Models (VideoLLMs) excel at video understanding tasks where outputs are textual, such as Video Question Answering and Video Captioning. However, they underperform specialized embedding-based models in retrieval tasks, such as Text-to-Video Retrieval and Moment Retrieval. We introduce ViLL-E (Video-LLM-Embed), a unified VideoLLM architecture endowed with a novel embedding generation mechanism that allows the model to "think longer" for complex videos and stop early for easy ones. We train this model with a three-stage training methodology combining generative and contrastive learning: initial large-scale pre-training with video-caption pairs; followed by continual training on a smaller, detailed-caption dataset; and concluding with task-specific fine-tuning on a novel multi-task dataset covering Video QA, Temporal Localization, Video Retrieval, and Video-Text Matching. Our model significantly improves temporal localization (on avg. 7% over other VideoLLMs) and video
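The abstract says training combines generative and contrastive learning but does not give the objective. As a rough illustration only, the sketch below assumes a symmetric InfoNCE loss over matched video/text embeddings, added to a generative (next-token) loss with a scalar weight. The function names, the temperature of 0.07, and `contrastive_weight` are hypothetical choices for this example and are not taken from the paper.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def info_nce(video_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: pair i of video_embs matches pair i of text_embs.

    The temperature value is an assumed default, not a value from the paper.
    """
    videos = [_normalize(v) for v in video_embs]
    texts = [_normalize(t) for t in text_embs]
    # Similarity logits: rows index videos, columns index texts.
    sims = [
        [sum(a * b for a, b in zip(v, t)) / temperature for t in texts]
        for v in videos
    ]
    n = len(videos)
    # Video-to-text direction: the correct text for video i is column i.
    v2t = -sum(_log_softmax(sims[i])[i] for i in range(n)) / n
    # Text-to-video direction: same computation on the transposed matrix.
    sims_t = [list(col) for col in zip(*sims)]
    t2v = -sum(_log_softmax(sims_t[i])[i] for i in range(n)) / n
    return 0.5 * (v2t + t2v)


def combined_loss(generative_loss, video_embs, text_embs, contrastive_weight=1.0):
    """Hypothetical joint objective: generative loss plus weighted contrastive loss."""
    return generative_loss + contrastive_weight * info_nce(video_embs, text_embs)
```

With correctly paired embeddings the contrastive term is close to zero, and it grows when the pairs are mismatched, which is the signal that pulls matching video and text embeddings together during training.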
Related papers
- Vidvec: Unlocking Video MLLM Embeddings for Video-Text Retrieval (2026)
- Verve: Versatile Retrieval for Videos via Unified Embeddings (2026)
- Context-Enhanced Video Moment Retrieval with Large Language Models (2024)
- WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM (2025)
- MERLIN: Multimodal Embedding Refinement via LLM-based Iterative Navigation for Text-Video Retrieval-Rerank Pipeline (2024)
- VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents (2025)
- LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal Modeling (2022)
- NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models (2024)