Meta-Personalizing Vision-Language Models to Find Named Instances in Video
2023 · Chun-Hsiao Yeh, Bryan Russell, Josef Sivic, et al.
Abstract
Large-scale vision-language models (VLMs) have shown impressive results for language-guided search applications. While these models allow category-level queries, they currently struggle with personalized searches for moments in a video where a specific object instance such as "My dog Biscuit" appears. We present the following three contributions to address this problem. First, we describe a method to meta-personalize a pre-trained VLM, i.e., learning how to learn to personalize a VLM at test time to search in video. Our method extends the VLM's token vocabulary by learning novel word embeddings specific to each instance. To capture only instance-specific features, we represent each instance embedding as a combination of shared and learned global category features. Second, we propose to learn such personalization without explicit human supervision. Our approach automatically identifies moments of named visual instances in video using transcripts and vision-language similarity in the VL
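The idea of representing an instance embedding as a combination of shared category features can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`combine`, `cosine`, `rank_frames`), the 3-dimensional vectors, and the weights are hypothetical, and the sketch assumes the instance embedding is compared directly with frame embeddings, whereas the abstract describes a new word embedding added to the VLM's token vocabulary. In practice the per-instance weights would be learned; here they are fixed by hand.

```python
import math

def combine(weights, category_features):
    """Build an instance embedding as a weighted sum of shared category features.

    Assumption: a plain linear combination; the weights are the only
    instance-specific parameters, the feature vectors are shared.
    """
    dim = len(category_features[0])
    return [
        sum(w * feat[d] for w, feat in zip(weights, category_features))
        for d in range(dim)
    ]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_frames(query, frame_embeddings):
    """Return frame indices sorted from most to least similar to the query."""
    scores = [cosine(query, frame) for frame in frame_embeddings]
    return sorted(range(len(frame_embeddings)), key=lambda i: scores[i], reverse=True)

# Hypothetical shared category features (toy 3-D vectors).
category_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# Hypothetical per-instance weights for "Biscuit" (hand-picked, not learned).
biscuit_weights = [0.8, 0.2]
biscuit = combine(biscuit_weights, category_features)

# Hypothetical frame embeddings from a video.
frames = [[0.0, 0.0, 1.0], [0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
ranking = rank_frames(biscuit, frames)
print(ranking)
```

Sharing the category features across instances means each new instance only needs a small weight vector, which is one plausible reading of why the abstract says the combination captures only instance-specific features.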
Related papers
- Personalization Toolkit: Training Free Personalization of Large Vision Language Models (2026) · 0.00
- "This Is My Unicorn, Fluffy": Personalizing Frozen Vision-Language Representations (2022) · 12.81
- Improving Personalized Search with Regularized Low-Rank Parameter Updates (2025) · 0.00
- Context-Enhanced Video Moment Retrieval with Large Language Models (2024) · 5.84
- V-Agent: An Interactive Video Search System Using Vision-Language Models (2025) · 0.00
- E-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized Tokenizer (2023) · 0.00
- LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal Modeling (2022) · 6.34
- A Little More Like This: Text-to-Image Retrieval with Vision-Language Models Using Relevance Feedback (2025) · 0.00