LazyVLM: Neuro-symbolic Approach To Video Analytics
2025 · Xiangru Jian, Wei Pang, Zhengyuan Dong, et al.
Abstract
Current video analytics approaches face a fundamental trade-off between flexibility and efficiency. End-to-end Vision Language Models (VLMs) often struggle with long-context processing and incur high computational costs, while neural-symbolic methods depend heavily on manual labeling and rigid rule design. In this paper, we introduce LazyVLM, a neuro-symbolic video analytics system that provides a user-friendly query interface similar to VLMs, while addressing their scalability limitation. LazyVLM enables users to effortlessly drop in video data and specify complex multi-frame video queries using a semi-structured text interface for video analytics. To address the scalability limitations of VLMs, LazyVLM decomposes multi-frame video queries into fine-grained operations and offloads the bulk of the processing to efficient relational query execution and vector similarity search. We demonstrate that LazyVLM provides a robust, efficient, and user-friendly solution for querying open-domain
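The decomposition described in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes a query of the form "subject, predicate, object", resolves the subject and object by cosine similarity over toy embeddings, and then filters relationship facts with a relational query in SQLite. The tables, embeddings, threshold, and function names (`similar_objects`, `find_frames`) are all hypothetical, and the sketch handles only a single-frame relationship rather than a full multi-frame query.

```python
import math
import sqlite3

# Toy per-frame detections: (frame_id, object_id, embedding).
# All values are made up; a real system would get embeddings from an encoder.
DETECTIONS = [
    (0, "o1", (0.9, 0.1, 0.0)),  # car-like
    (0, "o2", (0.1, 0.9, 0.0)),  # person-like
    (1, "o1", (0.8, 0.2, 0.1)),
    (1, "o2", (0.1, 0.9, 0.0)),
    (2, "o2", (0.1, 0.9, 0.0)),
    (2, "o3", (0.0, 0.1, 0.9)),  # dog-like
]

# Toy relationship facts: (frame_id, subject_id, predicate, object_id).
RELATIONS = [
    (0, "o2", "near", "o1"),
    (2, "o2", "near", "o3"),
]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similar_objects(query_vec, threshold):
    """Vector search step: ids of objects whose embedding matches the query."""
    return sorted(
        {oid for _, oid, emb in DETECTIONS if cosine(query_vec, emb) >= threshold}
    )


def find_frames(subject_vec, predicate, object_vec, threshold=0.8):
    """Relational step: frames where a matching subject relates to a matching object."""
    subjects = similar_objects(subject_vec, threshold)
    objects = similar_objects(object_vec, threshold)
    if not subjects or not objects:
        return []
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE rel (frame INTEGER, subj TEXT, pred TEXT, obj TEXT)")
    conn.executemany("INSERT INTO rel VALUES (?, ?, ?, ?)", RELATIONS)
    sql = (
        "SELECT DISTINCT frame FROM rel WHERE pred = ? "
        f"AND subj IN ({','.join('?' * len(subjects))}) "
        f"AND obj IN ({','.join('?' * len(objects))}) ORDER BY frame"
    )
    rows = conn.execute(sql, [predicate, *subjects, *objects]).fetchall()
    conn.close()
    return [frame for (frame,) in rows]


# Hypothetical text-query embeddings for "person", "car", and "dog".
PERSON = (0.0, 1.0, 0.0)
CAR = (1.0, 0.0, 0.0)
DOG = (0.0, 0.0, 1.0)

print(find_frames(PERSON, "near", CAR))
```

In this sketch the only per-query model-dependent work is the embedding comparison; the frame filtering itself is an ordinary SQL query, which is the kind of offloading the abstract refers to.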
Related papers
- LOVO: Efficient Complex Object Query In Large-scale Video Datasets (2025) · 2.26
- TV-RAG: A Temporal-aware And Semantic Entropy-weighted Framework For Long Video Retrieval And Understanding (2025) · 2.86
- V-agent: An Interactive Video Search System Using Vision-language Models (2025) · 0.00
- Verve: Versatile Retrieval For Videos Via Unified Embeddings (2026) · 0.00
- E-ViLM: Efficient Video-language Model Via Masked Video Modeling With Semantic Vector-quantized Tokenizer (2023) · 0.00
- Meta-personalizing Vision-language Models To Find Named Instances In Video (2023) · 8.60
- Vision-language Models Learn Super Images For Efficient Partially Relevant Video Retrieval (2023) · 3.58
- LiteVL: Efficient Video-language Learning With Enhanced Spatial-temporal Modeling (2022) · 6.34