E-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized Tokenizer
2023 · Jacob Zhiyuan Fang, Skyler Zheng, Vasu Sharma, et al.
Abstract
To build scalable models for challenging real-world tasks, it is important to learn from diverse, multi-modal data in various forms (e.g., videos, text, and images). Among the existing works, a plethora of them have focused on leveraging large but cumbersome cross-modal architectures. Regardless of their effectiveness, larger architectures unavoidably prevent the models from being extended to real-world applications, so building a lightweight VL architecture and an efficient learning schema is of great practical value. In this paper, we propose an Efficient Video-Language Model (dubbed as E-ViLM) and a masked video modeling (MVM) schema, assisted with a semantic vector-quantized tokenizer. In particular, our E-ViLM learns to reconstruct the semantic labels of masked video regions, produced by the pre-trained vector-quantized tokenizer, which discretizes the continuous visual signals into labels. We show that with our simple MVM task and regular VL pre-training modelings, our E-ViLM, de
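The masked video modeling idea described in the abstract can be illustrated with a toy sketch: a tokenizer maps continuous patch features to discrete labels, and the model is trained to predict those labels at masked positions. Everything here is an assumption made for illustration and is not taken from the paper: the nearest-neighbour codebook lookup stands in for the pre-trained semantic vector-quantized tokenizer, the random features and logits stand in for video patches and model outputs, and the names `quantize` and `masked_token_loss` are hypothetical.

```python
import math
import random


def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry.

    Hypothetical stand-in for a pre-trained vector-quantized tokenizer.
    """
    labels = []
    for feature in features:
        dists = [sum((a - b) ** 2 for a, b in zip(feature, code)) for code in codebook]
        labels.append(dists.index(min(dists)))
    return labels


def masked_token_loss(logits, labels, mask):
    """Mean cross-entropy between predicted logits and token labels,
    computed over masked positions only."""
    losses = []
    for row, label, is_masked in zip(logits, labels, mask):
        if not is_masked:
            continue
        top = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_norm = top + math.log(sum(math.exp(v - top) for v in row))
        losses.append(log_norm - row[label])
    return sum(losses) / len(losses) if losses else 0.0


random.seed(0)
dim, codebook_size, num_patches = 4, 8, 6

# Random stand-ins (assumptions): codebook, patch features, and model logits.
codebook = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(codebook_size)]
features = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
logits = [[random.gauss(0, 1) for _ in range(codebook_size)] for _ in range(num_patches)]

labels = quantize(features, codebook)             # discrete targets for every patch
mask = [i % 2 == 0 for i in range(num_patches)]   # mask every other patch
loss = masked_token_loss(logits, labels, mask)
print(labels, round(loss, 3))
```

In this sketch the loss ignores unmasked positions entirely, which reflects the general masked-prediction setup; the paper's actual masking strategy, tokenizer, and loss weighting may differ.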
Related papers
- VQToken: Neural Discrete Token Representation Learning for Extreme Token Reduction in Video Large Language Models (2025) 0.00
- Vidvec: Unlocking Video MLLM Embeddings for Video-text Retrieval (2026) 0.00
- LiteVL: Efficient Video-language Learning with Enhanced Spatial-temporal Modeling (2022) 6.34
- MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval (2022) 10.61
- VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents (2025) 0.00
- EVE: Efficient Vision-language Pre-training with Masked Prediction and Modality-aware MoE (2023) 7.50
- VLMAE: Vision-language Masked Autoencoder (2022) 0.00
- VLM2Vec: Training Vision-language Models for Massive Multimodal Embedding Tasks (2024) 0.00