VQToken: Neural Discrete Token Representation Learning For Extreme Token Reduction In Video Large Language Models
2025 · Haichao Zhang, Yun Fu
Abstract
Token-based video representation has emerged as a promising approach for enabling large language models (LLMs) to interpret video content. However, existing token reduction techniques, such as pruning and merging, often disrupt essential positional embeddings and rely on continuous visual tokens sampled from nearby pixels with similar spatial-temporal locations. Because they remove only a small fraction of tokens, these methods still produce relatively lengthy continuous sequences, which fall short of the extreme compression required to balance computational efficiency and token count in video LLMs. In this paper, we introduce the novel task of Extreme Short Token Reduction, which aims to represent entire videos using a minimal set of discrete tokens. We propose VQToken, a neural discrete token representation framework that (i) applies adaptive vector quantization to continuous ViT embeddings to learn a compact codebook and (ii) preserves spatial-temporal positions via a token hash function b
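To illustrate the general idea of step (i), the following is a minimal sketch of vector quantization over token embeddings. It is not the authors' implementation: it swaps in plain k-means with a fixed codebook size in place of the adaptive quantization the abstract names, and it uses random vectors in place of real ViT embeddings. The function name `kmeans_codebook`, the sizes, and the `index_map` array are all hypothetical choices made for this example.

```python
import numpy as np

def kmeans_codebook(tokens, k, iters=10, seed=0):
    """Plain k-means over token embeddings (an assumed stand-in for
    adaptive vector quantization). Returns (codebook [k, d], assign [n])."""
    rng = np.random.default_rng(seed)
    # Initialise the codebook from k distinct tokens.
    codebook = tokens[rng.choice(len(tokens), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every token to every code.
        dists = ((tokens[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = tokens[assign == j]
            if len(members) > 0:  # keep the old code if a cluster is empty
                codebook[j] = members.mean(axis=0)
    # Recompute assignments so they match the final codebook.
    dists = ((tokens[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return codebook, dists.argmin(axis=1)

# Hypothetical sizes: 4 frames, 16 patches per frame, 8-dim embeddings.
frames, patches, dim, k = 4, 16, 8, 6
tokens = np.random.default_rng(1).normal(size=(frames * patches, dim))

codebook, assign = kmeans_codebook(tokens, k)
# Code index chosen at each (frame, patch) position.
index_map = assign.reshape(frames, patches)
# Approximate reconstruction of the original sequence from the codebook.
reconstructed = codebook[index_map.reshape(-1)]
```

In this sketch the 64 continuous tokens are summarized by 6 codebook vectors, and `index_map` records which code each original position was assigned to. How VQToken itself selects the codebook size and encodes positions is not shown here.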
Related papers
- E-ViLM: Efficient Video-Language Model Via Masked Video Modeling With Semantic Vector-Quantized Tokenizer (2023) 0.00
- One Trajectory, One Token: Grounded Video Tokenization Via Panoptic Sub-Object Trajectory (2025) 0.00
- Going Down Memory Lane: Scaling Tokens For Video Stream Understanding With Dynamic KV-Cache Memory (2026) 0.00
- CenterCLIP: Token Clustering For Efficient Text-Video Retrieval (2022) 15.54
- LiteVL: Efficient Video-Language Learning With Enhanced Spatial-Temporal Modeling (2022) 6.34
- TempMe: Video Temporal Token Merging For Efficient Text-Video Retrieval (2024) 2.86
- Less Is More: ClipBERT For Video-And-Language Learning Via Sparse Sampling (2021) 25.76
- Revitalize Region Feature For Democratizing Video-Language Pre-Training Of Retrieval (2022) 2.72