One Trajectory, One Token: Grounded Video Tokenization Via Panoptic Sub-object Trajectory
2025 · Chenhao Zheng, Jieyu Zhang, Mohammadreza Salehi, et al.
Abstract
Effective video tokenization is critical for scaling transformer models for long videos. Current approaches tokenize videos using space-time patches, leading to excessive tokens and computational inefficiencies. The best token reduction strategies degrade performance and barely reduce the number of tokens when the camera moves. We introduce grounded video tokenization, a paradigm that organizes tokens based on panoptic sub-object trajectories rather than fixed patches. Our method aligns with fundamental perceptual principles, ensuring that tokenization reflects scene complexity rather than video duration. We propose TrajViT, a video encoder that extracts object trajectories and converts them into semantically meaningful tokens, significantly reducing redundancy while maintaining temporal coherence. Trained with contrastive learning, TrajViT significantly outperforms space-time ViT (ViT3D) across multiple video understanding benchmarks, e.g., TrajViT outperforms ViT3D by a large margin
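The abstract's contrast between patch-based and trajectory-based token counts can be illustrated with a little arithmetic. The sketch below is not the paper's code: the function names, the 2-frame tubelet, the 16-pixel patch, and the figure of 50 trajectories are all assumptions chosen only for illustration. It shows that fixed space-time patching yields a token count that grows linearly with clip length, whereas a one-token-per-trajectory scheme yields a count tied to how many sub-object trajectories the scene contains.

```python
def patch_token_count(num_frames, height, width, tubelet_frames=2, patch_size=16):
    """Tokens from fixed space-time patching (tubelet and patch sizes are assumed)."""
    return (num_frames // tubelet_frames) * (height // patch_size) * (width // patch_size)


def trajectory_token_count(num_trajectories):
    """One token per sub-object trajectory; clip length does not enter the count."""
    return num_trajectories


# Hypothetical scene with 50 tracked sub-object trajectories at 224x224 resolution.
for frames in (16, 64):
    print(frames, patch_token_count(frames, 224, 224), trajectory_token_count(50))
```

In practice a longer clip may also introduce new objects and therefore new trajectories, so the trajectory count is constant here only because the scene is assumed to be fixed.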
Related papers
- VQToken: Neural Discrete Token Representation Learning For Extreme Token Reduction In Video Large Language Models (2025) · 0.00
- CenterCLIP: Token Clustering For Efficient Text-video Retrieval (2022) · 15.54
- Video-language Alignment Via Spatio-temporal Graph Transformer (2024) · 0.00
- TempMe: Video Temporal Token Merging For Efficient Text-video Retrieval (2024) · 2.86
- T2VParser: Adaptive Decomposition Tokens For Partial Alignment In Text To Video Retrieval (2025) · 0.95
- TS2-Net: Token Shift And Selection Transformer For Text-video Retrieval (2022) · 15.51
- Understanding The Effect Of Using Semantically Meaningful Tokens For Visual Representation Learning (2024) · 0.00
- E-ViLM: Efficient Video-language Model Via Masked Video Modeling With Semantic Vector-quantized Tokenizer (2023) · 0.00