Awesome Video Understanding
Video Understanding is one of the most active areas in Awesome Computer Vision, with 1,343 papers in this collection evaluated on datasets such as YouTube-VOS, DAVIS 2017, and Kinetics. A strong starting point is "A Survey On Visual Transformer".
Datasets & benchmarks
Key papers
- A Survey On Visual Transformer (2020) | Kai Han, Yunhe Wang, Hanting Chen, et al. | 26.80
- Learning Salient Boundary Feature For Anchor-free Temporal Action Localization (2021) | Chuming Lin, Chengming Xu, Donghao Luo, et al. | 22.47
- TubeTK: Adopting Tubes To Track Multi-object In A One-step Training Model (2020) | Bo Pang, Yizhuo Li, Yifan Zhang, et al. | 22.36
- How To Train Your Deep Multi-object Tracker (2019) | Yihong Xu, Aljosa Osep, Yutong Ban, et al. | 22.35
- Deep Affinity Network For Multiple Object Tracking (2018) | Shijie Sun, Naveed Akhtar, Huansheng Song, et al. | 22.06
- Tracking Without Bells And Whistles (2019) | Philipp Bergmann, Tim Meinhardt, Laura Leal-Taixe | 22.04
- Collaborative Video Object Segmentation By Foreground-background Integration (2020) | Zongxin Yang, Yunchao Wei, Yi Yang | 22.02
- RANet: Ranking Attention Network For Fast Video Object Segmentation (2019) | Ziqin Wang, Jun Xu, Li Liu, et al. | 21.97
- Background Suppression Network For Weakly-supervised Temporal Action Localization (2019) | Pilhyeon Lee, Youngjung Uh, Hyeran Byun | 21.84
- VadCLIP: Adapting Vision-language Models For Weakly Supervised Video Anomaly Detection (2023) | Peng Wu, Xuerong Zhou, Guansong Pang, et al. | 20.92
- Video Object Segmentation Using Space-time Memory Networks (2019) | Seoung Wug Oh, Joon-Young Lee, Ning Xu, et al. | 20.78
- SwiftNet: Real-time Video Object Segmentation (2021) | Haochen Wang, Xiaolong Jiang, Haibing Ren, et al. | 20.57
- Relaxed Transformer Decoders For Direct Action Proposal Generation (2021) | Jing Tan, Jiaqi Tang, Limin Wang, et al. | 20.49
- MOTS: Multi-object Tracking And Segmentation (2019) | Paul Voigtlaender, Michael Krause, Aljosa Osep, et al. | 20.41
- YouTube-VOS: Sequence-to-sequence Video Object Segmentation (2018) | Ning Xu, Linjie Yang, Yuchen Fan, et al. | 20.04
- 3C-Net: Category Count And Center Loss For Weakly-supervised Action Localization (2019) | Sanath Narayan, Hisham Cholakkal, Fahad Shahbaz Khan, et al. | 19.77
- Quasi-dense Similarity Learning For Multiple Object Tracking (2020) | Jiangmiao Pang, Linlu Qiu, Xia Li, et al. | 19.58
- MHFormer: Multi-hypothesis Transformer For 3D Human Pose Estimation (2021) | Wenhao Li, Hong Liu, Hao Tang, et al. | 19.47
- Bottom-up Temporal Action Localization With Mutual Regularization (2020) | Peisen Zhao, Lingxi Xie, Chen Ju, et al. | 19.32
- Single Shot Temporal Action Detection (2017) | Tianwei Lin, Xu Zhao, Zheng Shou | 19.29
- Action Segmentation With Joint Self-supervised Temporal Domain Adaptation (2020) | Min-Hung Chen, Baopu Li, Yingze Bao, et al. | 18.98
- Learning To Estimate Hidden Motions With Global Motion Aggregation (2021) | Shihao Jiang, Dylan Campbell, Yao Lu, et al. | 18.71
- Cross-view Tracking For Multi-human 3D Pose Estimation At Over 100 FPS (2020) | Long Chen, Haizhou Ai, Rui Chen, et al. | 18.65
- DanceTrack: Multi-object Tracking In Uniform Appearance And Diverse Motion (2021) | Peize Sun, Jinkun Cao, Yi Jiang, et al. | 18.56
- M2TR: Multi-modal Multi-scale Transformers For Deepfake Detection (2021) | Junke Wang, Zuxuan Wu, Wenhao Ouyang, et al. | 18.46
- Foreground Segmentation Using A Triplet Convolutional Neural Network For Multiscale Feature Encoding (2018) | Long Ang Lim, Hacer Yalim Keles | 18.15
- Hourglass Tokenizer For Efficient Transformer-based 3D Human Pose Estimation (2023) | Wenhao Li, Mengyuan Liu, Hong Liu, et al. | 17.81
- Spatial-temporal Relation Networks For Multi-object Tracking (2019) | Jiarui Xu, Yue Cao, Zheng Zhang, et al. | 17.66
- Few-shot Video Classification Via Temporal Alignment (2019) | Kaidi Cao, Jingwei Ji, Zhangjie Cao, et al. | 17.63
- Humans In 4D: Reconstructing And Tracking Humans With Transformers (2023) | Shubham Goel, Georgios Pavlakos, Jathushan Rajasegaran, et al. | 17.58
- Blazingly Fast Video Object Segmentation With Pixel-wise Metric Learning (2018) | Yuhua Chen, Jordi Pont-Tuset, Alberto Montes, et al. | 17.46
- Relation Distillation Networks For Video Object Detection (2019) | Jiajun Deng, Yingwei Pan, Ting Yao, et al. | 17.32
- VRSTC: Occlusion-free Video Person Re-identification (2019) | Ruibing Hou, Bingpeng Ma, Hong Chang, et al. | 17.14
- Learning To Track With Object Permanence (2021) | Pavel Tokmakov, Jie Li, Wolfram Burgard, et al. | 17.11
- Temporal-relational CrossTransformers For Few-shot Action Recognition (2021) | Toby Perrett, Alessandro Masullo, Tilo Burghardt, et al. | 17.06
- Spatial Temporal Transformer Network For Skeleton-based Action Recognition (2020) | Chiara Plizzari, Marco Cannici, Matteo Matteucci | 17.06
- SG-Net: Spatial Granularity Network For One-stage Video Instance Segmentation (2021) | Dongfang Liu, Yiming Cui, Wenbo Tan, et al. | 17.02
- DMM-Net: Differentiable Mask-matching Network For Video Object Segmentation (2019) | Xiaohui Zeng, Renjie Liao, Li Gu, et al. | 17.02
- Context-aware RCNN: A Baseline For Action Detection In Videos (2020) | Jianchao Wu, Zhanghui Kuang, Limin Wang, et al. | 16.93
- Modular Interactive Video Object Segmentation: Interaction-to-mask, Propagation And Difference-aware Fusion (2021) | Ho Kei Cheng, Yu-Wing Tai, Chi-Keung Tang | 16.86
- ArtTrack: Articulated Multi-person Tracking In The Wild (2016) | Eldar Insafutdinov, Mykhaylo Andriluka, Leonid Pishchulin, et al. | 16.82
- SCSampler: Sampling Salient Clips From Video For Efficient Action Recognition (2019) | Bruno Korbar, Du Tran, Lorenzo Torresani | 16.77
- Multi-granularity Generator For Temporal Action Proposal (2018) | Yuan Liu, Lin Ma, Yifeng Zhang, et al. | 16.67
- Video Re-localization (2018) | Yang Feng, Lin Ma, Wei Liu, et al. | 16.62
- STA: Spatial-temporal Attention For Large-scale Video-based Person Re-identification (2018) | Yang Fu, Xiaoyang Wang, Yunchao Wei, et al. | 16.61
- Leveraging Photometric Consistency Over Time For Sparsely Supervised Hand-object Reconstruction (2020) | Yana Hasson, Bugra Tekin, Federica Bogo, et al. | 16.53
- Enhanced Spatio-temporal Interaction Learning For Video Deraining: A Faster And Better Framework (2021) | Kaihao Zhang, Dongxu Li, Wenhan Luo, et al. | 16.51
- A Generative Appearance Model For End-to-end Video Object Segmentation (2018) | Joakim Johnander, Martin Danelljan, Emil Brissman, et al. | 16.49
- Frozen CLIP Models Are Efficient Video Learners (2022) | Ziyi Lin, Shijie Geng, Renrui Zhang, et al. | 16.47
- End-to-end Referring Video Object Segmentation With Multimodal Transformers (2021) | Adam Botach, Evgenii Zheltonozhskii, Chaim Baskin | 16.45
- Video Panoptic Segmentation (2020) | Dahun Kim, Sanghyun Woo, Joon-Young Lee, et al. | 16.19
- FuseFormer: Fusing Fine-grained Information In Transformers For Video Inpainting (2021) | Rui Liu, Hanming Deng, Yangyi Huang, et al. | 16.19
- Efficient Regional Memory Network For Video Object Segmentation (2021) | Haozhe Xie, Hongxun Yao, Shangchen Zhou, et al. | 16.19
- Kernelized Memory Network For Video Object Segmentation (2020) | Hongje Seong, Junhyuk Hyun, Euntai Kim | 16.14
- Weakly-supervised Action Localization With Background Modeling (2019) | Phuc Xuan Nguyen, Deva Ramanan, Charless C. Fowlkes | 16.10
- MAMBA: Multi-level Aggregation Via Memory Bank For Video Object Detection (2024) | Guanxiong Sun, Yang Hua, Guosheng Hu, et al. | 16.09
- An Efficient And Layout-independent Automatic License Plate Recognition System Based On The YOLO Detector (2019) | Rayson Laroca, Luiz A. Zanlorensi, Gabriel R. Gonçalves, et al. | 16.05
- Video Transformers: A Survey (2022) | Javier Selva, Anders S. Johansen, Sergio Escalera, et al. | 15.98
- Occluded Video Instance Segmentation: A Benchmark (2021) | Jiyang Qi, Yan Gao, Yao Hu, et al. | 15.95
- SODFormer: Streaming Object Detection With Transformer Using Events And Frames (2023) | Dianze Li, Jianing Li, Yonghong Tian | 15.94