Awesome 3D Vision
3D Vision is one of the most active areas in Awesome Computer Vision, with 1,899 papers in this collection evaluated on datasets such as ImageNet, COCO, and KITTI. A strong starting point is "Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows".
Datasets & benchmarks
Key papers
- Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows (2021). Ze Liu, Yutong Lin, Yue Cao, et al. 38.40
- Pyramid Vision Transformer: A Versatile Backbone For Dense Prediction Without Convolutions (2021). Wenhai Wang, Enze Xie, Xiang Li, et al. 33.76
- Vision Transformers For Dense Prediction (2021). René Ranftl, Alexey Bochkovskiy, Vladlen Koltun. 31.27
- Tokens-to-Token ViT: Training Vision Transformers From Scratch On ImageNet (2021). Li Yuan, Yunpeng Chen, Tao Wang, et al. 30.72
- CrossViT: Cross-attention Multi-scale Vision Transformer For Image Classification (2021). Chun-Fu Chen, Quanfu Fan, Rameswar Panda. 29.53
- HigherHRNet: Scale-aware Representation Learning For Bottom-up Human Pose Estimation (2019). Bowen Cheng, Bin Xiao, Jingdong Wang, et al. 27.97
- AANet: Adaptive Aggregation Network For Efficient Stereo Matching (2020). Haofei Xu, Juyong Zhang. 25.74
- Unified Vision-language Pre-training For Image Captioning And VQA (2019). Luowei Zhou, Hamid Palangi, Lei Zhang, et al. 25.70
- Rethinking RGB-D Salient Object Detection: Models, Data Sets, And Large-scale Benchmarks (2019). Deng-Ping Fan, Zheng Lin, Jia-Xing Zhao, et al. 24.88
- ACNet: Attention Based Network To Exploit Complementary Features For RGBD Semantic Segmentation (2019). Xinxin Hu, Kailun Yang, Lei Fei, et al. 23.95
- pixelNeRF: Neural Radiance Fields From One Or Few Images (2020). Alex Yu, Vickie Ye, Matthew Tancik, et al. 23.45
- Learning Depth With Convolutional Spatial Propagation Network (2018). Xinjing Cheng, Peng Wang, Ruigang Yang. 23.40
- Objects Are Different: Flexible Monocular 3D Object Detection (2021). Yunpeng Zhang, Jiwen Lu, Jie Zhou. 23.03
- CSWin Transformer: A General Vision Transformer Backbone With Cross-shaped Windows (2021). Xiaoyi Dong, Jianmin Bao, Dongdong Chen, et al. 22.72
- Multi-scale Vision Longformer: A New Vision Transformer For High-resolution Image Encoding (2021). Pengchuan Zhang, Xiyang Dai, Jianwei Yang, et al. 22.55
- Fast-MVSNet: Sparse-to-dense Multi-view Stereo With Learned Propagation And Gauss-Newton Refinement (2020). Zehao Yu, Shenghua Gao. 22.48
- Patch2Pix: Epipolar-guided Pixel-level Correspondences (2020). Qunjie Zhou, Torsten Sattler, Laura Leal-Taixe. 22.36
- Hand Keypoint Detection In Single Images Using Multiview Bootstrapping (2017). Tomas Simon, Hanbyul Joo, Iain Matthews, et al. 22.00
- Pose-guided Visible Part Matching For Occluded Person ReID (2020). Shang Gao, Jingya Wang, Huchuan Lu, et al. 21.99
- Vision Transformer With Deformable Attention (2022). Zhuofan Xia, Xuran Pan, Shiji Song, et al. 21.78
- CMT: Convolutional Neural Networks Meet Vision Transformers (2021). Jianyuan Guo, Kai Han, Han Wu, et al. 21.56
- MVSNeRF: Fast Generalizable Radiance Field Reconstruction From Multi-view Stereo (2021). Anpei Chen, Zexiang Xu, Fuqiang Zhao, et al. 21.41
- VinVL: Revisiting Visual Representations In Vision-language Models (2021). Pengchuan Zhang, Xiujun Li, Xiaowei Hu, et al. 21.40
- GA-Net: Guided Aggregation Net For End-to-end Stereo Matching (2019). Feihu Zhang, Victor Prisacariu, Ruigang Yang, et al. 21.15
- VadCLIP: Adapting Vision-language Models For Weakly Supervised Video Anomaly Detection (2023). Peng Wu, Xuerong Zhou, Guansong Pang, et al. 20.92
- UNISURF: Unifying Neural Implicit Surfaces And Radiance Fields For Multi-view Reconstruction (2021). Michael Oechsle, Songyou Peng, Andreas Geiger. 20.65
- AtLoc: Attention Guided Camera Localization (2019). Bing Wang, Changhao Chen, Chris Xiaoxuan Lu, et al. 20.25
- Voxel Transformer For 3D Object Detection (2021). Jiageng Mao, Yujing Xue, Minzhe Niu, et al. 19.97
- Multiview Detection With Feature Perspective Transformation (2020). Yunzhong Hou, Liang Zheng, Stephen Gould. 19.78
- Learning Multi-scene Absolute Pose Regression With Transformers (2021). Yoli Shavit, Ron Ferens, Yosi Keller. 19.67
- Learning Feature Pyramids For Human Pose Estimation (2017). Wei Yang, Shuang Li, Wanli Ouyang, et al. 19.60
- MHFormer: Multi-hypothesis Transformer For 3D Human Pose Estimation (2021). Wenhao Li, Hong Liu, Hao Tang, et al. 19.47
- PARE: Part Attention Regressor For 3D Human Body Estimation (2021). Muhammed Kocabas, Chun-Hao P. Huang, Otmar Hilliges, et al. 19.33
- Stronger, Fewer, & Superior: Harnessing Vision Foundation Models For Domain Generalized Semantic Segmentation (2023). Zhixiang Wei, Lin Chen, Yi Jin, et al. 19.28
- Scaling Local Self-attention For Parameter Efficient Visual Backbones (2021). Ashish Vaswani, Prajit Ramachandran, Aravind Srinivas, et al. 19.20
- Diverse Part Discovery: Occluded Person Re-identification With Part-aware Transformer (2021). Yulin Li, Jianfeng He, Tianzhu Zhang, et al. 19.19
- TokenPose: Learning Keypoint Tokens For Human Pose Estimation (2021). Yanjie Li, Shoukui Zhang, Zhicheng Wang, et al. 19.16
- What Do Single-view 3D Reconstruction Networks Learn? (2019). Maxim Tatarchenko, Stephan R. Richter, René Ranftl, et al. 19.13
- LocalViT: Analyzing Locality In Vision Transformers (2021). Yawei Li, Kai Zhang, Jiezhang Cao, et al. 19.08
- SwinFace: A Multi-task Transformer For Face Recognition, Expression Recognition, Age Estimation And Attribute Estimation (2023). Lixiong Qin, Mei Wang, Chao Deng, et al. 19.06
- CosyPose: Consistent Multi-view Multi-object 6D Pose Estimation (2020). Yann Labbé, Justin Carpentier, Mathieu Aubry, et al. 19.04
- Instance-level Image Retrieval Using Reranking Transformers (2021). Fuwen Tan, Jiangbo Yuan, Vicente Ordonez. 19.00
- LAVT: Language-aware Vision Transformer For Referring Image Segmentation (2021). Zhao Yang, Jiaqi Wang, Yansong Tang, et al. 18.95
- Image Matching Across Wide Baselines: From Paper To Practice (2020). Yuhe Jin, Dmytro Mishkin, Anastasiia Mishchuk, et al. 18.66
- Cross-view Tracking For Multi-human 3D Pose Estimation At Over 100 FPS (2020). Long Chen, Haizhou Ai, Rui Chen, et al. 18.65
- Practical Stereo Matching Via Cascaded Recurrent Network With Adaptive Correlation (2022). Jiankun Li, Peisen Wang, Pengfei Xiong, et al. 18.52
- Segment As Points For Efficient Online Multi-object Tracking And Segmentation (2020). Zhenbo Xu, Wei Zhang, Xiao Tan, et al. 18.40
- Pattern-affinitive Propagation Across Depth, Surface Normal And Semantic Segmentation (2019). Zhenyu Zhang, Zhen Cui, Chunyan Xu, et al. 18.35
- Person Re-identification By Camera Correlation Aware Feature Augmentation (2017). Ying-Cong Chen, Xiatian Zhu, Wei-Shi Zheng, et al. 18.33
- SA-Det3D: Self-attention Based Context-aware 3D Object Detection (2021). Prarthana Bhattacharyya, Chengjie Huang, Krzysztof Czarnecki. 18.29
- Single-view View Synthesis With Multiplane Images (2020). Richard Tucker, Noah Snavely. 18.12
- CricaVPR: Cross-image Correlation-aware Representation Learning For Visual Place Recognition (2024). Feng Lu, Xiangyuan Lan, Lijun Zhang, et al. 18.09
- RGB-D Salient Object Detection: A Survey (2020). Tao Zhou, Deng-Ping Fan, Ming-Ming Cheng, et al. 18.02
- Self-supervised Pretraining Of 3D Features On Any Point-cloud (2021). Zaiwei Zhang, Rohit Girdhar, Armand Joulin, et al. 17.98
- Far3D: Expanding The Horizon For Surround-view 3D Object Detection (2023). Xiaohui Jiang, Shuailin Li, Yingfei Liu, et al. 17.93
- Hourglass Tokenizer For Efficient Transformer-based 3D Human Pose Estimation (2023). Wenhao Li, Mengyuan Liu, Hong Liu, et al. 17.81
- CLIP-ReID: Exploiting Vision-language Model For Image Re-identification Without Concrete Text Labels (2022). Siyuan Li, Li Sun, Qingli Li. 17.74
- Monocular, One-stage, Regression Of Multiple 3D People (2020). Yu Sun, Qian Bao, Wu Liu, et al. 17.74
- Rotary Position Embedding For Vision Transformer (2024). Byeongho Heo, Song Park, Dongyoon Han, et al. 17.72
- Multi-scale High-resolution Vision Transformer For Semantic Segmentation (2021). Jiaqi Gu, Hyoukjun Kwon, Dilin Wang, et al. 17.68