ImageNet
Canonical
185 papers using it
32 HF downloads
0 HF likes
2016 first seen
Papers using ImageNet (162)
- Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows
- Tokens-to-token Vit: Training Vision Transformers From Scratch On Imagenet
- Rotate To Attend: Convolutional Triplet Attention Module
- Training Data-efficient Image Transformers & Distillation Through Attention
- Going Deeper With Image Transformers
- CMT: Convolutional Neural Networks Meet Vision Transformers
- Conformer: Local Features Coupling Global Representations For Visual Recognition
- Coatnet: Marrying Convolution And Attention For All Data Sizes
- Exploring Plain Vision Transformer Backbones For Object Detection
- Maskgit: Masked Generative Image Transformer
- Involution: Inverting The Inherence Of Convolution For Visual Recognition
- Visual Transformers: Token-based Image Representation And Processing For Computer Vision
- Fast-scnn: Fast Semantic Segmentation Network
- Scaling Local Self-attention For Parameter Efficient Visual Backbones
- Intriguing Properties Of Vision Transformers
- Self-supervised Learning From Images With A Joint-embedding Predictive Architecture
- Stand-alone Self-attention In Vision Models
- Masked Generative Distillation
- Towards Robust Vision Transformer
- Cvt: Introducing Convolutions To Vision Transformers
- Dynamic Convolutions: Exploiting Spatial Sparsity For Faster Inference
- Mobilevit: Light-weight, General-purpose, And Mobile-friendly Vision Transformer
- All Tokens Matter: Token Labeling For Training Better Vision Transformers
- Knowledge Distillation Via The Target-aware Transformer
- Hide-and-seek: Forcing A Network To Be Meticulous For Weakly-supervised Object And Action Localization
- Self-supervised Learning With Swin Transformers
- DPT: Deformable Patch-based Transformer For Visual Recognition
- Rethinking The Route Towards Weakly Supervised Object Localization
- Object Detection In Equirectangular Panorama
- Fastvit: A Fast Hybrid Vision Transformer Using Structural Reparameterization
- Pretraining Boosts Out-of-domain Robustness For Pose Estimation
- Efficient Self-supervised Vision Transformers for Representation Learning
- Supermix: Supervising The Mixing Data Augmentation
- Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition
- Quadtree Attention For Vision Transformers
- Dearkd: Data-efficient Early Knowledge Distillation For Vision Transformers
- Contnet: Why Not Use Convolution And Transformer At The Same Time?
- Unimatch V2: Pushing The Limit Of Semi-supervised Semantic Segmentation
- MVP: Multimodality-guided Visual Pre-training
- Training Vision Transformers With Only 2040 Images
- Vision Transformer Pruning
- Simultaneous Semantic Segmentation And Outlier Detection In Presence Of Domain Shift
- Rotary Position Embedding For Vision Transformer
- An Attention Free Transformer
- Towards Robust Image Classification Using Sequential Attention Models
- Aggregating Global Features Into Local Vision Transformer
- Dual-stream Network For Visual Recognition
- Understanding The Robustness in Vision Transformers
- Decomposeme: Simplifying Convnets For End-to-end Learning
- Cut-thumbnail: A Novel Data Augmentation For Convolutional Neural Network
- Super Vision Transformer
- Scene Image Representation By Foreground, Background And Hybrid Features
- Co-training Transformer With Videos And Images Improves Action Recognition
- MViTv2: Improved Multiscale Vision Transformers for Classification and Detection
- Scaling Vision With Sparse Mixture Of Experts
- Vitamin: Designing Scalable Vision Models In The Vision-language Era
- Normalization Matters In Weakly Supervised Object Localization
- High-resolution Image Inpainting Using Multi-scale Neural Patch Synthesis
- Metaformer Is Actually What You Need For Vision
- Vitol: Vision Transformer For Weakly Supervised Object Localization
- Convmlp: Hierarchical Convolutional Mlps For Vision
- Finding an Unsupervised Image Segmenter in Each of Your Deep Generative Models
- Scaling Vision Transformers
- DEYO: DETR With YOLO For End-to-end Object Detection
- Adavit: Adaptive Tokens For Efficient Vision Transformer
- Occlusions For Effective Data Augmentation In Image Classification
- Multi-criteria Token Fusion With One-step-ahead Attention For Efficient Vision Transformers
- A Unified View of Masked Image Modeling
- Tokenmixup: Efficient Attention-guided Token-level Data Augmentation For Transformers
- Supmae: Supervised Masked Autoencoders Are Efficient Vision Learners
- Rest V2: Simpler, Faster And Stronger
- DAP: Detection-aware Pre-training With Weak Supervision
- Visformer: The Vision-friendly Transformer
- Tokenlearner: What Can 8 Learned Tokens Do For Images And Videos?
- Decoder Denoising Pretraining for Semantic Segmentation
- TDAF: Top-down Attention Framework For Vision Tasks
- Classifier-agnostic Saliency Map Extraction
- Learned Thresholds Token Merging And Pruning For Vision Transformers
- \(V_kD\): Improving Knowledge Distillation Using Orthogonal Projections
- An Inverse Scaling Law For CLIP Training
- Improving Visual Representation Learning Through Perceptual Understanding
- Lightweight Vision Transformer With Cross Feature Attention
- Understanding Gaussian Attention Bias of Vision Transformers Using Effective Receptive Fields
- Sp-vit: Learning 2D Spatial Priors For Vision Transformers
- Salience-based Adaptive Masking: Revisiting Token Dynamics For Enhanced Pre-training
- Vision Transformers In 2022: An Update On Tiny Imagenet
- Spformer: Enhancing Vision Transformer With Superpixel Representation
- Skip-attention: Improving Vision Transformers By Paying Less Attention
- Neighborhood Attention Transformer
- Learning High-level Visual Representations From A Child's Perspective Without Strong Inductive Biases
- Locally Enhanced Self-attention: Combining Self-attention And Convolution As Local And Context Terms
- Multilevel Context Representation For Improving Object Recognition
- Enhancing Transformer-based Vision Models: Addressing Feature Map Anomalies Through Novel Optimization Strategies
- To Be Critical: Self-calibrated Weakly Supervised Learning For Salient Object Detection
- Nested Hierarchical Transformer: Towards Accurate, Data-efficient And Interpretable Visual Understanding
- Exploring the Limits of Deep Image Clustering using Pretrained Models
- Couplformer: Rethinking Vision Transformer With Coupling Attention Map
- Unifying Visual Perception By Dispersible Points Learning
- SPIN: An Empirical Evaluation On Sharing Parameters Of Isotropic Networks
- Ravitt: Random Vision Transformer Tokens
- Dmformer: Closing The Gap Between CNN And Vision Transformers
- Positional Label For Self-supervised Vision Transformer
- Pixel Objectness
- Semi-Supervised Vision Transformers
- OVO: One-shot Vision Transformer Search with Online distillation
- Patch Is Not All You Need
- Beyond Pixels: Enhancing LIME with Hierarchical Features and Segmentation Foundation Models
- Improve Supervised Representation Learning With Masked Image Modeling
- What Makes For Hierarchical Vision Transformer?
- Keypoint Aware Masked Image Modelling
- Zero-shot Object Detection Through Vision-language Embedding Alignment
- AFIDAF: Alternating Fourier and Image Domain Adaptive Filters as an Efficient Alternative to Attention in ViTs
- DenseDINO: Boosting Dense Self-Supervised Learning with Token-Based Point-Level Consistency
- Multistep Distillation of Diffusion Models via Moment Matching
- Latent Forcing: Reordering the Diffusion Trajectory for Pixel-Space Image Generation
- Le-DETR: Revisiting Real-Time Detection Transformer with Efficient Encoder Design
- TriLite: Efficient Weakly Supervised Object Localization with Universal Visual Features and Tri-Region Disentanglement
- RAViT: Resolution-Adaptive Vision Transformer
- What Helps---and What Hurts: Bidirectional Explanations for Vision Transformers
- Large Language Models Facilitate Vision Reflection In Image Classification
- Spuriosity Rankings For Free: A Simple Framework For Last Layer Retraining Based On Object Detection
- A Pytorch Reproduction Of Masked Generative Image Transformer
- Player Re-identification Using Body Part Appearences
- Retcompletion: High-speed Inference Image Completion With Retentive Network
- Efficient Self-supervised Vision Pretraining With Local Masked Reconstruction
- Vit-p: Rethinking Data-efficient Vision Transformers From Locality
- Image Clustering Via The Principle Of Rate Reduction In The Age Of Pretrained Models
- Cs-mixer: A Cross-scale Vision MLP Model With Spatial-channel Mixing
- Multimodal Autoregressive Pre-training Of Large Vision Encoders
- Improving Progressive Generation With Decomposable Flow Matching
- Learnings From Scaling Visual Tokenizers For Reconstruction And Generation
- Mambavision: A Hybrid Mamba-transformer Vision Backbone
- Object-level Self-distillation For Vision Pretraining
- Impact Of Light And Shadow On Robustness Of Deep Neural Networks
- Iwin Transformer: Hierarchical Vision Transformer Using Interleaved Windows
- Fusion Of Regional And Sparse Attention In Vision Transformers
- Parameter Reduction Improves Vision Transformers: A Comparative Study Of Sharing And Width Reduction
- Rgb-based Semantic Segmentation Using Self-supervised Depth Pre-training
- On The Surprising Effectiveness Of Attention Transfer For Vision Transformers
- Patchdropout: Economizing Vision Transformers Using Patch Dropout
- Atoken: A Unified Tokenizer For Vision
- Spiralmlp: A Lightweight Vision MLP Architecture
- Vote&mix: Plug-and-play Token Reduction For Efficient Vision Transformer
- Strait: Non-autoregressive Generation With Stratified Image Transformer
- Zero-shot Object Detection: Learning To Simultaneously Recognize And Localize Novel Concepts
- Mabvit -- Modified Attention Block Enhances Vision Transformers
- Parformer: A Vision Transformer With Parallel Mixer And Sparse Channel Attention Patch Embedding
- R-FCN-3000 At 30fps: Decoupling Detection And Classification
- Tosa: Token Selective Attention For Efficient Vision Transformers
- Resformer: Scaling Vits With Multi-resolution Training
- Deit III: Revenge Of The Vit
- COMCAT: Towards Efficient Compression And Customization Of Attention-based Vision Models
- Centroid-centered Modeling For Efficient Vision Transformer Pre-training
- Pvit: Prior-augmented Vision Transformer For Out-of-distribution Detection
- Wavelet-based Image Tokenizer For Vision Transformers
- CRAFT Objects from Images
- Efficient Visual Pretraining with Contrastive Detection