COCO
Canonical
256 papers using it
2016 first seen
Papers using COCO (200)
- End-to-end Object Detection With Transformers
- Centernet: Keypoint Triplets For Object Detection
- Masked-attention Mask Transformer For Universal Image Segmentation
- Deformable DETR: Deformable Transformers for End-to-End Object Detection
- DSSD : Deconvolutional Single Shot Detector
- Grounding DINO: Marrying DINO With Grounded Pre-training For Open-set Object Detection
- TOOD: Task-aligned One-stage Object Detection
- Rotate To Attend: Convolutional Triplet Attention Module
- Dynamic Head: Unifying Object Detection Heads With Attentions
- Meshed-memory Transformer For Image Captioning
- Simple Copy-paste Is A Strong Data Augmentation Method For Instance Segmentation
- Pointrend: Image Segmentation As Rendering
- Conformer: Local Features Coupling Global Representations For Visual Recognition
- Augmentation For Small Object Detection
- Exploring Plain Vision Transformer Backbones For Object Detection
- Bounding Box Regression With Uncertainty For Accurate Object Detection
- Visual Semantic Reasoning For Image-text Matching
- YOLACT++: Better Real-time Instance Segmentation
- UP-DETR: Unsupervised Pre-training For Object Detection With Transformers
- Neural Baby Talk
- Extended Feature Pyramid Network For Small Object Detection
- Upsnet: A Unified Panoptic Segmentation Network
- Associative Embedding: End-to-End Learning for Joint Detection and Grouping
- Involution: Inverting The Inherence Of Convolution For Visual Recognition
- Tokenpose: Learning Keypoint Tokens For Human Pose Estimation
- Visual Transformers: Token-based Image Representation And Processing For Computer Vision
- Propagate Yourself: Exploring Pixel-level Consistency For Unsupervised Visual Representation Learning
- Recurrent Fusion Network For Image Captioning
- Rtmdet: An Empirical Study Of Designing Real-time Object Detectors
- Frustratingly Simple Few-shot Object Detection
- Light-head R-CNN: In Defense Of Two-stage Object Detector
- Attention-guided Unified Network For Panoptic Segmentation
- Dense Distinct Query For End-to-end Object Detection
- Stand-alone Self-attention In Vision Models
- Detection In Crowded Scenes: One Proposal, Multiple Predictions
- K-net: Towards Unified Image Segmentation
- Position Focused Attention Network For Image-text Matching
- Autoassign: Differentiable Label Assignment For Dense Object Detection
- Cut And Learn For Unsupervised Object Detection And Instance Segmentation
- Efficient DETR: Improving End-to-end Object Detector With Dense Prior
- CornerNet: Detecting Objects as Paired Keypoints
- Mask Transfiner For High-quality Instance Segmentation
- YOLO-MS: Rethinking Multi-scale Representation Learning For Real-time Object Detection
- Learning Meta-class Memory For Few-shot Semantic Segmentation
- Hrformer: High-resolution Transformer For Dense Prediction
- Towards Local Visual Modeling For Image Captioning
- Training Object Class Detectors With Click Supervision
- DPT: Deformable Patch-based Transformer For Visual Recognition
- Corner Proposal Network For Anchor-free, Two-stage Object Detection
- CPTR: Full Transformer Network for Image Captioning
- Rich Image Captioning In The Wild
- Object Detection In Equirectangular Panorama
- Autofocus: Efficient Multi-scale Inference
- Bottom-up And Top-down Attention For Image Captioning And Visual Question Answering
- Image Captioning: Transforming Objects into Words
- Centermask: Single Shot Instance Segmentation With Point Representation
- Training-time-friendly Network For Real-time Object Detection
- TJU-DHD: A Diverse High-resolution Dataset For Object Detection
- Points As Queries: Weakly Semi-supervised Object Detection By Points
- Pointing Novel Objects In Image Captioning
- Dense Learning Based Semi-supervised Object Detection
- Tiny-dsod: Lightweight Object Detection For Resource-restricted Usages
- Multi-instance Pose Networks: Rethinking Top-down Pose Estimation
- Fcpose: Fully Convolutional Multi-person Pose Estimation With Dynamic Instance-aware Convolutions
- Learning Instance Occlusion For Panoptic Segmentation
- Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition
- Causal Intervention for Weakly-Supervised Semantic Segmentation
- RTMO: Towards High-performance One-stage Real-time Multi-person Pose Estimation
- One-Shot Instance Segmentation
- Self-EMD: Self-Supervised Object Detection without ImageNet
- Comprehensive Attention Self-Distillation for Weakly-Supervised Object Detection
- PPT: Token-pruned Pose Transformer For Monocular And Multi-view Human Pose Estimation
- Face Detection Using Improved Faster RCNN
- Dite-HRNet: Dynamic Lightweight High-Resolution Network for Human Pose Estimation
- Revisiting Feature Alignment For One-stage Object Detection
- You Only Segment Once: Towards Real-time Panoptic Segmentation
- Improving Image Captioning By Leveraging Knowledge Graphs
- Don't Even Look Once: Synthesizing Features For Zero-shot Detection
- TFPose: Direct Human Pose Estimation with Transformers
- Zero-shot Instance Segmentation
- ISTR: End-to-End Instance Segmentation with Transformers
- Derpn: Taking A Further Step Toward More General Object Detection
- Semi-autoregressive Transformer For Image Captioning
- Cascade-detr: Delving Into High-quality Universal Object Detection
- 5%>100%: Breaking Performance Shackles Of Full Fine-tuning On Visual Recognition Tasks
- Saccadenet: A Fast And Accurate Object Detector
- Sparse Semi-detr: Sparse Learnable Queries For Semi-supervised Object Detection
- Fully Convolutional Instance-aware Semantic Segmentation
- DETR For Crowd Pedestrian Detection
- Pixel Consensus Voting For Panoptic Segmentation
- Solving Missing-annotation Object Detection With Background Recalibration Loss
- MViTv2: Improved Multiscale Vision Transformers for Classification and Detection
- Mvitv2: Improved Multiscale Vision Transformers For Classification And Detection
- RODEO: Replay For Online Object Detection
- Foreground-background Imbalance Problem In Deep Object Detectors: A Review
- Crcnet: Few-shot Segmentation With Cross-reference And Region-global Conditional Networks
- Joint Coordinate Regression And Association For Multi-person Pose Estimation, A Pure Neural Network Approach
- ACORT: A Compact Object Relation Transformer For Parameter Efficient Image Captioning
- Dynamic Scale Training For Object Detection
- DiffusionInst: Diffusion Model for Instance Segmentation
- Re-scoring Using Image-language Similarity For Few-shot Object Detection
- Implicit Feature Pyramid Network for Object Detection
- Multi-class Token Transformer for Weakly Supervised Semantic Segmentation
- Mask DINO: Towards A Unified Transformer-based Framework For Object Detection And Segmentation
- LP-OVOD: Open-vocabulary Object Detection By Linear Probing
- Attend Refine Repeat: Active Box Proposal Generation Via In-out Localization
- Coconut: Modernizing COCO Segmentation
- Cheaper Pre-training Lunch: An Efficient Paradigm For Object Detection
- Anchor Pruning For Object Detection
- End-to-End Object Detection with Fully Convolutional Network
- Zero-shot Object Detection
- Single-shot Object Detection With Enriched Semantics
- Rest V2: Simpler, Faster And Stronger
- DAP: Detection-aware Pre-training With Weak Supervision
- Instance Segmentation With Point Supervision
- Slender Object Detection: Diagnoses And Improvements
- DPNET: Dual-path Network For Efficient Object Detection With Lightweight Self-attention
- MFPN: A Novel Mixture Feature Pyramid Network Of Multiple Architectures For Object Detection
- Benchmarking Performance Of Object Detection Under Image Distortions In An Uncontrolled Environment
- Deep Occlusion-Aware Instance Segmentation with Overlapping BiLayers
- CAT: Cross-Attention Transformer for One-Shot Object Detection
- Hierarchical Attention Network for Few-Shot Object Detection via Meta-Contrastive Learning
- Plain-det: A Plain Multi-dataset Object Detector
- A Novel Attention-based Aggregation Function To Combine Vision And Language
- Understanding Gaussian Attention Bias of Vision Transformers Using Effective Receptive Fields
- Rethinking Classification And Localization For Cascade R-CNN
- Anchor-intermediate Detector: Decoupling And Coupling Bounding Boxes For Accurate Object Detection
- Remax: Relaxing For Better Training On Efficient Panoptic Segmentation
- Se-psnet: Silhouette-based Enhancement Feature For Panoptic Segmentation Network
- Learning From Noisy Anchors For One-stage Object Detection
- NMS Strikes Back
- Feasibility Of Inconspicuous Gan-generated Adversarial Patches Against Object Detection
- Crosskd: Cross-head Knowledge Distillation For Object Detection
- Instances As Queries
- Contrastive Object Detection Using Knowledge Graph Embeddings
- Detect Everything With Few Examples
- T-VSE: Transformer-Based Visual Semantic Embedding
- Spatial Reasoning for Few-Shot Object Detection
- Ssfpn: Scale Sequence (S^2) Feature Based-feature Pyramid Network For Object Detection
- Reducing Label Noise In Anchor-free Object Detection
- Locally Enhanced Self-attention: Combining Self-attention And Convolution As Local And Context Terms
- Analysis of Visual Reasoning on One-Stage Object Detection
- Few-Shot Object Detection with Fully Cross-Transformer
- HCFormer: Unified Image Segmentation with Hierarchical Clustering
- Transformer Based Multitask Learning For Image Captioning And Object Detection
- Language-conditioned Detection Transformer
- Exploring Semantic Relationships For Unpaired Image Captioning
- Geometry Attention Transformer With Position-aware Lstms For Image Captioning
- A Tri-layer Plugin To Improve Occluded Detection
- What Are Expected Queries In End-to-end Object Detection?
- Real-time Transformer-based Open-vocabulary Detection With Efficient Fusion Head
- PPT: token-Pruned Pose Transformer for monocular and multi-view human pose estimation
- Real-time Panoptic Segmentation From Dense Detections
- Fastmask: Segment Multi-scale Object Candidates In One Shot
- A Single-shot Object Detector With Feature Aggregation And Enhancement
- Greedy Offset-guided Keypoint Grouping For Human Pose Estimation
- Sparseformer: Detecting Objects In HRW Shots Via Sparse Vision Transformer
- Unifying Visual Perception By Dispersible Points Learning
- Addressing The Challenges Of Open-world Object Detection
- SOS: Segment Object System For Open-world Instance Segmentation With Object Priors
- Feature-Driven Super-Resolution for Object Detection
- DETReg: Unsupervised Pretraining with Region Priors for Object Detection
- Task Specific Attention is one more thing you need for object detection
- Time-rEversed diffusioN tEnsor Transformer: A new TENET of Few-Shot Object Detection
- Can the Query-based Object Detector Be Designed with Fewer Stages?
- A Simple Latent Diffusion Approach for Panoptic Segmentation and Mask Inpainting
- Data Augmentation To Improve Robustness Of Image Captioning Solutions
- SEA: Bridging The Gap Between One- And Two-stage Detector Distillation Via Semantic-aware Alignment
- LINEA: Fast And Accurate Line Detection Using Scalable Transformers
- A Comparative Attention Framework For Better Few-shot Object Detection On Aerial Images
- Pose-mum : Reinforcing Key Points Relationship For Semi-supervised Human Pose Estimation
- Natural Adversarial Objects
- Colmix -- A Simple Data Augmentation Framework To Improve Object Detector Performance And Robustness In Aerial Images
- Revisiting DETR Pre-training For Object Detection
- Fast Hierarchical Learning For Few-shot Object Detection
- Equalization Loss For Large Vocabulary Instance Segmentation
- Zero-shot Object Detection Through Vision-language Embedding Alignment
- IvaNet: Learning to jointly detect and segment objets with the help of Local Top-Down Modules
- Learning to Inpaint by Progressively Growing the Mask Regions
- Image Captioning using Multiple Transformers for Self-Attention Mechanism
- Modulating Localization and Classification for Harmonized Object Detection
- Unsupervised Discovery of the Long-Tail in Instance Segmentation Using Hierarchical Self-Supervision
- Poseur: Direct Human Pose Regression with Transformers
- ACORT: A Compact Object Relation Transformer for Parameter Efficient Image Captioning
- CRCNet: Few-shot Segmentation with Cross-Reference and Region-Global Conditional Networks
- SeqCo-DETR: Sequence Consistency Training for Self-Supervised Object Detection with Transformers
- CLIP-DIY: CLIP Dense Inference Yields Open-Vocabulary Semantic Segmentation For-Free
- COMNet: Co-Occurrent Matching for Weakly Supervised Semantic Segmentation
- DECO: Unleashing the Potential of ConvNets for Query-based Detection and Segmentation
- Progressive Token Length Scaling in Transformer Encoders for Efficient Universal Segmentation
- A Simple and Generalist Approach for Panoptic Segmentation
- COCO-OLAC: A Benchmark for Occluded Panoptic Segmentation and Image Understanding
- Waterfall Transformer for Multi-person Pose Estimation
- Self-Corrected Flow Distillation for Consistent One-Step and Few-Step Text-to-Image Generation
- PyCAT4: A Hierarchical Vision Transformer-based Framework for 3D Human Pose Estimation
- Practical Insights into Semi-Supervised Object Detection Approaches
- Enhancing Open-Vocabulary Object Detection through Multi-Level Fine-Grained Visual-Language Alignment
- Le-DETR: Revisiting Real-Time Detection Transformer with Efficient Encoder Design
- What Helps---and What Hurts: Bidirectional Explanations for Vision Transformers
- Exploring Open-Vocabulary Object Recognition in Images using CLIP