MS COCO
Emerging
41 papers using it
First seen: 2023
MS COCO is a large-scale dataset of images paired with captions, used to evaluate models on tasks such as image captioning and cross-modal retrieval.
Papers using MS COCO (41)
- Intra-Modal Neighbors Never Lie: Rectifying Inter-Modal Noisy Correspondence via Graph-Based Intra-Modal Reasoning
- Smartclip: Modular Vision-language Alignment With Identification Guarantees
- Robust Multimodal Learning Via Entropy-gated Contrastive Fusion
- Overthinking Causes Hallucination: Tracing Confounder Propagation in Vision Language Models
- MM-TS: Multi-Modal Temperature and Margin Schedules for Contrastive Learning with Long-Tail Data
- Anatomy of a Lie: A Multi-Stage Diagnostic Framework for Tracing Hallucinations in Vision-Language Models
- GoldiCLIP: The Goldilocks Approach for Balancing Explicit Supervision for Language-Image Pretraining
- ORIC: Benchmarking Object Recognition Under Contextual Incongruity In Large Vision-language Models
- SEPS: Semantic-enhanced Patch Slimming Framework for fine-grained cross-modal alignment
- Cross Modal Fine-Grained Alignment via Granularity-Aware and Region-Uncertain Modeling
- Language-Guided Invariance Probing of Vision-Language Models
- Cure or Poison? Embedding Instructions Visually Alters Hallucination in Vision-Language Models
- Explaining Similarity in Vision-Language Encoders with Weighted Banzhaf Interactions
- Concept Regions Matter: Benchmarking CLIP With A New Cluster-importance Approach
- Coco-urdu: A Large-scale Urdu Image-caption Dataset With Multimodal Quality Estimation
- Spec-llava: Accelerating Vision-language Models With Dynamic Tree-based Speculative Decoding
- From Pixels And Words To Waves: A Unified Framework For Spectral Dictionary Vllms
- Leveraging Vision-language Pre-training For Human Activity Recognition In Still Images
- Mining Contextualized Visual Associations From Images For Creativity Understanding
- A Good CREPE Needs More Than Just Sugar: Investigating Biases In Compositional Vision-language Benchmarks
- One Object, Multiple Lies: A Benchmark For Cross-task Adversarial Attack On Unified Vision-language Models
- Compositional Image-Text Matching and Retrieval by Grounding Entities
- Beyond Modality Collapse: Representations Blending for Multimodal Dataset Distillation
- Distill CLIP (DCLIP): Enhancing Image-Text Retrieval via Cross-Modal Transformer Distillation
- Redemption Score: A Multi-Modal Evaluation Framework for Image Captioning via Distributional, Perceptual, and Linguistic Signal Triangulation
- CIBR: Cross-modal Information Bottleneck Regularization for Robust CLIP Generalization
- Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning
- COCO-Counterfactuals: Automatically Constructed Counterfactual Examples for Image-Text Pairs
- Embracing Language Inclusivity and Diversity in CLIP through Continual Language Learning
- PVLR: Prompt-driven Visual-Linguistic Representation Learning for Multi-Label Image Recognition
- RAVEN: Multitask Retrieval Augmented Vision-Language Learning
- Rethinking Sparse Lexical Representations for Image Retrieval in the Age of Rising Multi-Modal Large Language Models
- Linear Alignment of Vision-language Models for Image Captioning
- Generative Visual Question Answering
- Towards reporting bias in visual-language datasets: bimodal augmentation by decoupling object-attribute association
- Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models
- CL2CM: Improving Cross-Lingual Cross-Modal Retrieval via Cross-Lingual Knowledge Transfer
- VLM-PL: Advanced Pseudo Labeling Approach for Class Incremental Object Detection via Vision-Language Model
- Feedback-based Modal Mutual Search for Attacking Vision-Language Pre-training Models
- Nearest Neighbor Normalization Improves Multimodal Retrieval
- Text-Region Matching for Multi-Label Image Recognition with Missing Labels