MSCOCO
Emerging
57 papers using it
First seen: 2022
MSCOCO is a dataset that contains images paired with descriptive captions, used to evaluate multimodal image-text retrieval and understanding tasks.
Papers using MSCOCO (57)
- Intra-Modal Neighbors Never Lie: Rectifying Inter-Modal Noisy Correspondence via Graph-Based Intra-Modal Reasoning
- CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for Image-Text Retrieval
- Smartclip: Modular Vision-language Alignment With Identification Guarantees
- Robust Multimodal Learning Via Entropy-gated Contrastive Fusion
- Overthinking Causes Hallucination: Tracing Confounder Propagation in Vision Language Models
- MM-TS: Multi-Modal Temperature and Margin Schedules for Contrastive Learning with Long-Tail Data
- Anatomy of a Lie: A Multi-Stage Diagnostic Framework for Tracing Hallucinations in Vision-Language Models
- GoldiCLIP: The Goldilocks Approach for Balancing Explicit Supervision for Language-Image Pretraining
- ORIC: Benchmarking Object Recognition Under Contextual Incongruity In Large Vision-language Models
- SEPS: Semantic-enhanced Patch Slimming Framework for fine-grained cross-modal alignment
- Cross Modal Fine-Grained Alignment via Granularity-Aware and Region-Uncertain Modeling
- Language-Guided Invariance Probing of Vision-Language Models
- Cure or Poison? Embedding Instructions Visually Alters Hallucination in Vision-Language Models
- Explaining Similarity in Vision-Language Encoders with Weighted Banzhaf Interactions
- Concept Regions Matter: Benchmarking CLIP With A New Cluster-importance Approach
- Coco-urdu: A Large-scale Urdu Image-caption Dataset With Multimodal Quality Estimation
- Spec-llava: Accelerating Vision-language Models With Dynamic Tree-based Speculative Decoding
- From Pixels And Words To Waves: A Unified Framework For Spectral Dictionary Vllms
- Leveraging Vision-language Pre-training For Human Activity Recognition In Still Images
- Mining Contextualized Visual Associations From Images For Creativity Understanding
- A Good CREPE Needs More Than Just Sugar: Investigating Biases In Compositional Vision-language Benchmarks
- One Object, Multiple Lies: A Benchmark For Cross-task Adversarial Attack On Unified Vision-language Models
- Compositional Image-Text Matching and Retrieval by Grounding Entities
- Beyond Modality Collapse: Representations Blending for Multimodal Dataset Distillation
- Distill CLIP (DCLIP): Enhancing Image-Text Retrieval via Cross-Modal Transformer Distillation
- Redemption Score: A Multi-Modal Evaluation Framework for Image Captioning via Distributional, Perceptual, and Linguistic Signal Triangulation
- CIBR: Cross-modal Information Bottleneck Regularization for Robust CLIP Generalization
- ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
- Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning
- Cross-Modal Adapter for Vision-Language Retrieval
- A Frustratingly Simple Approach for End-to-End Image Captioning
- Multi-modal Pre-training for Medical Vision-language Understanding and Generation: An Empirical Study with A New Benchmark
- Uncurated Image-Text Datasets: Shedding Light on Demographic Bias
- Zero-shot Image Captioning by Anchor-augmented Vision-Language Space Alignment
- COCO-Counterfactuals: Automatically Constructed Counterfactual Examples for Image-Text Pairs
- Multimodal Data Augmentation for Image Captioning using Diffusion Models
- Multi-Modal Few-Shot Temporal Action Detection
- Plug-and-Play Regulators for Image-Text Matching
- Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal Structured Representations
- Embracing Language Inclusivity and Diversity in CLIP through Continual Language Learning
- PVLR: Prompt-driven Visual-Linguistic Representation Learning for Multi-Label Image Recognition
- RAVEN: Multitask Retrieval Augmented Vision-Language Learning
- Rethinking Sparse Lexical Representations for Image Retrieval in the Age of Rising Multi-Modal Large Language Models
- ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval
- Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision
- Stacked Cross-modal Feature Consolidation Attention Networks for Image Captioning
- MAGVLT: Masked Generative Vision-and-Language Transformer
- Linear Alignment of Vision-language Models for Image Captioning
- Generative Visual Question Answering
- Towards reporting bias in visual-language datasets: bimodal augmentation by decoupling object-attribute association
- Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models
- CL2CM: Improving Cross-Lingual Cross-Modal Retrieval via Cross-Lingual Knowledge Transfer
- VLM-PL: Advanced Pseudo Labeling Approach for Class Incremental Object Detection via Vision-Language Model
- Feedback-based Modal Mutual Search for Attacking Vision-Language Pre-training Models
- Nearest Neighbor Normalization Improves Multimodal Retrieval
- Learnable Pillar-based Re-ranking for Image-Text Retrieval
- Text-Region Matching for Multi-Label Image Recognition with Missing Labels