Flickr30k
Canonical
23 papers using it
First seen: 2023
Flickr30k is an image-caption dataset of roughly 31,000 images (31,783 in the standard release), each paired with five human-written textual descriptions. It is used to evaluate cross-modal retrieval and the alignment between visual and textual information.
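As a rough illustration of what evaluating cross-modal retrieval on a five-captions-per-image dataset can look like, here is a minimal sketch of text-to-image Recall@K. The metric choice, the function name `recall_at_k`, and the toy similarity scores are assumptions made for illustration. They are not taken from this page or from any specific paper listed below.

```python
def recall_at_k(similarity, gt_image_for_caption, k):
    """Text-to-image Recall@K.

    similarity: list of rows, one per caption; each row holds one score per image.
    gt_image_for_caption: index of the ground-truth image for each caption.
    Returns the fraction of captions whose ground-truth image is in the top k.
    """
    hits = 0
    for caption_idx, scores in enumerate(similarity):
        # Rank image indices by descending similarity to this caption.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if gt_image_for_caption[caption_idx] in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy setup (hypothetical numbers): 2 images with 5 captions each.
captions_per_image = 5
num_images = 2
gt = [i // captions_per_image for i in range(num_images * captions_per_image)]

# Rows are captions, columns are images. Caption 4 is deliberately mismatched.
similarity = [[0.9, 0.1]] * 4 + [[0.2, 0.8]] + [[0.3, 0.7]] * 5

r_at_1 = recall_at_k(similarity, gt, 1)
r_at_2 = recall_at_k(similarity, gt, 2)
print(r_at_1, r_at_2)
```

In practice the similarity scores would come from a vision-language model's image and text embeddings. The ranking logic is the only part sketched here.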
Papers using Flickr30k (23)
- Intra-Modal Neighbors Never Lie: Rectifying Inter-Modal Noisy Correspondence via Graph-Based Intra-Modal Reasoning
- MM-TS: Multi-Modal Temperature and Margin Schedules for Contrastive Learning with Long-Tail Data
- LogicGaze: Benchmarking Causal Consistency in Visual Narratives via Counterfactual Verification
- CoT4Det: A Chain-of-Thought Framework for Perception-Oriented Vision-Language Tasks
- SEPS: Semantic-enhanced Patch Slimming Framework for fine-grained cross-modal alignment
- Cross Modal Fine-Grained Alignment via Granularity-Aware and Region-Uncertain Modeling
- A Little More Like This: Text-to-Image Retrieval with Vision-Language Models Using Relevance Feedback
- Vision-Free Retrieval: Rethinking Multimodal Search with Textual Scene Descriptions
- Extracting Multimodal Learngene in CLIP: Unveiling the Multimodal Generalizable Knowledge
- Rate-distortion Limits For Multimodal Retrieval: Theory, Optimal Codes, And Finite-sample Guarantees
- Parameter Efficient Multimodal Instruction Tuning For Romanian Vision Language Models
- Covmatch: Cross-covariance Guided Multimodal Dataset Distillation With Trainable Text Encoder
- Robust Vision-language Models Via Tensor Decomposition: A Defense Against Adversarial Attacks
- Compositional Image-Text Matching and Retrieval by Grounding Entities
- Beyond Modality Collapse: Representations Blending for Multimodal Dataset Distillation
- Distill CLIP (DCLIP): Enhancing Image-Text Retrieval via Cross-Modal Transformer Distillation
- CIBR: Cross-modal Information Bottleneck Regularization for Robust CLIP Generalization
- CLIP-PING: Boosting Lightweight Vision-Language Models with Proximus Intrinsic Neighbors Guidance
- Linear Alignment of Vision-language Models for Image Captioning
- Towards reporting bias in visual-language datasets: bimodal augmentation by decoupling object-attribute association
- Feedback-based Modal Mutual Search for Attacking Vision-Language Pre-training Models
- Nearest Neighbor Normalization Improves Multimodal Retrieval
- Towards Fast and Accurate Image-Text Retrieval with Self-Supervised Fine-Grained Alignment