ScenarioCLIP: Pretrained Transferable Visual Language Models And Action-Genome Dataset For Natural Scene Analysis
2025 · Advik Sinha, Saurabh Atreya, Aashutosh A, et al.
Abstract
Until recently, work on CLIP-type foundation models has largely explored either the retrieval of short descriptions or the classification of objects in a scene, framed as a single-object image classification task. The same holds for image retrieval, that is, retrieving the image embedding given a text prompt. However, real-world scene images exhibit rich compositional structure involving multiple objects and actions. The latest methods in the CLIP-based literature improve class-level discrimination by mining harder negative image-text pairs and by refining permanent text prompts, often using LLMs. These improvements, however, remain confined to predefined class lists and do not explicitly model relational or compositional structure. PyramidCLIP partially addresses this gap by aligning global and local visual features, yet it still lacks explicit modeling of inter-object relations. Hence, to further leverage this aspect for scene analysis, the proposed ScenarioCLIP model accepts input
Related papers
- DGTRSD & DGTRS-CLIP: A Dual-granularity Remote Sensing Image-text Dataset And Vision Language Foundation Model For Alignment (2025)
- LLM2CLIP: Powerful Language Model Unlocks Richer Cross-modality Representation (2024)
- Compositional Image-text Matching And Retrieval By Grounding Entities (2025)
- LightCLIP: Learning Multi-level Interaction For Lightweight Vision-language Models (2023)
- CLIP-ViP: Adapting Pre-trained Image-text Model To Video-language Representation Alignment (2022)
- TripletCLIP: Improving Compositional Reasoning Of CLIP Via Synthetic Vision-language Negatives (2024)
- ContextCLIP: Contextual Alignment Of Image-text Pairs On CLIP Visual Representations (2022)
- VideoCLIP-XL: Advancing Long Description Understanding For Video CLIP Models (2024)