VQA
Canonical, 49 papers using it
First seen: 2023
The VQA (Visual Question Answering) dataset pairs images with natural-language questions and their answers. It is used to evaluate how well models understand and reason about visual content when answering natural-language queries.
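As a rough illustration of the image, question, and answer pairing described above, the following minimal sketch models one record and scores a prediction against several human answers. The class name `VQAExample`, its field names, the file path, and the sample answers are all hypothetical and are not the dataset's official schema. The scoring function follows the commonly used consensus rule min(number of matching human answers / 3, 1); official evaluation scripts also apply more answer normalization and average over annotator subsets, which this sketch omits.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class VQAExample:
    """One hypothetical VQA-style record (illustrative fields, not the official schema)."""
    image_path: str      # where the image lives; the path used below is made up
    question: str        # natural-language question about the image
    answers: list[str]   # free-form answers collected from several annotators


def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Simplified consensus accuracy: min(#annotators giving this answer / 3, 1)."""
    normalized = prediction.strip().lower()
    counts = Counter(answer.strip().lower() for answer in human_answers)
    return min(counts[normalized] / 3.0, 1.0)


# Made-up example data, only to show the shape of a record.
example = VQAExample(
    image_path="images/example.jpg",
    question="What color is the umbrella?",
    answers=["red"] * 6 + ["dark red"] * 2 + ["maroon"] * 2,
)

for candidate in ("red", "maroon", "blue"):
    print(candidate, vqa_accuracy(candidate, example.answers))
```

Under this rule, an answer given by at least three annotators receives full credit, an answer given by one or two receives partial credit, and an answer no annotator gave receives none.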
Papers using VQA (49)
- Seeing Before Agreeing: Aligning Multi-Agent Consensus with Visual Evidence
- Hyper-ICL: Attention Calibration with Hyperbolic Anchor Distillation for Multimodal In-Context Learning
- EnsemHalDet: Robust VLM Hallucination Detection via Ensemble of Internal State Detectors
- AIM: Asymmetric Information Masking for Visual Question Answering Continual Learning
- DOSE: Data Selection for Multi-Modal LLMs via Off-the-Shelf Models
- RS-HyRe-R1: A Hybrid Reward Mechanism to Overcome Perceptual Inertia for Remote Sensing Images Understanding
- DEX-AR: A Dynamic Explainability Method for Autoregressive Vision-Language Models
- Parallel In-context Learning for Large Vision Language Models
- Hidden Clones: Exposing and Fixing Family Bias in Vision-Language Model Ensembles
- Nüwa: Mending the Spatial Integrity Torn by VLM Token Pruning
- VOILA: Value-of-Information Guided Fidelity Selection for Cost-Aware Multimodal Question Answering
- GuardAlign: Test-time Safety Alignment in Multimodal Large Language Models
- Text-Guided Layer Fusion Mitigates Hallucination in Multimodal LLMs
- HybridToken-VLM: Hybrid Token Compression for Vision-Language Models
- Efficient Vision-Language Reasoning via Adaptive Token Pruning
- OMEGA: Optimized Multimodal Position Encoding Index Derivation with Global Adaptive Scaling for Vision-Language Models
- Draft and Refine with Visual Experts
- Looking to Learn: Token-wise Dynamic Gating for Low-Resource Vision-Language Modelling
- Human Uncertainty-Aware Data Selection and Automatic Labeling in Visual Question Answering
- AttAnchor: Guiding Cross-Modal Token Alignment in VLMs with Attention Anchors
- MV-CoRe: Multimodal Visual-Conceptual Reasoning for Complex Visual Question Answering
- INTER: Mitigating Hallucination in Large Vision-Language Models by Interaction Guidance Sampling
- From Pixels And Words To Waves: A Unified Framework For Spectral Dictionary Vllms
- Investigating Redundancy In Multimodal Large Language Models With Multiple Vision Encoders
- Optmerge: Unifying Multimodal LLM Capabilities And Modalities Via Model Merging
- PostAlign: Multimodal Grounding as a Corrective Lens for MLLMs
- Calibrating Uncertainty Quantification of Multi-Modal LLMs using Grounding
- VISTA: Enhancing Vision-Text Alignment in MLLMs via Cross-Modal Mutual Information Maximization
- FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering
- Elevating Visual Question Answering through Implicitly Learned Reasoning Pathways in LVLMs
- Semantic-Clipping: Efficient Vision-Language Modeling with Semantic-Guidedd Visual Selection
- Ask and Remember: A Questions-Only Replay Strategy for Continual Visual Question Answering
- Scaling Large Vision-Language Models for Enhanced Multimodal Comprehension In Biomedical Image Analysis
- UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding
- MMBench: Is Your Multi-modal Model an All-around Player?
- How to Configure Good In-Context Sequence for Visual Question Answering
- Mitigating Dialogue Hallucination for Large Vision Language Models via Adversarial Instruction Tuning
- LXMERT Model Compression for Visual Question Answering
- Generative Visual Question Answering
- Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models
- Towards Perceiving Small Visual Details in Zero-shot Visual Question Answering with Multimodal LLMs
- Improving Vision-and-Language Reasoning via Spatial Relations Modeling
- VQAttack: Transferable Adversarial Attacks on Visual Question Answering via Pre-trained Models
- Multi-Modal Hallucination Control by Visual Information Grounding
- Selectively Answering Visual Questions
- Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy
- MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale
- Optimizing Vision-Language Interactions Through Decoder-Only Models
- Leveraging Retrieval-Augmented Tags for Large Vision-Language Understanding in Complex Scenes