GQA
Canonical | 23 papers using it
First seen: 2023
GQA is a visual question answering benchmark. Its diverse set of tasks is used to evaluate how well models understand and reason about images in conjunction with natural language.
Papers using GQA (23)
- Hierarchical Pre-Training of Vision Encoders with Large Language Models
- AIM: Asymmetric Information Masking for Visual Question Answering Continual Learning
- Hidden Clones: Exposing and Fixing Family Bias in Vision-Language Model Ensembles
- VOILA: Value-of-Information Guided Fidelity Selection for Cost-Aware Multimodal Question Answering
- How Vision Becomes Language: A Layer-wise Information-Theoretic Analysis of Multimodal Reasoning
- Text-Guided Layer Fusion Mitigates Hallucination in Multimodal LLMs
- HybridToken-VLM: Hybrid Token Compression for Vision-Language Models
- Efficient Vision-Language Reasoning via Adaptive Token Pruning
- MV-CoRe: Multimodal Visual-Conceptual Reasoning for Complex Visual Question Answering
- MPCAR: Multi-Perspective Contextual Augmentation for Enhanced Visual Reasoning in Large Vision-Language Models
- Hierarchical Contextual Grounding LVLM: Enhancing Fine-Grained Visual-Language Understanding with Robust Grounding
- ByDeWay: Boost Your multimodal LLM with DEpth prompting in a Training-Free Way
- ASCD: Attention-Steerable Contrastive Decoding for Reducing Hallucination in MLLM
- Constructive Distortion: Improving Mllms With Attention-guided Image Warping
- Test-time Warmup For Multimodal Large Language Models
- Multi-Sourced Compositional Generalization in Visual Question Answering
- II-MMR: Identifying and Improving Multi-modal Multi-hop Reasoning in Visual Question Answering
- Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models
- LCV2: An Efficient Pretraining-Free Framework for Grounded Visual Question Answering
- Efficient Large Multi-modal Models via Visual Context Compression
- SADL: An Effective In-Context Learning Method for Compositional Visual QA
- Enhancing Instruction-Following Capability of Visual-Language Models by Reducing Image Redundancy
- Leveraging Retrieval-Augmented Tags for Large Vision-Language Understanding in Complex Scenes