
Xizhou Zhu

20 papers · 394 citations

Most-cited papers

How Far Are We To GPT-4V? Closing The Gap To Commercial Multimodal Models With Open-source Suites
2024 · 339 citations
VL-LTR: Learning Class-wise Visual-linguistic Representation For Long-tailed Visual Recognition
2021 · 42 citations
VisionLLM v2: An End-to-end Generalist Multimodal Large Language Model For Hundreds Of Vision-language Tasks
2024 · 5 citations
SynerGen-VL: Towards Synergistic Image Understanding And Generation With Vision Experts And Token Folding
2024 · 4 citations
PVC: Progressive Visual Token Compression For Unified Image And Video Processing In Large Vision-language Models
2024 · 2 citations
Ghost In The Minecraft: Generally Capable Agents For Open-world Environments Via Large Language Models With Text-based Knowledge And Memory
2023
ZeroGUI: Automating Online GUI Learning At Zero Human Cost
2025
MMBench-GUI: Hierarchical Multi-platform Evaluation Framework For GUI Agents
2025
MiroThinker: Pushing The Performance Boundaries Of Open-source Research Agents Via Model, Context, And Interactive Scaling
2025
Collaborative Visual Navigation
2021

Topics

Visual Language · 3D Vision · Code · Agents · Image Generation · Video Understanding · Multi-Agent · Object Detection · Memory · Uncategorized · Evaluation