Xizhou Zhu
20 papers · 394 citations
Most-cited papers
- How Far Are We To GPT-4V? Closing The Gap To Commercial Multimodal Models With Open-source Suites · 2024 · 339 citations
- VL-LTR: Learning Class-wise Visual-linguistic Representation For Long-tailed Visual Recognition · 2021 · 42 citations
- VisionLLM v2: An End-to-end Generalist Multimodal Large Language Model For Hundreds Of Vision-language Tasks · 2024 · 5 citations
- SynerGen-VL: Towards Synergistic Image Understanding And Generation With Vision Experts And Token Folding · 2024 · 4 citations
- PVC: Progressive Visual Token Compression For Unified Image And Video Processing In Large Vision-language Models · 2024 · 2 citations
- Ghost In The Minecraft: Generally Capable Agents For Open-world Environments Via Large Language Models With Text-based Knowledge And Memory · 2023
- ZeroGUI: Automating Online GUI Learning At Zero Human Cost · 2025
- MMBench-GUI: Hierarchical Multi-platform Evaluation Framework For GUI Agents · 2025
- MiroThinker: Pushing The Performance Boundaries Of Open-source Research Agents Via Model, Context, And Interactive Scaling · 2025
- Collaborative Visual Navigation · 2021
Topics