Zhe Gan
15 papers · 1650 citations
Most-cited papers
- Ferret: Refer And Ground Anything Anywhere At Any Granularity · 2023 · 503 citations
- Less Is More: ClipBERT For Video-and-language Learning Via Sparse Sampling · 2021 · 468 citations
- An Empirical Study Of Training End-to-end Vision-and-language Transformers · 2021 · 273 citations
- MM1: Methods, Analysis & Insights From Multimodal LLM Pre-training · 2024 · 261 citations
- Ferret-UI: Grounded Mobile UI Understanding With Multimodal LLMs · 2024 · 174 citations
- Guiding Instruction-based Image Editing Via Multimodal Large Language Models · 2023 · 172 citations
- Scaling Up Vision-language Pre-training For Image Captioning · 2021 · 157 citations
- SlowFast-LLaVA: A Strong Training-free Baseline For Video Large Language Models · 2024 · 120 citations
- Injecting Semantic Concepts Into End-to-end Image Captioning · 2021 · 113 citations
- Tactical Rewind: Self-correction Via Backtracking In Vision-and-language Navigation · 2019 · 95 citations
- LAVENDER: Unifying Video-language Understanding As Masked Language Modeling · 2022 · 50 citations
- Coarse-to-fine Vision-language Pre-training With Fusion In The Backbone · 2022 · 10 citations
- MOFI: Learning Image Representations From Noisy Entity Annotated Images · 2023
- Large-scale Adversarial Training For Vision-and-language Representation Learning · 2020
- UFO: A Unified Transformer For Vision-language Representation Learning · 2021
Topics