Renrui Zhang

12 papers · 0 citations

Most-cited papers

Video-MME: The First-ever Comprehensive Evaluation Benchmark Of Multi-modal LLMs In Video Analysis
2024 · 1125 citations
Tip-Adapter: Training-free Adaption Of CLIP For Few-shot Classification
2022 · 306 citations
SPHINX: The Joint Mixing Of Weights, Tasks, And Visual Embeddings For Multi-modal Large Language Models
2023 · 288 citations
Point-Bind & Point-LLM: Aligning Point Cloud With Multi-modality For 3D Understanding, Generation, And Instruction Following
2023 · 213 citations
ManipLLM: Embodied Multimodal Large Language Model For Object-centric Robotic Manipulation
2023 · 209 citations
ImageBind-LLM: Multi-modality Instruction Tuning
2023 · 174 citations
PointCLIP V2: Prompting CLIP And GPT For Powerful 3D Open-world Learning
2022 · 158 citations
Frozen CLIP Models Are Efficient Video Learners
2022 · 156 citations
CALIP: Zero-shot Enhancement Of CLIP With Parameter-free Attention
2022 · 91 citations
Can Language Understand Depth?
2022 · 57 citations
Delving Into RL For Image Generation With CoT: A Study On DPO Vs. GRPO
2025
MINT-CoT: Enabling Interleaved Visual Tokens In Mathematical Chain-of-thought Reasoning
2025
AC-DiT: Adaptive Coordination Diffusion Transformer For Mobile Manipulation
2025
UniCTokens: Boosting Personalized Understanding And Generation Via Unified Concept Tokens
2025

Topics

Vision-Language Model Architecture Training Techniques Vision-Language Models Fine-Tuning Uncategorized 3D Vision Visual Language Visual QA & Reasoning Evaluation