
StarVLA-α: Reducing Complexity in Vision-Language-Action Systems

2026

Abstract

Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for building general-purpose robotic agents. However, the VLA landscape remains highly fragmented and complex: existing approaches vary substantially in architectures, training data, embodiment configurations, and benchmark-specific engineering. In this work, we introduce StarVLA-α, a simple yet strong baseline designed to study VLA design choices under controlled conditions. StarVLA-α deliberately minimizes architectural and pipeline complexity to reduce experimental confounders and enable systematic analysis. Specifically, we re-evaluate several key design axes, including action modeling strategies, robot-specific pretraining, and interface engineering. Across unified multi-benchmark training on LIBERO, SimplerEnv, RoboTwin, and RoboCasa, the same simple baseline remains highly competitive, indicating that a strong VLM backbone combined with minimal design is already sufficient to a
