Abstract
Although natural language to SQL (NL2SQL) research has made significant progress in translating natural language instructions into executable SQL scripts for data querying and processing, fully automating the broader data science pipeline, which spans data querying, analysis, visualization, and reporting, remains a complex challenge. This study introduces SageCopilot, an industry-grade system that automates this pipeline by integrating Large Language Models (LLMs), Autonomous Agents (AutoAgents), and Language User Interfaces (LUIs). SageCopilot follows a two-phase architecture: an offline phase generates high-quality demonstrations for In-Context Learning (ICL), and an online phase uses those demonstrations to transform user inputs into executable scripts for database queries, analysis, and visualization tasks. It combines specialized components (NL2SQL, Text2Analyze, and Text2Viz) with chain-of-thought prompting for multi-turn interactions to automate the pipeline end to end. Experiments on real-world datasets demonstrate the system’s ability to minimize human intervention while ensuring correctness and user-friendly operation.
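
To make the two-phase idea concrete, the following is a minimal sketch of how an online phase might pick demonstrations from an offline-built pool and assemble an ICL prompt. It is not SageCopilot's implementation: the names `Demonstration`, `token_overlap`, `select_demonstrations`, and `build_prompt` are hypothetical, and the word-overlap (Jaccard) scoring is a simple stand-in for whatever retrieval method the system actually uses.

```python
from dataclasses import dataclass


@dataclass
class Demonstration:
    """Hypothetical offline-phase artifact: a question paired with a validated script."""
    question: str
    script: str


def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity over lowercase word sets (assumed stand-in for a real retriever)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def select_demonstrations(query: str, pool: list[Demonstration], k: int = 2) -> list[Demonstration]:
    """Return the k demonstrations whose questions are most similar to the query."""
    ranked = sorted(pool, key=lambda d: token_overlap(query, d.question), reverse=True)
    return ranked[:k]


def build_prompt(query: str, pool: list[Demonstration], k: int = 2) -> str:
    """Assemble a few-shot prompt: selected demonstrations first, then the open question."""
    demos = select_demonstrations(query, pool, k)
    parts = [f"Question: {d.question}\nSQL: {d.script}" for d in demos]
    parts.append(f"Question: {query}\nSQL:")
    return "\n\n".join(parts)
```

In this sketch the offline phase would be whatever process populates `pool` with validated question and script pairs; the online phase only ranks and formats them before the prompt is sent to an LLM.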