Bridging Speech and Perception: A Multimodal AI Framework for Autonomous Robots

Abstract

This work presents an integrated framework for human-robot interaction that combines natural language processing, computer vision, and autonomous navigation so that a mobile robot can carry out tasks given as voice commands. The system uses Whisper for speech-to-text conversion, a language model served through Ollama for language understanding, YOLO for real-time object detection, and ROS 2 for robot control. A proof-of-concept on the AgileX LIMO robot demonstrates that the system can interpret spoken instructions, detect the relevant objects, and execute the corresponding movements, such as rotating, navigating, and approaching a target. The results illustrate how perception and robot control can be integrated to support intuitive interaction between humans and robots. The approach is a step toward translating human intent into executable robot behavior, with potential applications in service robotics, assistive systems, and automation.
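The last two stages of the pipeline described above, turning the language model's output into a robot command and then into a motion, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not taken from the paper: the JSON reply format, the names `Command`, `Velocity`, `parse_command`, and `plan_velocity`, the detection format (label mapped to bounding-box centre and width in pixels), and all gains and thresholds. Speech recognition, the language model call, and object detection are deliberately left out; the sketch starts from their outputs.

```python
import json
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Hypothetical action vocabulary; the abstract names rotating, navigating, approaching.
ACTIONS = {"rotate", "navigate", "approach", "stop"}


@dataclass
class Command:
    action: str
    target: Optional[str] = None


@dataclass
class Velocity:
    linear: float   # forward speed in m/s
    angular: float  # turn rate in rad/s, counter-clockwise positive


def parse_command(llm_reply: str) -> Command:
    """Parse an assumed JSON reply such as {"action": "approach", "target": "bottle"}.

    Anything malformed or outside the vocabulary falls back to "stop".
    """
    try:
        data = json.loads(llm_reply)
    except json.JSONDecodeError:
        return Command("stop")
    if not isinstance(data, dict):
        return Command("stop")
    action = data.get("action")
    if not isinstance(action, str) or action not in ACTIONS:
        return Command("stop")
    return Command(action, data.get("target"))


def plan_velocity(
    cmd: Command,
    detections: Dict[str, Tuple[float, float]],
    image_width: int = 640,
) -> Velocity:
    """Map a command and the current detections to a velocity.

    detections maps a label to (box centre x in pixels, box width in pixels).
    All gains and thresholds are illustrative values.
    """
    if cmd.action == "rotate":
        return Velocity(0.0, 0.5)
    if cmd.action == "navigate":
        return Velocity(0.2, 0.0)
    if cmd.action == "approach":
        box = detections.get(cmd.target)
        if box is None:
            # Target not visible: turn slowly in place to search for it.
            return Velocity(0.0, 0.3)
        x_center, box_width = box
        # Horizontal offset of the target from the image centre, in [-1, 1].
        offset = (x_center - image_width / 2) / (image_width / 2)
        # A target to the right (positive offset) needs a clockwise turn.
        angular = -0.8 * offset
        # A box filling more than half the image is treated as "close enough".
        linear = 0.0 if box_width / image_width > 0.5 else 0.15
        return Velocity(linear, angular)
    return Velocity(0.0, 0.0)  # "stop"
```

For example, `plan_velocity(parse_command('{"action": "approach", "target": "bottle"}'), {"bottle": (480.0, 100.0)})` yields a slow forward motion with a clockwise turn, because the bottle sits right of the image centre. In a ROS 2 system, such a pair of values would typically be copied into a `geometry_msgs/Twist` message and published on the robot's velocity topic; how the paper's implementation does this is not stated in the abstract.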
