Abstract
This paper introduces LOOK TRACK VISION, a multimodal Human-Computer Interaction (HCI) system for hands-free computer control using gaze direction, blink gestures, and voice commands. The system uses a standard webcam and a Convolutional Neural Network (CNN) to classify gaze direction into five classes (LEFT, RIGHT, UP, DOWN, CENTER), and it uses MediaPipe facial landmarks to estimate blinks and head pose. The multimodal interface combines gaze-based cursor control, blink-based clicking, voice-to-text conversion, and an assistive Communicator module. Because the solution requires no specialized hardware, it is low-cost and portable, and it is suitable for users with motor impairments. In experiments, the proposed CNN achieved approximately 95% accuracy on webcam input under various lighting conditions. The desktop application developed for the system also demonstrates its usability, robustness, and potential as a comprehensive HCI solution.