Abstract
The growing complexity of enterprise networks is straining traditional approaches to fault management. This paper proposes a multi-agent large language model (LLM) framework that autonomously detects, diagnoses, and resolves network faults. The framework comprises four agents: an Ingestion Agent, a Diagnostic Agent, a Remediation Agent, and an Oversight Agent. Each agent uses an LLM with retrieval-augmented generation (RAG) to draw on network context and vendor-specific knowledge for its stage of the fault resolution process. The framework was evaluated on a dataset of network faults from three vendors. It autonomously resolved 91.4% of common faults and reduced the mean time to remediate by 68%, with a false-positive remediation rate below 2.3%. These results indicate that the framework can resolve network faults autonomously with high reliability and safety.
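
As a rough illustration of how four such agents might be chained, the following minimal Python sketch passes a shared case record through the stages in order. Everything in it is an assumption made for illustration: the `FaultCase` fields, the function names, the keyword rules that stand in for LLM and RAG calls, and in particular the role given to the Oversight Agent (approving a plan before it is applied), which the abstract does not specify.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class FaultCase:
    """Hypothetical record handed from one agent to the next."""
    raw_event: str
    summary: Optional[str] = None    # filled by the ingestion stage
    diagnosis: Optional[str] = None  # filled by the diagnostic stage
    plan: Optional[str] = None       # filled by the remediation stage
    approved: bool = False           # set by the oversight stage
    trace: List[str] = field(default_factory=list)


def ingestion_agent(case: FaultCase) -> FaultCase:
    # Stand-in for an LLM+RAG call: normalize the raw event text.
    case.summary = case.raw_event.strip().lower()
    case.trace.append("ingestion")
    return case


def diagnostic_agent(case: FaultCase) -> FaultCase:
    # Stand-in rule: a real agent would reason over retrieved context.
    case.diagnosis = "link_down" if "link down" in (case.summary or "") else "unknown"
    case.trace.append("diagnostic")
    return case


def remediation_agent(case: FaultCase) -> FaultCase:
    # Propose a plan only for diagnoses this sketch knows about.
    case.plan = "bounce_interface" if case.diagnosis == "link_down" else None
    case.trace.append("remediation")
    return case


def oversight_agent(case: FaultCase) -> FaultCase:
    # Assumed role: approve only when a concrete plan exists.
    case.approved = case.plan is not None
    case.trace.append("oversight")
    return case


# Assumed stage order, following the order the agents are listed in.
PIPELINE: List[Callable[[FaultCase], FaultCase]] = [
    ingestion_agent,
    diagnostic_agent,
    remediation_agent,
    oversight_agent,
]


def run_pipeline(raw_event: str) -> FaultCase:
    """Run one event through all four stages and return the final record."""
    case = FaultCase(raw_event=raw_event)
    for agent in PIPELINE:
        case = agent(case)
    return case
```

Calling `run_pipeline("Interface Gi0/1 link down")` yields a record whose `trace` lists the four stages in order. In this sketch, an event the diagnostic stub does not recognize ends with no plan and `approved` left `False`, which is one simple way a final stage could keep unvetted actions from running.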