A Multi-Agent Agentic Framework for Autonomous Cloud Infrastructure Monitoring, Anomaly Detection, and Self-Healing

Abstract

Modern cloud infrastructure management demands continuous vigilance across distributed services, log streams, and resource metrics at a scale that overwhelms human operators and rule-based systems alike. This paper presents a novel agentic, multi-tenant platform that deploys seven specialized AI agents—Log Intelligence, Crash Diagnostic, Resource Opti-mization, Anomaly Detection, Recovery, Recommendation, and Cost Optimization—coordinated by a central orchestrator. Each agent is backed by large language models (LLMs) accessed through LangChain, supporting pluggable providers including Google Gemini, OpenAI GPT, Anthropic Claude, and Groq. Asynchronous inter-agent communication is realized via Rab-bitMQ message queues, real-time state propagation through Redis Pub/Sub, and persistent storage on MongoDB. A statistical logfiltering pipeline reduces raw CloudWatch log volume by up to 98.8% before LLM inference, making the system economi-cally viable at production scale. Confidence-gated decision logic governs autonomous recovery actions: high-confidence diagnoses trigger immediate auto-healing, while low-confidence scenarios escalate to collaborative multi-agent analysis or human review. Experimental results demonstrate 93.9% anomaly detection pre-cision, 85% recovery action accuracy, and a full-pipeline median latency of 20.3 seconds from log ingest to completed remediation, establishing our framework as a practical foundation for next-generation AIOps platforms.

Related papers

Ranked by semantic similarity — how closely each paper's abstract matches this one (100% = near-identical topic).