How I Built an AI Agent That Actually Manages My Infrastructure
Most "AI-powered" DevOps tools do one thing really well: send you a Slack message.
"Oh no, your database is at 90% CPU." Thanks. I can see that on Grafana. What I can't do is fix it at 2am when I'm asleep.
That's the gap Sanskriti Labs was built to fill. Not another dashboard. Not another alert. An AI agent that actually makes decisions and executes — the same decisions a senior DevOps engineer would make, just without the coffee and the attitude.
Here's what I learned building it.
The Problem With Current DevOps AI
Let me be specific about what was broken.
Traditional monitoring gives you data. You still have to notice the alert, work out what is wrong, apply a fix, and verify that the fix worked.
That's a full cognitive loop that assumes a human is available and awake. For a system running 24/7, that's a terrible architectural decision.
What I wanted: an AI that could handle the loop autonomously. See a problem → understand it → act → verify the fix worked. No human in the loop.
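As a rough illustration of that loop, here is a minimal sketch in Python. The names (`Incident`, `run_once`, and the four callbacks) are hypothetical and not taken from Sanskriti Labs; the point is only that verification is a first-class step and that a fix that did not help gets escalated.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Incident:
    service: str
    metric: str
    value: float


def run_once(
    detect: Callable[[], Optional[Incident]],
    diagnose: Callable[[Incident], str],
    act: Callable[[str, Incident], None],
    verify: Callable[[Incident], bool],
) -> str:
    """Run one pass of the see, understand, act, verify loop."""
    incident = detect()
    if incident is None:
        return "healthy"
    action = diagnose(incident)
    act(action, incident)
    # Verification closes the loop: an action that did not help is
    # escalated rather than assumed to have worked.
    return "resolved" if verify(incident) else "escalate"
```

In a real deployment each callback would wrap a metrics query or an infrastructure API call; here they are plain functions so the control flow can be tested in isolation.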
What Sanskriti Labs Actually Does
The system has four core capabilities that run simultaneously:
1. Autonomous Container Management
The AI monitors Kubernetes pod health, resource utilization, and scheduling patterns. When the ML model predicts that a pod will hit its memory limit within the next 30 minutes, the agent acts before the limit is reached: it evicts less critical workloads, reschedules pods to nodes with available memory, and scales the deployment before users see any degradation.
It does this without a human typing a single kubectl command.
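To make the "30 minutes ahead" idea concrete, here is a minimal sketch that swaps the ML model for a plain least-squares trend line over recent memory samples. This is a stand-in, not the model described above, and the function names and the sample format are assumptions.

```python
from typing import Optional, Sequence, Tuple


def minutes_until_limit(
    samples: Sequence[Tuple[float, float]], limit_bytes: float
) -> Optional[float]:
    """Estimate minutes until memory usage reaches limit_bytes.

    samples holds (minute, bytes_used) pairs, oldest first.
    Returns 0.0 if usage is already at the limit, and None when there is
    too little data or usage is flat or falling.
    """
    if len(samples) < 2:
        return None
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    # Least-squares slope: bytes gained per minute.
    slope = sum((t - mean_t) * (m - mean_m) for t, m in samples) / var_t
    _, last_m = samples[-1]
    if last_m >= limit_bytes:
        return 0.0
    if slope <= 0:
        return None
    return (limit_bytes - last_m) / slope


def should_act(
    samples: Sequence[Tuple[float, float]],
    limit_bytes: float,
    horizon_minutes: float = 30.0,
) -> bool:
    """True when the projected breach falls inside the horizon."""
    eta = minutes_until_limit(samples, limit_bytes)
    return eta is not None and eta <= horizon_minutes
```

A linear trend misses periodic patterns, which is exactly where a learned model earns its keep, but it is a reasonable baseline to compare against.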
2. Database Query Optimization
PostgreSQL query performance degrades as data grows. The AI analyzes slow query logs, identifies queries that are doing full table scans, looks at the query execution plan, and either suggests or automatically creates missing indexes.
But here's the thing: it doesn't just create indexes blindly. It runs EXPLAIN ANALYZE on the affected query with the proposed index in place, verifies that the query planner actually uses it, and only applies the index if the estimated cost drops significantly. Then it monitors query performance for 48 hours post-change to confirm the improvement stuck.
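The "only apply it if the planner uses it and the cost drops" check can be sketched as a pure function over two query plans. This assumes the plans for the same query, without and with the candidate index, were captured using PostgreSQL's `EXPLAIN (FORMAT JSON)`. The function name and the 50% default for "significantly" are assumptions for illustration.

```python
import json
from typing import Any, Dict, Iterator


def _walk(plan: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield a plan node and all of its descendants."""
    yield plan
    for child in plan.get("Plans", []):
        yield from _walk(child)


def index_is_worth_applying(
    before_json: str,
    after_json: str,
    index_name: str,
    min_cost_drop: float = 0.5,  # assumed threshold, tune per workload
) -> bool:
    """Accept the index only if the planner uses it and cost drops enough."""
    before = json.loads(before_json)[0]["Plan"]
    after = json.loads(after_json)[0]["Plan"]
    # Index scans report the index they use under "Index Name".
    uses_index = any(
        node.get("Index Name") == index_name for node in _walk(after)
    )
    if not uses_index or before["Total Cost"] <= 0:
        return False
    drop = 1.0 - after["Total Cost"] / before["Total Cost"]
    return drop >= min_cost_drop
```

Planner cost is an estimate, not a measurement, which is why the post-change monitoring window still matters.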
3. Resource Scaling
Kubernetes HPA (Horizontal Pod Autoscaler) is reactive. It scales when utilization is already high. Sanskriti Labs uses a predictive model trained on historical traffic patterns — it sees the traffic spike before it arrives and scales proactively.
For a service that gets traffic spikes from Twitter or Hacker News, this means the system is already scaled before the wave hits, not drowning in requests while it scales up.
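Once a forecast exists, turning it into a replica count is simple arithmetic. The sketch below assumes a per-replica capacity figure and a safety headroom, both of which you would have to measure for your own service. The function name and defaults are hypothetical.

```python
import math
from typing import Sequence


def replicas_for_forecast(
    predicted_rps: Sequence[float],
    rps_per_replica: float,
    headroom: float = 0.2,
    min_replicas: int = 1,
    max_replicas: int = 50,
) -> int:
    """Size the deployment for the forecast peak plus some headroom."""
    if rps_per_replica <= 0:
        raise ValueError("rps_per_replica must be positive")
    peak = max(predicted_rps, default=0.0)
    needed = math.ceil(peak * (1.0 + headroom) / rps_per_replica)
    # Clamp so a bad forecast cannot scale to zero or to infinity.
    return max(min_replicas, min(max_replicas, needed))
```

The clamp is worth keeping even in a sketch: a predictive scaler with no upper bound turns one bad forecast into a very large bill.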
4. Natural Language Command Interface
The crown jewel: you can talk to it.
"Can you check if any services are running low on memory?"
"What's our database P99 query latency over the last week?"
"Scale the API service to 10 replicas before the product launch at 3pm"
It interprets the intent, translates it to actual infrastructure actions, and executes. No more forgetting which CLI flag does what.
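Whatever does the language understanding, the output has to be a structured, validated command before anything touches the cluster. The sketch below uses a regular expression purely as a stand-in for the intent parser; it is not how Sanskriti Labs interprets requests, and `ScaleCommand` and `parse_scale_command` are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScaleCommand:
    service: str
    replicas: int


# Matches phrases like "scale the API service to 10 replicas".
_SCALE = re.compile(
    r"scale\s+(?:the\s+)?(?P<service>[\w-]+)(?:\s+service)?"
    r"\s+to\s+(?P<replicas>\d+)\s+replicas?",
    re.IGNORECASE,
)


def parse_scale_command(text: str) -> Optional[ScaleCommand]:
    """Return a structured command, or None if the text is not a scale request."""
    match = _SCALE.search(text)
    if match is None:
        return None
    return ScaleCommand(
        service=match.group("service").lower(),
        replicas=int(match.group("replicas")),
    )
```

Keeping the structured command separate from the parser means the execution path can be tested, audited, and gated on its own, regardless of how the text was interpreted.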
The Tech Stack (And Why I Chose Each Piece)
The stack is Python-heavy because that's where the ML ecosystem lives:
TensorFlow for the predictive scaling model. I tried a few approaches — ARIMA for time series, Prophet for trend detection — but a simple LSTM neural network outperformed both for multi-step traffic prediction. The signal has multiple periodicities (daily, weekly, plus event-driven spikes) and LSTMs handle that better.
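Multi-step prediction means each training example pairs a window of past values with the next several values. Here is a minimal, framework-free sketch of that windowing step. The function name and parameters are assumptions, and the resulting lists would still need to be converted to tensors before being fed to an LSTM.

```python
from typing import List, Sequence, Tuple


def make_windows(
    series: Sequence[float], lookback: int, horizon: int
) -> Tuple[List[List[float]], List[List[float]]]:
    """Split a series into (past window, future window) training pairs.

    lookback is how many past points the model sees;
    horizon is how many future points it must predict.
    """
    if lookback < 1 or horizon < 1:
        raise ValueError("lookback and horizon must be at least 1")
    inputs: List[List[float]] = []
    targets: List[List[float]] = []
    last_start = len(series) - lookback - horizon
    for start in range(last_start + 1):
        split = start + lookback
        inputs.append(list(series[start:split]))
        targets.append(list(series[split:split + horizon]))
    return inputs, targets
```

For traffic with daily and weekly cycles, the lookback has to be long enough to contain at least one full cycle, otherwise the model never sees the pattern it is supposed to learn.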
FastAPI for the API layer. Async, Pydantic for validation, OpenAPI docs generated automatically. The AI needs to talk to multiple systems and FastAPI handles the coordination layer cleanly.
Docker + Kubernetes for orchestration. Obviously. The AI agent itself runs as a pod in the cluster and goes through the same Kubernetes API everyone else uses. There is no side channel into the nodes; whatever it is allowed to do has to be granted to its service account through standard Kubernetes RBAC.
PostgreSQL + Redis for data storage. PostgreSQL for structured operational data (metrics, logs, decisions made). Redis for the caching layer — the AI needs fast access to recent metrics to make decisions in real time.
React for the dashboard. Not because it's the best for this (it probably isn't), but because it's what I know and the dashboard is an internal tool, not a product.
The Results (The Numbers That Matter)
Here's what the system delivers in production:
The most surprising result: I took a vacation for the first time in two years and the system ran without calling me once.
What I Got Wrong (So You Don't Have To)
Over-engineering the ML models first. I spent three weeks tuning the LSTM when a simple threshold model would have been good enough for 80% of the use cases. Start simple. Add ML complexity only when the simple approach fails.
Not planning for failure modes early enough. What happens when the AI makes a bad decision? I had to retrofit a human-in-the-loop approval system for critical actions. Plan for this from day one.
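A human-in-the-loop gate can be very small if it is designed in from the start. The sketch below is one possible shape, with hypothetical names and a made-up list of critical action kinds; which actions count as critical is a judgment call for your own system.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class Action:
    kind: str
    target: str


# Assumed examples of actions that should never run unattended.
CRITICAL_KINDS: FrozenSet[str] = frozenset(
    {"drop_index", "evict_pod", "scale_down"}
)


def execute_with_gate(
    action: Action,
    execute: Callable[[Action], None],
    request_approval: Callable[[Action], bool],
    critical_kinds: FrozenSet[str] = CRITICAL_KINDS,
) -> str:
    """Run routine actions directly; ask a human before critical ones."""
    if action.kind in critical_kinds and not request_approval(action):
        return "blocked"
    execute(action)
    return "executed"
```

Here `request_approval` could post to a chat channel and wait for a reply; the important part is that the gate sits in the execution path, so no critical action can bypass it.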
Ignoring the cost of inference. Running a TensorFlow model for every decision sounds fine until you see the AWS bill. Most decisions in infrastructure are rule-based. Save the ML model for the decisions that actually need it.
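One way to keep inference costs down is to try cheap rules first and call the model only when no rule applies. A minimal sketch, with hypothetical names and a dictionary of metrics as the assumed input:

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

Metrics = Dict[str, float]
Rule = Callable[[Metrics], Optional[str]]


def decide(
    metrics: Metrics,
    rules: Sequence[Rule],
    model: Callable[[Metrics], str],
) -> Tuple[str, str]:
    """Return (action, source), where source is "rule" or "model"."""
    for rule in rules:
        action = rule(metrics)
        if action is not None:
            # A rule matched, so the expensive model is never invoked.
            return action, "rule"
    return model(metrics), "model"
```

Returning the source alongside the action also makes it easy to measure what fraction of decisions actually needed the model.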
How to Build Something Similar
If you want to build this for your own infrastructure, here's the order I'd do it in:
Week 1-2: Data collection. You can't automate what you can't measure. Get Prometheus exporting metrics. Get structured logs into a searchable store. Build the data layer first.
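For the structured-logs half of that step, the Python standard library is enough to get started. The sketch below emits one JSON object per log line; the `fields` attribute is a convention invented for this example, not a `logging` feature.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Extra key/value pairs passed via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)
```

Attach it to a handler with `handler.setFormatter(JsonFormatter())`, then log with `logger.info("pod restarted", extra={"fields": {"pod": "api-1"}})` so that every field is searchable downstream.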
Week 3-4: Simple alerting with ML. Start with threshold-based alerts, which are reactive and fire only after the problem has started. Then layer in the ML model to predict the breach before the threshold is hit.
Week 5-6: Autonomous actions. Pick one action (restart a failing pod, create an index, scale a deployment) and make the AI do it autonomously. Monitor every decision it makes.
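"Monitor every decision" is easiest when the record is written by the same code path that executes the action. A minimal in-memory sketch follows; the class and field names are hypothetical, and a real system would persist the records instead of keeping them in a list.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(frozen=True)
class DecisionRecord:
    action: str
    target: str
    reason: str
    outcome: str


@dataclass
class DecisionLog:
    records: List[DecisionRecord] = field(default_factory=list)

    def run(
        self, action: str, target: str, reason: str, fn: Callable[[], None]
    ) -> DecisionRecord:
        """Execute fn and record what was done, why, and how it ended."""
        try:
            fn()
            outcome = "ok"
        except Exception as exc:
            # Record the failure instead of losing it.
            outcome = f"error: {exc}"
        record = DecisionRecord(action, target, reason, outcome)
        self.records.append(record)
        return record
```

Storing the reason next to the outcome is what makes the log useful later: it lets you ask not just what the agent did, but whether its reasoning was sound.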
Week 7+: The command interface. Build the natural language layer last. It's the most impressive part but the least valuable in isolation.
The key insight: you don't need a sophisticated AI to get 80% of the value. A well-tuned threshold model plus a few well-crafted automation rules will get you further, faster, than building a general AI agent.
The Bigger Picture
The goal isn't to replace DevOps engineers. It's to offload the 2am pages that drain energy and focus. The system handles the predictable problems. You handle the interesting ones.
When I look at Sanskriti Labs, I don't see a replacement for my brain. I see a teammate that never sleeps, never gets tired, and never makes mistakes from fatigue.
That's the future of DevOps. Not more dashboards. More AI teammates.
Built something similar or want to talk through the architecture? Get in touch.