Automated incident response with AI agents on Kubernetes
TL;DR - Three agents investigate a production alert, diagnose root cause, and notify your team on Slack. Uses mock MCP servers for PagerDuty, Grafana, and Slack - all running locally. Total time: under 3 minutes. Total API cost: $0.00. All files are in the examples directory.
This showed up in our #incidents channel at 02:52 UTC:
Service: payments-api
Root Cause: PostgreSQL primary server failure leading to database connection issues and exhausted connection pool
Severity: High
Remediation:
- Promote the standby PostgreSQL replica to primary
- Restart the payments-api deployment
- Clear connection pool caches
Nobody on the team wrote that. An agent pipeline did. It pulled logs from Grafana,
found 312 error entries pointing to postgres-primary:5432 - connection refused,
correlated the spike with a failover event at 02:47, and posted the summary.
The PagerDuty alert got acknowledged automatically.
I wanted to see if I could wire this up with kubeswarm - three agents, each with different tool access, where the one reading logs can't post to Slack and the one posting to Slack can't read logs. Turns out you can, and it's not that much YAML.
Here's how I built it.
The setup
Three agents, one pipeline. I split it by access level on purpose - I don't want a single agent that can both read production data and take actions on it.
| Agent | Role | MCP Tools | What it does |
|---|---|---|---|
| Investigator | Gather evidence | Grafana (logs, metrics), PagerDuty (read) | Queries logs and metrics around the alert |
| Diagnostician | Analyze | None | Reads the evidence, identifies root cause |
| Notifier | Communicate | Slack (write), PagerDuty (acknowledge) | Posts findings, acks the alert |
The diagnostician is intentionally tool-less. It gets the investigator's report, figures out what went wrong, and passes that to the notifier. The agent that decides what happened should never be the same agent that has write access.
Prerequisites
You need a Kubernetes cluster with kubeswarm installed (quick-start guide) and Ollama running:
ollama pull qwen2.5:7b
Mock MCP servers
Obviously I'm not going to connect this to real PagerDuty and Grafana for a blog post. So I wrote three tiny mock MCP servers in Go - about 80 lines each, no dependencies outside stdlib. They return realistic canned data over HTTP.
The mock Grafana returns error logs with connection refused to postgres-primary:5432
and a metrics spike at 02:47 UTC. Classic postgres failover scenario.
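The mocks themselves are nothing fancy. Here's a sketch of the shape - not the exact code from the examples directory, just a stdlib-only HTTP handler serving canned evidence. The route name and JSON layout are illustrative; the real MCP wire format has more framing:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// Canned Grafana-style evidence: the same postgres failover
// the agents investigate in this post.
var logs = map[string]any{
	"entries": []map[string]string{
		{"ts": "2026-04-23T02:47:12Z", "level": "error",
			"msg": "dial tcp postgres-primary:5432: connection refused"},
		{"ts": "2026-04-23T02:47:45Z", "level": "error",
			"msg": "connection pool exhausted: 100/100 in use"},
	},
}

// queryLogs serves the canned evidence as JSON. Route name and
// response shape are my stand-ins, not the exact MCP framing.
func queryLogs(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(logs)
}

func main() {
	http.HandleFunc("/tools/query_logs", queryLogs)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Point the agent's MCP URL at a server like this and every query_logs call returns the same failover evidence, which makes runs reproducible.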
kubectl apply -f namespace.yaml
kubectl apply -f ollama-secret.yaml
kubectl apply -f mock-servers.yaml
kubectl get pods -n incident-response
NAME READY STATUS AGE
mock-grafana-7b4f8d6c5-x2k9p 1/1 Running 5s
mock-pagerduty-5c9d8e7f4-m3n7q 1/1 Running 5s
mock-slack-6a8b9c0d1-r4s6t 1/1 Running 5s
Source code for the mocks is in the examples directory if you want to look - they're straightforward.
The agents
Investigator - read-only access to observability:
# investigator-agent.yaml
spec:
  model: qwen2.5:7b
  prompt:
    inline: |
      You are an SRE investigator. When you receive a PagerDuty incident,
      gather evidence from logs and metrics. Output a structured JSON report
      with incident_id, error_pattern, timeline, and raw_evidence.
  tools:
    mcp:
      - name: pagerduty
        url: "http://mock-pagerduty.incident-response.svc:8080"
      - name: grafana
        url: "http://mock-grafana.incident-response.svc:8080"
  guardrails:
    limits:
      tokensPerCall: 3000
      timeoutSeconds: 90
    tools:
      allow:
        - "pagerduty/get_incident"
        - "grafana/*"
Note the tools.allow list. The agent can call grafana/query_logs and grafana/query_metrics, but if I added a Slack MCP server to this agent, the allowlist would still block every slack/* call. I like that the access control is declarative and visible in the YAML.
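I haven't read kubeswarm's matcher, but the semantics the YAML implies are easy to sketch: shell-style glob patterns checked against server/tool names. Using path.Match for the matching rule is my assumption, not the operator's actual code:

```go
package main

import (
	"fmt"
	"path"
)

// allowed reports whether a tool call like "grafana/query_logs"
// matches any allowlist pattern. Patterns use shell-style globbing,
// where "*" matches within a path segment but not across "/".
func allowed(tool string, allow []string) bool {
	for _, pat := range allow {
		if ok, _ := path.Match(pat, tool); ok {
			return true
		}
	}
	return false
}

func main() {
	allow := []string{"pagerduty/get_incident", "grafana/*"}
	fmt.Println(allowed("grafana/query_logs", allow)) // true
	fmt.Println(allowed("slack/post_message", allow)) // false
}
```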
Diagnostician - no tools, just reasoning:
# diagnostician-agent.yaml
spec:
  model: qwen2.5:7b
  prompt:
    inline: |
      You are a senior SRE diagnostician. Identify the root cause from the
      investigation report. Suggest a specific remediation command, not
      vague advice. Output JSON with root_cause, severity, remediation,
      and confidence.
  guardrails:
    limits:
      tokensPerCall: 3000
      timeoutSeconds: 120
No tools section at all. I went back and forth on whether this agent should have
access to anything. Decided no - the separation between "reading" and "deciding"
is the whole point.
Notifier - write access to Slack and PagerDuty:
# notifier-agent.yaml
spec:
  model: qwen2.5:7b
  prompt:
    inline: |
      You are an incident communications agent.
      You MUST call post_message with channel "#incidents" and a text
      summary. Then call acknowledge_incident with the incident_id.
  tools:
    mcp:
      - name: slack
        url: "http://mock-slack.incident-response.svc:8080"
      - name: pagerduty
        url: "http://mock-pagerduty.incident-response.svc:8080"
  guardrails:
    limits:
      tokensPerCall: 3000
      timeoutSeconds: 90
    tools:
      allow:
        - "slack/post_message"
        - "pagerduty/acknowledge_incident"
The MUST call in the prompt is not elegant, but with a 7B model you sometimes
need to be blunt. Bigger models follow the instructions without the shouting.
kubectl apply -f investigator-agent.yaml
kubectl apply -f diagnostician-agent.yaml
kubectl apply -f notifier-agent.yaml
kubectl get swarmagents -n incident-response
NAME MODEL REPLICAS READY AGE
incident-investigator qwen2.5:7b 1 1 5s
incident-diagnostician qwen2.5:7b 1 1 5s
incident-notifier qwen2.5:7b 1 1 5s
The pipeline
A SwarmTeam wires the three agents into a DAG. Diagnostician waits for investigator, notifier waits for diagnostician.
# incident-team.yaml
spec:
  roles:
    - name: investigator
      swarmAgent: incident-investigator
    - name: diagnostician
      swarmAgent: incident-diagnostician
    - name: notifier
      swarmAgent: incident-notifier
  pipeline:
    - role: investigator
      inputs:
        alert: "{{ .input.alert }}"
    - role: diagnostician
      dependsOn: [investigator]
      inputs:
        investigation: "{{ .steps.investigator.output }}"
        alert: "{{ .input.alert }}"
    - role: notifier
      dependsOn: [diagnostician]
      inputs:
        investigation: "{{ .steps.investigator.output }}"
        diagnosis: "{{ .steps.diagnostician.output }}"
The notifier gets both the investigation and the diagnosis as input, so it has full context when writing the Slack message.
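The {{ .steps.investigator.output }} syntax looks like Go text/template. Assuming that's what the operator uses (I haven't verified, and the context shape below is my guess from the YAML), the input wiring reduces to rendering each expression against a run context:

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// renderInput expands a pipeline input expression such as
// "{{ .steps.investigator.output }}" against the run context.
// The context layout (input / steps maps) is an assumption
// inferred from the SwarmTeam YAML, not kubeswarm internals.
func renderInput(expr string, ctx map[string]any) (string, error) {
	t, err := template.New("input").Parse(expr)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, ctx); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	ctx := map[string]any{
		"input": map[string]any{"alert": "P-48291: High error rate"},
		"steps": map[string]any{
			"investigator": map[string]any{"output": `{"incident_id":"P-48291"}`},
		},
	}
	out, _ := renderInput("{{ .steps.investigator.output }}", ctx)
	fmt.Println(out)
}
```

Each completed step just adds its output under .steps, so downstream roles can reference any upstream result by role name.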
kubectl apply -f incident-team.yaml
Running it
I fed it a simulated PagerDuty alert - high error rate on the payments API:
# sample-incident.yaml
spec:
  teamRef: incident-responder
  input:
    alert: |
      PagerDuty Incident P-48291: High error rate on payments-api
      Severity: High
      Service: payments-api
      Triggered: 2026-04-23T02:45:00Z
      Description: Error rate on payments-api exceeded 5% threshold.
      Current rate: 12.3% 5xx responses over the last 5 minutes.
      Alert count: 47 alerts in the last 8 minutes
kubectl apply -f sample-incident.yaml
kubectl get swarmrun incident-001 -n incident-response -w
NAME PHASE AGE
incident-001 Pending 0s
incident-001 Running 2s
incident-001 Succeeded 2m18s
Here's what each agent actually produced.
Investigator - it called all three MCP tools (get_incident, query_logs,
query_metrics) and pulled together this report:
{
  "incident_id": "P-48291",
  "service": "payments-api",
  "error_pattern": "Connection timeout and pool exhaustion errors involving PostgreSQL primary server failure",
  "timeline": "Incident started at 02:47 UTC when a postgres failover occurred, leading to an error rate spike from 0.1% to 12.3% by 02:50 UTC",
  "metrics_summary": "Error rate on payments-api exceeded baseline starting at 02:47 UTC during a PostgreSQL failover event, peaking at 12.3%"
}
I checked the audit logs - every data point traces back to the mock Grafana response. No hallucination.
Diagnostician - correctly identified the root cause:
{
  "root_cause": "PostgreSQL primary server failure leading to database connection issues and exhausted connection pool",
  "severity": "high",
  "blast_radius": "payments-api service experiencing high error rates (12.3%)",
  "remediation": "Promote the standby PostgreSQL replica to primary and restart the payments-api deployment to clear connection pool caches",
  "runbook": "",
  "confidence": "high"
}
The empty runbook field is correct - none was provided in the evidence, and I
explicitly told the model not to invent URLs. (It tried to on an earlier run.
Small models love making up links.)
Notifier - the mock Slack server logged this:
kubectl logs -n incident-response deploy/mock-slack
========================================
SLACK MESSAGE to #incidents
========================================
*Incident Summary*
**Service:** payments-api
**Root Cause:** PostgreSQL primary server failure leading to
database connection issues and exhausted connection pool
**Severity:** High
**Blast Radius:** payments-api service experiencing high error rates (12.3%)
*Remediation Steps:*
- Promote the standby PostgreSQL replica to primary
- Restart the payments-api deployment
- Clear connection pool caches
- Monitor for any persistent issues
========================================
And the PagerDuty mock logged the acknowledgment. Both tools called, both actions completed.
Policy
Once this works, you want to make sure nobody deploys a rogue agent with shell access in this namespace:
# incident-policy.yaml
spec:
  enforcementMode: Enforce
  limits:
    maxDailyTokens: 200000
    maxTokensPerCall: 3000
    maxTimeoutSeconds: 600
  tools:
    deny:
      - "shell/*"
      - "filesystem/*"
  models:
    allowed:
      - "qwen*"
      - "llama*"
| Rule | Why |
|---|---|
| tools.deny: shell/*, filesystem/* | Agents can use MCP tools but never execute commands or write files |
| models.allowed: qwen*, llama* | Only approved local models - no surprise API bills |
| maxDailyTokens: 200000 | 200K tokens/day ceiling - enough for ~20 incidents |
| maxTokensPerCall: 3000 | No single LLM call burns more than 3K tokens |
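The two token limits compose like a simple admission check. This is my sketch of the semantics, not kubeswarm's actual accounting (which presumably also handles the daily reset and per-namespace scoping):

```go
package main

import "fmt"

// budget sketches maxDailyTokens / maxTokensPerCall semantics:
// reject any call over the per-call cap, and any call that would
// push cumulative usage past the daily ceiling.
type budget struct {
	maxDaily, maxPerCall, used int
}

func (b *budget) admit(tokens int) error {
	if tokens > b.maxPerCall {
		return fmt.Errorf("call of %d tokens exceeds per-call limit %d", tokens, b.maxPerCall)
	}
	if b.used+tokens > b.maxDaily {
		return fmt.Errorf("daily budget exhausted: %d/%d used", b.used, b.maxDaily)
	}
	b.used += tokens
	return nil
}

func main() {
	b := &budget{maxDaily: 200000, maxPerCall: 3000}
	fmt.Println(b.admit(3000)) // admitted
	fmt.Println(b.admit(4000)) // rejected: over the per-call cap
}
```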
Going further
To connect real services, swap the mock URLs:
tools:
  mcp:
    - name: pagerduty
      url: "http://pagerduty-mcp.incident-response.svc:8080"
      auth:
        type: bearer
        secretRef:
          name: pagerduty-api-key
To trigger automatically from PagerDuty webhooks instead of kubectl apply,
use a SwarmEvent:
spec:
  source:
    type: webhook
  targets:
    - team: incident-responder
      inputs:
        alert: "{{ .trigger.body.incident.title }}: {{ .trigger.body.incident.description }}"
  concurrencyPolicy: Allow
The operator generates a webhook URL. Point PagerDuty at it and every incident fires the pipeline.
Numbers
| Metric | Value |
|---|---|
| Agents | 3 (investigator + diagnostician + notifier) |
| Model | qwen2.5:7b (other local models should work too) |
| Time per incident | ~2.5 minutes |
| Tokens per incident | ~10,000 |
| Cost per incident | $0.00 (local model) |
| Tools | PagerDuty, Grafana, Slack via MCP |
| Guardrails | Per-agent tool allowlists, namespace policy, token budgets |
The 2.5 minutes is mostly Ollama thinking on my laptop. With a faster model or GPU, this would be well under a minute.
Cleanup
kubectl delete namespace incident-response
All the files are in the cookbook. The docs have more on MCP integration and SwarmEvents.
kubeswarm is an open-source Kubernetes operator for managing AI agents. Docs | Cookbook | GitHub