
Automated incident response with AI agents on Kubernetes

incident-response · multi-agent · mcp

TL;DR - Three agents investigate a production alert, diagnose root cause, and notify your team on Slack. Uses mock MCP servers for PagerDuty, Grafana, and Slack - all running locally. Total time: under 3 minutes. Total API cost: $0.00. All files are in the examples directory.


This showed up in our #incidents channel at 02:52 UTC:

Service: payments-api
Root Cause: PostgreSQL primary server failure leading to database connection issues and exhausted connection pool
Severity: High
Remediation:

  • Promote the standby PostgreSQL replica to primary
  • Restart the payments-api deployment
  • Clear connection pool caches

Nobody on the team wrote that. An agent pipeline did. It pulled logs from Grafana, found 312 error entries pointing to postgres-primary:5432 - connection refused, correlated the spike with a failover event at 02:47, and posted the summary. The PagerDuty alert got acknowledged automatically.

I wanted to see if I could wire this up with kubeswarm - three agents, each with different tool access, where the one reading logs can't post to Slack and the one posting to Slack can't read logs. Turns out you can, and it's not that much YAML.

Here's how I built it.

The setup

Three agents, one pipeline. I split it by access level on purpose - I don't want a single agent that can both read production data and take actions on it.

| Agent | Role | MCP Tools | What it does |
|---|---|---|---|
| Investigator | Gather evidence | Grafana (logs, metrics), PagerDuty (read) | Queries logs and metrics around the alert |
| Diagnostician | Analyze | None | Reads the evidence, identifies root cause |
| Notifier | Communicate | Slack (write), PagerDuty (acknowledge) | Posts findings, acks the alert |

The diagnostician is intentionally tool-less. It gets the investigator's report, figures out what went wrong, and passes that to the notifier. The agent that decides what happened should never be the same agent that has write access.

Prerequisites

You need a Kubernetes cluster with kubeswarm installed (quick-start guide) and Ollama running:

ollama pull qwen2.5:7b

Mock MCP servers

Obviously I'm not going to connect this to real PagerDuty and Grafana for a blog post. So I wrote three tiny mock MCP servers in Go - about 80 lines each, no dependencies outside stdlib. They return realistic canned data over HTTP.

The mock Grafana returns error logs with connection refused to postgres-primary:5432 and a metrics spike at 02:47 UTC. Classic postgres failover scenario.
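For a feel of how small these are, here's a condensed sketch of the mock Grafana server. The endpoint path and field names are illustrative, not necessarily what's in the examples directory:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// logsPayload is the canned evidence the investigator will see. Field names
// here are illustrative; the real mock lives in the examples directory.
func logsPayload() map[string]any {
	return map[string]any{
		"error_count": 312,
		"pattern":     "connection refused to postgres-primary:5432",
		"spike_start": "02:47 UTC", // lines up with the failover event
	}
}

func main() {
	http.HandleFunc("/query_logs", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(logsPayload())
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

The real mocks also serve the metrics and incident endpoints, but the pattern is the same: one handler per MCP tool, canned JSON out.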

kubectl apply -f namespace.yaml
kubectl apply -f ollama-secret.yaml
kubectl apply -f mock-servers.yaml
kubectl get pods -n incident-response
NAME                              READY   STATUS    AGE
mock-grafana-7b4f8d6c5-x2k9p     1/1     Running   5s
mock-pagerduty-5c9d8e7f4-m3n7q   1/1     Running   5s
mock-slack-6a8b9c0d1-r4s6t       1/1     Running   5s

Source code for the mocks is in the examples directory if you want to look - they're straightforward.

The agents

Investigator - read-only access to observability:

# investigator-agent.yaml
spec:
  model: qwen2.5:7b
  prompt:
    inline: |
      You are an SRE investigator. When you receive a PagerDuty incident,
      gather evidence from logs and metrics. Output a structured JSON report
      with incident_id, error_pattern, timeline, and raw_evidence.
  tools:
    mcp:
      - name: pagerduty
        url: "http://mock-pagerduty.incident-response.svc:8080"
      - name: grafana
        url: "http://mock-grafana.incident-response.svc:8080"
  guardrails:
    limits:
      tokensPerCall: 4000
      timeoutSeconds: 90
    tools:
      allow:
        - "pagerduty/get_incident"
        - "grafana/*"

Note the tools.allow list. The investigator can call grafana/query_logs and grafana/query_metrics (both match grafana/*), but even if I added a slack MCP server to this agent, none of its tools would match the allow list, so every call would be blocked. I like that the access control is declarative and visible in the YAML.

Diagnostician - no tools, just reasoning:

# diagnostician-agent.yaml
spec:
  model: qwen2.5:7b
  prompt:
    inline: |
      You are a senior SRE diagnostician. Identify the root cause from the
      investigation report. Suggest a specific remediation command, not
      vague advice. Output JSON with root_cause, severity, remediation,
      and confidence.
  guardrails:
    limits:
      tokensPerCall: 3000
      timeoutSeconds: 120

No tools section at all. I went back and forth on whether this agent should have access to anything. Decided no - the separation between "reading" and "deciding" is the whole point.

Notifier - write access to Slack and PagerDuty:

# notifier-agent.yaml
spec:
  model: qwen2.5:7b
  prompt:
    inline: |
      You are an incident communications agent.
      You MUST call post_message with channel "#incidents" and a text
      summary. Then call acknowledge_incident with the incident_id.
  tools:
    mcp:
      - name: slack
        url: "http://mock-slack.incident-response.svc:8080"
      - name: pagerduty
        url: "http://mock-pagerduty.incident-response.svc:8080"
  guardrails:
    limits:
      tokensPerCall: 3000
      timeoutSeconds: 90
    tools:
      allow:
        - "slack/post_message"
        - "pagerduty/acknowledge_incident"

The MUST call in the prompt is not elegant, but with a 7B model you sometimes need to be blunt. Bigger models follow the instructions without the shouting.

kubectl apply -f investigator-agent.yaml
kubectl apply -f diagnostician-agent.yaml
kubectl apply -f notifier-agent.yaml

kubectl get swarmagents -n incident-response
NAME                      MODEL        REPLICAS   READY   AGE
incident-investigator     qwen2.5:7b   1          1       5s
incident-diagnostician    qwen2.5:7b   1          1       5s
incident-notifier         qwen2.5:7b   1          1       5s

The pipeline

A SwarmTeam wires the three agents into a DAG. Diagnostician waits for investigator, notifier waits for diagnostician.

# incident-team.yaml
spec:
  roles:
    - name: investigator
      swarmAgent: incident-investigator
    - name: diagnostician
      swarmAgent: incident-diagnostician
    - name: notifier
      swarmAgent: incident-notifier
  pipeline:
    - role: investigator
      inputs:
        alert: "{{ .input.alert }}"
    - role: diagnostician
      dependsOn: [investigator]
      inputs:
        investigation: "{{ .steps.investigator.output }}"
        alert: "{{ .input.alert }}"
    - role: notifier
      dependsOn: [diagnostician]
      inputs:
        investigation: "{{ .steps.investigator.output }}"
        diagnosis: "{{ .steps.diagnostician.output }}"

The notifier gets both the investigation and the diagnosis as input, so it has full context when writing the Slack message.

kubectl apply -f incident-team.yaml

Running it

I fed it a simulated PagerDuty alert - high error rate on the payments API:

# sample-incident.yaml
spec:
  teamRef: incident-responder
  input:
    alert: |
      PagerDuty Incident P-48291: High error rate on payments-api

      Severity: High
      Service: payments-api
      Triggered: 2026-04-23T02:45:00Z
      Description: Error rate on payments-api exceeded 5% threshold.
      Current rate: 12.3% 5xx responses over the last 5 minutes.
      Alert count: 47 alerts in the last 8 minutes

kubectl apply -f sample-incident.yaml
kubectl get swarmrun incident-001 -n incident-response -w
NAME           PHASE      AGE
incident-001   Pending    0s
incident-001   Running    2s
incident-001   Succeeded  2m18s

Here's what each agent actually produced.

Investigator - it called all three MCP tools (get_incident, query_logs, query_metrics) and pulled together this report:

{
  "incident_id": "P-48291",
  "service": "payments-api",
  "error_pattern": "Connection timeout and pool exhaustion errors involving PostgreSQL primary server failure",
  "timeline": "Incident started at 02:47 UTC when a postgres failover occurred, leading to an error rate spike from 0.1% to 12.3% by 02:50 UTC",
  "metrics_summary": "Error rate on payments-api exceeded baseline starting at 02:47 UTC during a PostgreSQL failover event, peaking at 12.3%"
}

I checked the audit logs - every data point traces back to the mock Grafana response. No hallucination.

Diagnostician - correctly identified the root cause:

{
  "root_cause": "PostgreSQL primary server failure leading to database connection issues and exhausted connection pool",
  "severity": "high",
  "blast_radius": "payments-api service experiencing high error rates (12.3%)",
  "remediation": "Promote the standby PostgreSQL replica to primary and restart the payments-api deployment to clear connection pool caches",
  "runbook": "",
  "confidence": "high"
}

The empty runbook field is correct - none was provided in the evidence, and I explicitly told the model not to invent URLs. (It tried to on an earlier run. Small models love making up links.)

Notifier - the mock Slack server logged this:

kubectl logs -n incident-response deploy/mock-slack
========================================
SLACK MESSAGE to #incidents
========================================
*Incident Summary*
**Service:** payments-api
**Root Cause:** PostgreSQL primary server failure leading to
database connection issues and exhausted connection pool
**Severity:** High
**Blast Radius:** payments-api service experiencing high error rates (12.3%)
*Remediation Steps:*
- Promote the standby PostgreSQL replica to primary
- Restart the payments-api deployment
- Clear connection pool caches
- Monitor for any persistent issues
========================================

And the PagerDuty mock logged the acknowledgment. Both tools called, both actions completed.

Policy

Once this works, you want to make sure nobody deploys a rogue agent with shell access in this namespace:

# incident-policy.yaml
spec:
  enforcementMode: Enforce
  limits:
    maxDailyTokens: 200000
    maxTokensPerCall: 4000
    maxTimeoutSeconds: 600
  tools:
    deny:
      - "shell/*"
      - "filesystem/*"
  models:
    allowed:
      - "qwen*"
      - "llama*"

| Rule | Why |
|---|---|
| `tools.deny: shell/*, filesystem/*` | Agents can use MCP tools but never execute commands or write files |
| `models.allowed: qwen*, llama*` | Only approved local models - no surprise API bills |
| `maxDailyTokens: 200000` | 200K tokens/day ceiling - enough for ~20 incidents |
| `maxTokensPerCall: 4000` | No single LLM call burns more than 4K tokens - just enough headroom for the investigator's 4000-token limit |

Going further

To connect real services, swap the mock URLs:

tools:
  mcp:
    - name: pagerduty
      url: "http://pagerduty-mcp.incident-response.svc:8080"
      auth:
        type: bearer
        secretRef:
          name: pagerduty-api-key

To trigger automatically from PagerDuty webhooks instead of kubectl apply, use a SwarmEvent:

spec:
  source:
    type: webhook
  targets:
    - team: incident-responder
      inputs:
        alert: "{{ .trigger.body.incident.title }}: {{ .trigger.body.incident.description }}"
  concurrencyPolicy: Allow

The operator generates a webhook URL. Point PagerDuty at it and every incident fires the pipeline.

Numbers

| Metric | Value |
|---|---|
| Agents | 3 (investigator + diagnostician + notifier) |
| Model | qwen2.5:7b (any model works) |
| Time per incident | ~2.5 minutes |
| Tokens per incident | ~10,000 |
| Cost per incident | $0.00 (local model) |
| Tools | PagerDuty, Grafana, Slack via MCP |
| Guardrails | Per-agent tool allowlists, namespace policy, token budgets |

The 2.5 minutes is mostly Ollama thinking on my laptop. With a faster model or GPU, this would be well under a minute.

Cleanup

kubectl delete namespace incident-response

All the files are in the cookbook. The docs have more on MCP integration and SwarmEvents.


kubeswarm is an open-source Kubernetes operator for managing AI agents. Docs | Cookbook | GitHub
