DAILY NEWS

AI Doesn’t Replace On-Call Judgment



I’ve had a few people ask me, in roughly this order, whether AI is going to replace incident response, then whether I’ve actually tried it, then whether it’s any good. So here’s the honest answer, from a thing I actually built and ran against real incidents, not a hot take.

Short version: it didn’t replace the judgment call. It replaced the 30-90 minutes of legwork you do before you’re in a position to make that call.

The toil nobody questions because everyone does it the same way

Every incident starts the same way. The page goes off, someone’s IC, and the first chunk of the incident is always the same five things: read the alert thread, pull logs, check what deployed recently, look for related change records, build a timeline. Different incident, same workflow, every time. That repetition is the tell. It’s not judgment, it’s reconnaissance, and reconnaissance is exactly the kind of bounded, repeatable task you can hand off.

So I built a Claude Code skill for it, /incident-investigate. Point it at an incident channel and it:

- reads the thread and pulls out the entities that matter (service, error class, cluster, execution ID)
- queries the observability backend for matching error patterns
- checks deploy history for anything correlated in time
- searches for related change records
- comes back with a structured hypothesis, citations included

Took about two hours to build across two sessions. Takes three minutes to run, versus the 30-90 minutes it replaces.

The rule that actually matters

The speed is the easy part to sell. The part that makes it safe to point at a live incident channel is one rule: no hypothesis without at least one independent data source confirming it.

If the logs are inconclusive and nothing in deploy history lines up, it doesn’t guess. It says “insufficient evidence” and tells you exactly what to check next.
That’s it.

I added that rule because I’d already watched the failure mode it prevents: an earlier, less disciplined AI-in-incidents effort posted an unverified hypothesis into a live channel, and the IC burned real time chasing a lead that was never grounded in anything. A tool that’s occasionally brilliant and occasionally confidently wrong is worse than no tool, because you can’t tell which one you’re getting. A tool that says “I don’t know, here’s why” is one people will actually keep using.

Did it actually work, or did I just build a confident toy?

I didn’t trust my own read on it, so I replayed it against three real incidents with already-known root causes:

- Timeout cascade: root cause was a timeout threshold plus a traffic shift. Caught it (medium confidence).
- Bad deploy: root cause was wrong routing in a recent PR. Caught it (high confidence).
- Upstream outage: root cause was an external DNS failure. Caught it, correctly said “not us.”

3 for 3. Zero false claims. Zero false blame on our own deploys when the real cause was external.

That third one is the one I actually care about. Deploys are always happening, so there’s always a plausible-looking wrong answer on the table. Pointing at the most recent deploy is the laziest possible failure mode for a tool like this. Correctly saying “this isn’t us, here’s why, go check the upstream provider” is the difference between a tool that saves time and a tool that generates a new chore: someone debunking a false lead.

One of those three incidents also had a ten-hour detection gap before anyone noticed sustained errors on an affected endpoint. Run this at any point in those ten hours and you’ve got root cause in three minutes instead of ten hours. That’s not a productivity number, that’s ten hours of customer impact that didn’t have to happen.

So, does it replace the IC?

No. The IC still decides what to communicate, when to escalate, when to roll back versus ride it out, who to pull in.
None of that is reconnaissance, all of it is judgment, and I have zero interest in automating it. What changes is when that judgment gets exercised: minute three, with evidence in hand, instead of minute forty-five, after manually rebuilding the same evidence everyone rebuilds every time.

This generalizes past incidents, too. Anywhere a person’s first 30-90 minutes on a task is “query the same 3-4 systems, in the same order, for the same kind of signal,” that’s not really a job description, that’s a function signature. Automate the function. Keep the human for the part that’s actually a decision.

If you’re building something similar: spend more time than feels necessary figuring out which data sources you can actually query before you write a line of the skill – that’s where the dead ends live. Write down the one paragraph of “why this design, not that one” before you touch code. And don’t trust it near anything live until you’ve replayed it against real history with known answers, like I did above. None of that’s AI-specific advice, it’s just the stuff that’s easy to skip under deadline pressure, and that AI tooling makes cheap enough to actually do.

Want to compare notes on grounding gates, Claude Code skills, or incident response tooling in general? Come find me on GitHub, Hachyderm, or over on swamp-club.
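
For readers who want the grounding rule from earlier in the post in concrete terms, here is a minimal Python sketch of a gate of that kind. It is not the code behind /incident-investigate, which the post does not show. Every name in it (Evidence, Hypothesis, gate_hypothesis, the source labels, the placeholder citation) is hypothetical, and it assumes "independent" means supporting evidence from a source other than the one that first suggested the hypothesis.

```python
from dataclasses import dataclass, field

# Hypothetical source labels. The post names these kinds of systems,
# but not their APIs, so plain strings stand in for them here.
THREAD = "alert_thread"
LOGS = "observability_logs"
DEPLOYS = "deploy_history"
CHANGES = "change_records"


@dataclass(frozen=True)
class Evidence:
    source: str     # which system the evidence came from
    citation: str   # a pointer a human can follow up on
    supports: bool  # True if it corroborates the hypothesis


@dataclass
class Hypothesis:
    summary: str
    origin: str  # the source that first suggested the hypothesis
    evidence: list[Evidence] = field(default_factory=list)


def gate_hypothesis(h: Hypothesis, next_checks: list[str]) -> dict:
    """Return a postable finding only if an independent source confirms it."""
    # Assumption: "independent" means supporting evidence that comes from
    # a different source than the one the hypothesis originated from.
    confirming = [e for e in h.evidence if e.supports and e.source != h.origin]
    if not confirming:
        # No guess is returned, only what to check next.
        return {"status": "insufficient evidence", "next_checks": next_checks}
    return {
        "status": "hypothesis",
        "summary": h.summary,
        "citations": [e.citation for e in confirming],
    }


# A hypothesis from the thread that deploy history corroborates.
grounded = Hypothesis(
    summary="Routing change in a recent deploy is misdirecting traffic",
    origin=THREAD,
    evidence=[Evidence(DEPLOYS, "deploy-record-placeholder", True)],
)

# A hypothesis whose only support is the thread it came from.
ungrounded = Hypothesis(
    summary="The most recent deploy caused it",
    origin=THREAD,
    evidence=[Evidence(THREAD, "guess posted in the thread", True)],
)

print(gate_hypothesis(grounded, [])["status"])
print(gate_hypothesis(ungrounded, ["check upstream provider status"])["status"])
```

The first call prints "hypothesis" and the second prints "insufficient evidence". The design choice worth noticing is that the gate returns a list of next checks instead of a weaker guess, so the failure case still hands the IC something to act on.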


