Article 5 of 7

Leading Through Incidents Without Losing Trust

How to lead teams through production crises constructively.

14 min · Advanced

Key Takeaway

Anyone can lead a team when the sprint is going smoothly. Your real value as a tech lead is tested at 3 AM when everything is on fire. The Incident Commander mindset — staying out of the code, coordinating response, protecting your engineers from stakeholder noise — is the single most important leadership skill you'll develop. The blameless post-mortem is where trust is built or destroyed. Fix the system that allowed the error, never the person who made it.

It's 2:47 AM on a Friday.

Your phone goes off. PagerDuty. Then again. And again. You open your laptop, squinting at the screen.

Database CPU: 98%. API error rate: 74%. Your customer support queue has gone from 12 open tickets to 340 in the last 20 minutes. Enterprise clients are angry. Your on-call engineer is already in the incident Slack channel, messages coming fast and uncertain.

This is the crucible of technical leadership.

Anyone can lead a team when the sprint is going well, standups are cheerful, and the tests are green. Your real value as a tech lead gets forged — or found wanting — in moments like this. How you handle the next three hours will determine not just the system's recovery, but the psychological safety of your entire team for months afterward.

The Incident Commander Mindset

The most dangerous thing you can do when an alert fires is to immediately open a database console and start running queries.

If you're the tech lead and you're typing, you're not leading. You've put your head back in the trenches and left the command position empty. The engineers who should be diagnosing the problem are now looking at each other wondering who's coordinating, who's communicating to stakeholders, and whether the right people are even involved yet.

Your job during a major incident is to be the Incident Commander — the person who holds the map, coordinates the effort, handles external communication, and protects the engineers who are actually fixing the problem.

There's a useful analogy here: in major surgery, the attending surgeon doesn't also take the anesthesiology calls. Everyone focuses on their own role. The moment you try to fill every role yourself in an incident, you create chaos.

Step 1: Establish the War Room

Within the first two minutes of an incident, create a single, central communication channel. A dedicated Slack channel, a Zoom call, or both.

If communication is fragmented across three DMs, two email threads, and someone's personal phone, you will duplicate work, miss critical information, and make the incident worse. All incident communication lives in one place. Immediately.

Step 2: Protect the Engineers Who Are Digging

The engineers deep in the logs — the "diggers" — need uninterrupted focus. A diagnosis that should take 15 minutes takes 45 when the engineer is fielding anxious questions from three different stakeholders every five minutes.

As the Incident Commander, you become the shield.

You handle all communication with the VP of Product, the CEO, the customer success team, and anyone else who wants updates. You give structured, time-boxed updates: "We have two engineers investigating a database performance issue right now. I will post an update in this channel at 3:15 AM with our current hypothesis."

Then you hold that line and let the diggers dig.

Step 3: Force Hypothesis-Driven Action

Panic makes smart engineers do dangerous things.

Under pressure, someone will inevitably propose a drastic action without thinking it through: "Let's reboot the primary database node." Or in 2026 terms: "The AI monitoring dashboard is suggesting we scale up the database cluster — let's just do it."

Do not allow unilateral, drastic actions without a stated hypothesis and a rollback plan.

Force the team to answer three questions before any drastic action is taken:

What is our hypothesis about what's causing this?
What do we expect to happen if this action is correct?
What is the risk if we're wrong, and what's our rollback plan?

This takes 90 seconds. It prevents incidents from multiplying in severity. I've seen well-intentioned engineers accidentally cause a cascading failure by taking an undiscussed action under pressure. The discipline of hypothesis-first saves you from that.
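One way to make the three questions hard to skip is to treat them as required fields on any proposed action. The sketch below is only an illustration under that assumption: `ProposedAction`, its field names, and `missing_answers` are hypothetical, not part of any incident tooling.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedAction:
    """A drastic action someone wants to take mid-incident (hypothetical structure)."""

    description: str
    hypothesis: str = ""         # What do we think is causing this?
    expected_outcome: str = ""   # What should we see if the hypothesis is right?
    risk_and_rollback: str = ""  # What is the risk if we're wrong, and how do we undo it?


def missing_answers(action: ProposedAction) -> list[str]:
    """Return the names of the questions that are still unanswered."""
    answers = {
        "hypothesis": action.hypothesis,
        "expected_outcome": action.expected_outcome,
        "risk_and_rollback": action.risk_and_rollback,
    }
    return [name for name, value in answers.items() if not value.strip()]


def may_proceed(action: ProposedAction) -> bool:
    """An action may proceed only when all three questions have an answer."""
    return not missing_answers(action)
```

A proposal such as `ProposedAction("Reboot the primary database node")` fails the check until someone fills in all three answers, which is the 90-second pause the step asks for.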

A note on AI tools in incidents: in 2026, you may have AI-powered monitoring and analysis tools suggesting probable root causes. These can be genuinely useful for narrowing the hypothesis space quickly. But treat them as inputs to human judgment, not decisions. An AI suggestion to restart a service is still a drastic action that requires a human to own the call and the rollback.

Step 4: Communicate in Structured Intervals

Never go silent during an incident, even when you don't have answers yet.

Every 15-20 minutes, post a structured update to the stakeholder channel:

Status: What we know right now
Hypothesis: What we think is causing it
Current action: What the team is doing about it
Next update: When you'll check in again
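The four fields above map naturally onto a small template, so every update looks the same and always states when the next one is due. This is a minimal sketch: the `StatusUpdate` class, the function names, and the 15-minute default are illustrative assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class StatusUpdate:
    """One structured stakeholder update (hypothetical structure)."""

    status: str
    hypothesis: str
    current_action: str
    next_update_at: datetime


def next_update_time(now: datetime, interval_minutes: int = 15) -> datetime:
    """Commit to the time of the next update, a fixed interval from now."""
    return now + timedelta(minutes=interval_minutes)


def format_update(update: StatusUpdate) -> str:
    """Render the update in the four-line format used for the stakeholder channel."""
    return "\n".join(
        [
            f"Status: {update.status}",
            f"Hypothesis: {update.hypothesis}",
            f"Current action: {update.current_action}",
            f"Next update: {update.next_update_at:%H:%M}",
        ]
    )
```

Because `hypothesis` is a required field, the commander has to write something there, even if it is only "We don't know yet".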

"We don't know yet" is a valid status. What's not valid is silence, which creates anxiety and forces stakeholders to start messaging your engineers directly.

The Morning After: The Blameless Post-Mortem

The incident is mitigated. The system is recovering. Your team is exhausted. You send the "all clear" message at 5:30 AM and everyone goes back to sleep.

The real leadership work begins within the next 48 to 72 hours, with the post-mortem.

The goal of a post-mortem is not to find out who caused the incident. It is to find out why the system allowed an engineer to cause the incident in the first place.

This distinction sounds philosophical but is deeply practical. When teams run blame-first post-mortems, engineers stop reporting near-misses, hide their mistakes, and become risk-averse in ways that slow the team down for years. When teams run blameless post-mortems, engineers bring problems forward early, incidents decrease, and the culture becomes genuinely safer.

Here's the shift: if an engineer dropped a production table because they mistyped a command, the problem is not the engineer. The problem is that the system gave an engineer the ability to drop a production table without any safeguard — no confirmation step, no peer review, no dry-run mode.
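To make the safeguards concrete, here is a rough sketch of a wrapper that defaults to dry-run mode and refuses to run until the operator retypes the target name. Everything in it is hypothetical (`run_destructive`, `ConfirmationRequired`, the `execute` callback), and it covers only two of the three safeguards named above; peer review would sit in the surrounding process rather than in a function like this.

```python
from __future__ import annotations

from typing import Callable


class ConfirmationRequired(Exception):
    """Raised when a destructive statement is attempted without explicit confirmation."""


def run_destructive(
    statement: str,
    target: str,
    *,
    execute: Callable[[str], object],
    dry_run: bool = True,
    confirm: str | None = None,
) -> str:
    """Run a destructive statement only after a dry run and a typed confirmation.

    `execute` is whatever actually talks to the database; it is injected here so
    that the guard itself has no side effects of its own.
    """
    if dry_run:
        # Default path: describe what would happen, touch nothing.
        return f"DRY RUN: would execute {statement!r} against {target}"
    if confirm != target:
        # The operator must retype the exact target name, e.g. the database name.
        raise ConfirmationRequired(f"retype the target name {target!r} to confirm")
    execute(statement)
    return f"EXECUTED: {statement!r} against {target}"
```

With a guard like this, a mistyped command produces a dry-run message or a refusal instead of a dropped table.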

The Five Whys Without the Witch Hunt

Use the Five Whys technique to drill down to the systemic root cause, keeping individuals entirely out of the chain.

Why did the API go down? Memory on the pods was exhausted.
Why was memory exhausted? A new feature loaded 600,000 records into memory in a single query instead of paginating.
Why did this get merged? The code review didn't catch the missing pagination.
Why did it pass staging? Staging has only 800 records in the relevant table; the memory spike never manifested.
Why is staging data so sparse? The data generation script hasn't been updated in two years and doesn't reflect production scale.

Notice: no individual is blamed at any step. Every step identifies a systemic gap.

The action items from this post-mortem are not "remind engineers to be more careful in code reviews." They are:

Implement mandatory query result limits at the ORM level to prevent unbounded queries
Rewrite the staging data generation script to use anonymised, production-scale data
Add a code review checklist item specifically for queries that touch high-volume tables
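As an illustration of the first action item, the sketch below reads a table in bounded pages and rejects any page size above a fixed ceiling. It uses the standard-library `sqlite3` module rather than a real ORM hook, and the `records` table, its columns, and the `MAX_PAGE_SIZE` value are assumptions made for the example.

```python
from __future__ import annotations

import sqlite3
from typing import Iterator

MAX_PAGE_SIZE = 1000  # assumed ceiling; a real limit would be tuned to pod memory


def iter_records(conn: sqlite3.Connection, page_size: int = 500) -> Iterator[list[tuple]]:
    """Yield rows from the hypothetical `records` table one bounded page at a time."""
    if not 0 < page_size <= MAX_PAGE_SIZE:
        raise ValueError(f"page_size must be between 1 and {MAX_PAGE_SIZE}")
    last_id = 0
    while True:
        # Keyset pagination: resume after the last id seen instead of using OFFSET.
        rows = conn.execute(
            "SELECT id, payload FROM records WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, page_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]
```

The caller never holds more than one page in memory, so a table that grows from 800 rows to 600,000 changes the number of iterations, not the size of each one.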

Systemic fixes prevent the next incident. Blame prevents the next engineer from telling you about a near-miss before it becomes an incident.

Building the Post-Incident Ritual

The post-mortem document matters, but the ritual around it matters more.

Run the post-mortem as a facilitated meeting within 48-72 hours of the incident, while the details are fresh. Keep it under an hour. Separate the timeline reconstruction (factual) from the analysis (interpretive) to prevent conflation of "what happened" with "whose fault was it."

Publish the post-mortem to the entire engineering organisation, not just the involved team. Shared learnings compound over time. The authentication incident you shared across the organisation last quarter may have just prevented another team from making the same architectural mistake.

And close every post-mortem with prioritised action items that are actually assigned and tracked. A post-mortem that produces a document nobody reads or action items nobody owns is theater. The value is in the system changes.
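A lightweight check can catch the "nobody owns it" failure before the meeting ends. The sketch below assumes a hypothetical `ActionItem` record in which an item counts as tracked only when it has both an owner and a due date; those field names and that rule are illustrative, not a prescribed format.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    """One post-mortem action item (hypothetical structure)."""

    title: str
    priority: int  # 1 is the highest priority
    owner: str = ""
    due: Optional[date] = None


def untracked(items: list[ActionItem]) -> list[str]:
    """Return the titles of items that lack an owner or a due date."""
    return [item.title for item in items if not item.owner.strip() or item.due is None]


def by_priority(items: list[ActionItem]) -> list[ActionItem]:
    """Return the items ordered from highest to lowest priority."""
    return sorted(items, key=lambda item: item.priority)
```

If `untracked` returns anything, the post-mortem is not finished yet.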

Key Takeaways

Stay out of the trenches during a major incident. If you're typing, you're not leading. Your job is to coordinate, communicate, and protect.
Shield your engineers. The people fixing the problem need uninterrupted focus, so you absorb the stakeholder questions and post the regular updates yourself.
Demand hypotheses before drastic actions. Even AI-suggested actions need a human owner and a rollback plan.
Fix the system, not the person. Blameless post-mortems focus on why the architecture allowed an error to become an incident, not on who made the error.
Your next step this week: If you don't have a defined incident response process — even a rough one — write one before you need it. The worst time to define your war room protocol is at 3 AM.
