Handling Failure

Behavioral Scenario #12

The 3 AM PagerDuty

Login is broken for 15% of Android users. It's 3 AM. You're on-call. The team Slack is quiet. What do you do first?

The Situation

You're a Senior Android engineer on the on-call rotation. At 3:07 AM, PagerDuty fires: Android Vitals is reporting a 15% crash rate spike in the login flow, starting 42 minutes ago. Firebase Crashlytics shows a NullPointerException in the token refresh callback. The most recent deploy was 6 hours ago — a backend change to the auth service that your team didn't directly own.
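To make the failure mode concrete, here is a minimal Kotlin sketch of the kind of client code that would produce this crash. The class, field, and callback names are hypothetical, not taken from the scenario's codebase: the point is only that if the backend's auth change started omitting or reshaping a field in the refresh payload, an unchecked access on the client throws exactly this kind of NullPointerException.

```kotlin
// Hypothetical sketch — names are illustrative, not the actual codebase.
data class TokenResponse(
    val accessToken: String?,    // assumed field; backend change could leave it null
    val expiresInSeconds: Long?
)

interface TokenStore {
    fun save(token: String, expiresInSeconds: Long)
}

class TokenRefreshCallback(private val store: TokenStore) {
    fun onTokenRefreshed(response: TokenResponse) {
        // If the backend's payload changed shape, accessToken arrives null and
        // the !! below throws the NullPointerException Crashlytics is reporting.
        store.save(response.accessToken!!, response.expiresInSeconds ?: 0L)
    }
}
```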

Context

  • The crash rate was 0.1% before the spike — now 15% — suggesting a sudden change, not a gradual regression
  • The backend team's deploy notes don't mention any changes to the auth token format
  • A rollback of the backend change takes 25 minutes and requires the on-call backend engineer to execute
  • Your staging environment uses a test auth endpoint and may not reproduce the issue
  • Some affected users are reporting being unable to complete payment flows — making this business-critical

The Question

Walk me through how you'd handle a production P0 during an on-call rotation when the root cause is unclear.

Response Options

One of these is the strongest response. The others reflect common approaches with real trade-offs.

A. I woke up the backend on-call engineer immediately and told them to roll back their deploy since it was the only recent change.

B. I opened Crashlytics, identified the NullPointerException stack trace, checked git blame on the affected file, established that the failure correlated precisely with the backend deploy timestamp, paged the backend on-call with a written summary of the correlation and asked them to confirm before rollback, and posted a status update to the incident Slack channel.

C. I deployed a hotfix disabling the token refresh retry logic to stop the crashes, then investigated root cause once users were unblocked.

D. I read the crash report and Slack history for 30 minutes to fully understand the issue before taking any action.

The Debrief

Why the Best Response Works

Answer B demonstrates the correct on-call mental model: diagnose with available data → establish causal correlation → escalate with evidence → communicate status proactively. The written correlation summary to the backend on-call is the most important detail — it respects their time, gives them actionable information, and speeds up the rollback decision. Status updates prevent your manager and their manager from waking up to a silent incident.

What to Avoid

The rollback-first reflex is the most common mistake. It feels like action, but it's unconfirmed action — the backend deploy may have no causal relationship to the crash. If it doesn't, you've woken up another engineer unnecessarily and delayed finding the real cause. The hotfix without impact analysis is worse: it's a ship-under-pressure mistake that can turn a visible crash affecting 15% of users into a silent auth failure affecting everyone.
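To illustrate that second failure mode, here is a hedged Kotlin sketch — the flag and function names are made up for this example — of how a "disable the retry" hotfix shipped without impact analysis trades a visible crash for an invisible outage:

```kotlin
// Hypothetical sketch of option C's hotfix; names are illustrative only.
var tokenRefreshRetryEnabled = false  // the "hotfix": flipped off under pressure at 3 AM

fun onTokenRefreshFailed(retry: () -> Unit, forceReauth: () -> Unit) {
    if (tokenRefreshRetryEnabled) {
        // Previous behavior: retry the refresh (and, in this incident, crash).
        retry()
    } else {
        // The crash disappears from the dashboard, but so does the recovery path:
        // every user whose token expires is silently dropped out of authenticated
        // flows, payments included.
        forceReauth()
    }
}
```

The crash count falls to zero, which looks like a fix on the dashboard, while the real user impact quietly widens.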

What the Interviewer Is Probing

The interviewer is evaluating your incident response instincts: do you diagnose before escalating? Do you communicate proactively? Do you assess the blast radius of a mitigation before applying it? These instincts reveal whether you'll be calm and useful at 3 AM or chaotic and expensive.

SOAR Structure

  • **Situation:** 15% login crash rate spike at 3 AM, NullPointerException in token refresh callback, backend deploy 6 hours prior.
  • **Obstacle:** Unclear causality; backend team asleep; staging may not reproduce; payment flows blocked.
  • **Action:** Crashlytics trace analysis; correlated crash timestamp with deploy; paged backend on-call with written summary and asked for confirmation; posted status update; rollback executed in 25 min.
  • **Result:** Crash rate back to 0.1% in 30 min; post-mortem identified missing token format validation in the client (sketched below); alert threshold lowered.
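The post-mortem fix named in the Result could look something like the following minimal Kotlin sketch — a hypothetical client-side validation of the token payload (all names are illustrative), so that a backend format change degrades into a handled error instead of a crash:

```kotlin
// Hypothetical sketch of client-side token format validation; names are illustrative.
sealed class TokenResult {
    data class Valid(val accessToken: String, val expiresInSeconds: Long) : TokenResult()
    data class Invalid(val reason: String) : TokenResult()
}

fun validateTokenResponse(accessToken: String?, expiresInSeconds: Long?): TokenResult {
    // Reject malformed payloads explicitly instead of crashing on a null field.
    if (accessToken.isNullOrBlank()) return TokenResult.Invalid("missing accessToken")
    if (expiresInSeconds == null || expiresInSeconds <= 0) {
        return TokenResult.Invalid("missing or invalid expiry")
    }
    return TokenResult.Valid(accessToken, expiresInSeconds)
}
```

An `Invalid` result can then be logged and surfaced as a recoverable auth error, which is also the signal the lowered alert threshold would catch earlier next time.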

The Learning Arc

"This taught me that on-call is not about being the fastest to act — it's about being the fastest to act correctly. The 5 minutes I spent reading the stack trace and correlating timestamps meant the backend engineer could confirm the rollback in 2 minutes instead of 20. Speed comes from quality of diagnosis, not speed of escalation."

IC Level Calibration

senior · Primary Target

Diagnose from available telemetry, establish correlation with the recent deploy, escalate the backend on-call with a written summary, post a status update, and execute rollback once confirmed. Write a post-mortem the next day.

staff

Same as senior, plus: identify why the crash rate threshold didn't alert sooner (if it should have), and use the post-mortem to drive improvements to the incident runbook and alert thresholds as org-wide standards.

principal

Same as staff, plus: assess whether the incident revealed a gap in the release safety process (e.g., coordinated backend/client deploy guardrails). Drive the process change that prevents a backend deploy from affecting client stability without a coordinated review.

Company Calibration

Google

SRE culture: diagnose, mitigate, communicate in that order

Amazon

LP: Bias for Action — speed matters in high-stakes situations

Stripe

Incident management: data before escalation, communicate early

Coinbase

Mission-critical reliability: calm decision-making under pressure
