Behavioral Scenario #12
The 3 AM PagerDuty
“Login is broken for 15% of Android users. It's 3 AM. You're on-call. The team Slack is quiet. What do you do first?”
The Situation
You're a Senior Android engineer on the on-call rotation. At 3:07 AM, PagerDuty fires: Android Vitals is reporting a 15% crash-rate spike in the login flow that began 42 minutes ago. Firebase Crashlytics shows a NullPointerException in the token refresh callback. The most recent deploy, 6 hours earlier, was a backend change to the auth service that another team owns.
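To make the failure mode concrete, here is a minimal sketch of the kind of null-unsafe refresh callback that produces this crash when a backend response field stops deserializing cleanly. All names (TokenResponse, TokenStore, TokenRefresher) are hypothetical and not from the scenario's codebase.

```kotlin
// Hypothetical sketch of a null-unsafe token refresh callback.
// Types and names are illustrative, not from the scenario.
data class TokenResponse(val accessToken: String?, val expiresInSeconds: Long?)

interface TokenStore {
    fun save(token: String, expiresInSeconds: Long)
}

class TokenRefresher(private val store: TokenStore) {
    fun onTokenRefreshed(response: TokenResponse) {
        // If the backend deploy changed the response so that accessToken now
        // deserializes to null, these non-null assertions throw the
        // NullPointerException that Crashlytics is reporting.
        store.save(response.accessToken!!, response.expiresInSeconds!!)
    }
}
```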
Context
- The crash rate was 0.1% before the spike — now 15% — suggesting a sudden change, not a gradual regression
- The backend team's deploy notes don't mention any changes to the auth token format
- A rollback of the backend change takes 25 minutes and requires the on-call backend engineer to execute
- Your staging environment uses a test auth endpoint and may not reproduce the issue
- Some affected users are reporting being unable to complete payment flows — making this business-critical
The Question
“Walk me through how you'd handle a production P0 during an on-call rotation when the root cause is unclear.”
Response Options
One of these is the strongest response. The others reflect common approaches with real trade-offs.
A. I woke the backend on-call engineer immediately and told them to roll back their deploy, since it was the only recent change.
B. I opened Crashlytics, identified the NullPointerException stack trace, checked git blame on the affected file (no recent client-side changes), established that the crash onset pointed to the backend deploy (the lag after the deploy is consistent with crashes starting only as existing tokens expired and hit the changed refresh path), paged the backend on-call with a written summary of the correlation and asked them to confirm before rolling back, and posted a status update to the incident Slack channel.
C. I deployed a hotfix disabling the token refresh retry logic to stop the crashes, then investigated the root cause once users were unblocked.
D. I read the crash report and Slack history for 30 minutes to fully understand the issue before taking any action.
The Debrief
Why the Best Response Works
Answer B demonstrates the correct on-call mental model: diagnose with available data → establish causal correlation → escalate with evidence → communicate status proactively. The written correlation summary to the backend on-call is the most important detail — it respects their time, gives them actionable information, and speeds up the rollback decision. Status updates prevent your manager and their manager from waking up to a silent incident.
What to Avoid
The rollback-first reflex (Answer A) is the most common mistake. It feels like action, but it's unconfirmed action: the backend deploy may have no causal relationship. If it turns out to be unrelated, you've woken another engineer unnecessarily and delayed finding the real cause. The hotfix without impact analysis (Answer C) is worse: it's a ship-under-pressure mistake that can turn a visible crash into a silent auth failure that affects everyone.
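To make the blast-radius point concrete, here is a hedged sketch, reusing the hypothetical types from the earlier example, of how a rushed mitigation that merely suppresses the failure trades a visible crash for a silent one.

```kotlin
// Hypothetical rushed "hotfix" with a worse failure mode than the crash it removes.
// Illustrative only; reuses the TokenResponse and TokenStore types sketched above.
class HotfixedTokenRefresher(private val store: TokenStore) {
    fun onTokenRefreshed(response: TokenResponse) {
        try {
            store.save(response.accessToken!!, response.expiresInSeconds!!)
        } catch (e: NullPointerException) {
            // The crash rate drops to zero, but no token is ever saved:
            // affected users are now silently logged out on their next
            // authenticated request, including mid-payment.
        }
    }
}
```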
What the Interviewer Is Probing
The interviewer is evaluating your incident response instincts: do you diagnose before escalating? Do you communicate proactively? Do you assess the blast radius of a mitigation before applying it? These instincts reveal whether you'll be calm and useful at 3 AM or chaotic and expensive.
SOAR Structure
- **Situation:** 15% login crash-rate spike at 3 AM, NullPointerException in the token refresh callback, backend deploy 6 hours prior.
- **Obstacle:** Unclear causality; backend team asleep; staging may not reproduce; payment flows blocked.
- **Action:** Analyzed the Crashlytics trace; correlated the crash onset with the backend deploy; paged the backend on-call with a written summary and asked for confirmation; posted a status update; rollback executed in 25 minutes.
- **Result:** Crash rate back to 0.1% within 30 minutes; post-mortem identified missing token format validation in the client; alert threshold lowered.
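For reference, here is a minimal sketch of the kind of client-side validation the post-mortem calls for, again using the hypothetical types from the first example; the real fix would depend on the actual auth client.

```kotlin
// Hypothetical post-mortem fix: validate the refresh response and fail loudly
// but non-fatally, instead of crashing the login flow or swallowing the error.
class ValidatingTokenRefresher(
    private val store: TokenStore,
    private val onInvalidResponse: (TokenResponse) -> Unit,
) {
    fun onTokenRefreshed(response: TokenResponse) {
        val token = response.accessToken
        val expiry = response.expiresInSeconds
        if (token.isNullOrBlank() || expiry == null || expiry <= 0) {
            // Record a non-fatal error (e.g., to Crashlytics) and route the
            // user to re-authentication rather than crashing.
            onInvalidResponse(response)
            return
        }
        store.save(token, expiry)
    }
}
```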
The Learning Arc
"This taught me that on-call is not about being the fastest to act — it's about being the fastest to act correctly. The 5 minutes I spent reading the stack trace and correlating timestamps meant the backend engineer could confirm the rollback in 2 minutes instead of 20. Speed comes from quality of diagnosis, not speed of escalation."
IC Level Calibration
Senior: Diagnose from available telemetry, establish correlation with the recent deploy, escalate to the backend on-call with a written summary, post a status update, and execute the rollback once confirmed. Write a post-mortem the next day.
Staff: Same as senior, plus: identify why the crash-rate threshold didn't alert sooner (if it should have), and use the post-mortem to drive improvements to the incident runbook and alert thresholds as org-wide standards.
Principal: Same as staff, plus: assess whether the incident revealed a gap in the release safety process (e.g., guardrails for coordinated backend/client deploys), and drive the process change that ensures backend changes with client-facing impact get a coordinated review before they ship.
Company Calibration
| Company | Calibration |
| --- | --- |
| Google | SRE culture: diagnose, mitigate, communicate, in that order |
| Amazon | Bias for Action (LP): speed matters in high-stakes situations |
| Stripe | Incident management: data before escalation, communicate early |
| Coinbase | Mission-critical reliability: calm decision-making under pressure |