Behavioral Scenario #03
The Silent Crash
“Your code crashed nearly 2% of users for 48 hours. Your VP found out from a user review. What do you do?”
The Situation
You're a Senior Android engineer. A null-pointer exception in the background sync service you wrote caused silent crashes on Android 10 devices. The crash rate sat at 1.8% — just below your team's automated alert threshold of 2%. No one noticed for 48 hours. A VP saw it mentioned in a user review and asked your EM directly.
Context
- The crash was in code you wrote and reviewed yourself — no other engineer touched it
- Approximately 40,000 user sessions were affected over 48 hours
- The fix itself took 20 minutes once identified: a simple null-check (a minimal sketch follows this list)
- Your team had discussed improving crash monitoring two quarters ago but it was deprioritized
- The VP's message to your EM was curt; the incident was now squarely on your EM's radar
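For a sense of scale, the "simple null-check" mentioned above is typically a one-line guard. The Kotlin sketch below is purely illustrative, assuming the sync runs as an androidx.work `Worker`; `SyncWorker`, `AccountStore`, and `Account` are hypothetical names, since the scenario doesn't show the real code.

```kotlin
import android.content.Context
import androidx.work.Worker
import androidx.work.WorkerParameters

// Hypothetical stand-ins for the app's real account layer.
data class Account(val id: String)

object AccountStore {
    // Can legitimately be null, e.g. if the user signed out or the account
    // hasn't finished loading when the sync job fires.
    var activeAccount: Account? = null
}

class SyncWorker(context: Context, params: WorkerParameters) : Worker(context, params) {

    override fun doWork(): Result {
        // The "simple null-check": skip this sync cycle instead of crashing
        // when no account is available.
        val account = AccountStore.activeAccount ?: return Result.success()
        syncFor(account)
        return Result.success()
    }

    private fun syncFor(account: Account) {
        // Real sync logic omitted.
    }
}
```

Returning success rather than retry or failure when no account is present is only one of several reasonable choices; the scenario doesn't specify which behavior the real service needed.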
The Question
“Tell me about a time something you shipped caused an unexpected problem in production. How did you handle it?”
Response Options
One of these is the strongest response. The others reflect common approaches with real trade-offs.
A. I explained to my manager that the crash rate was below our alert threshold and the fix was deployed quickly once we were notified.
B. I apologized to the team, shipped the fix immediately, and tried to move on without dwelling on it.
C. I shipped the hotfix within the hour, then wrote a blameless postmortem covering what failed, why monitoring missed it, and a concrete proposal to lower the alert threshold and add crash-rate tracking to our release checklist. I presented it to the team the next day.
D. I fixed the crash, updated the monitoring threshold myself, and followed up with the VP directly to let them know it was resolved.
The Debrief
Why the Best Response Works
Answer C demonstrates the full ownership loop: immediate fix (shows you can act fast), systemic analysis (shows you can think structurally), and team presentation (shows you can turn personal failure into collective improvement). The blameless postmortem framing is specific — it signals familiarity with engineering best practices and prevents a blame culture from forming.
What to Avoid
Framing the crash as a monitoring failure, not your failure. Even if the monitoring threshold was too high, the bug was yours. Senior engineers own both. The moment you say 'it was below the threshold,' you've lost the interviewer — because that's what someone says when they're covering themselves, not when they're growing.
What the Interviewer Is Probing
The interviewer is measuring ownership depth. Do you fix the symptom or the system? Can you extract organizational learning from a personal mistake? This is one of the most revealing behavioral questions because failure responses are where character shows.
SOAR Structure
- **Situation:** NPE in background sync crashed 1.8% of Android 10 users (below the alert threshold) for 48 hours.
- **Obstacle:** No alert fired; VP found out before the team did.
- **Action:** Hotfix in 1 hour; blameless postmortem the same day; proposed threshold reduction and release checklist item; presented to team.
- **Result:** Zero similar incidents in 6 months; release checklist updated; postmortem format adopted by two other teams.
The Learning Arc
"This taught me that alert thresholds are a policy decision, not a technical one — they reflect what you're willing to accept, not just what's technically possible to detect. After this incident, I changed my personal bar: any crash affecting >0.5% of a user segment is a P1 regardless of aggregate rate. I also started requiring crash-rate review as part of every PR that touches background services."
IC Level Calibration
Senior
Fix + blameless postmortem + present findings to the team. Own both the bug and the monitoring gap. Propose a specific process change to prevent recurrence.
Staff
Same as Senior, plus: the monitoring threshold change becomes the standard for all background services, not just yours. You drive the change org-wide and get it into the release checklist before the next sprint.
Staff+ / Principal
The postmortem template you create becomes the company standard. The monitoring improvement is rolled into the platform team's reliability roadmap. You use this incident to make the case for a dedicated reliability rotation.
Company Calibration
Amazon
LP: Ownership — leaders never say 'that's not my job'
Airbnb
Belonging: building trust through accountability and transparency
BlackRock
Risk assessment + decision quality under pressure
Google
Googleyness: Intellectual humility + learning orientation