Chapter 11
Interview Execution
Answer Framework · Failure Scenarios · Tradeoffs Reference · Common Questions
The 6-Step Answer Framework
Use this structure on every system design question. Deviating — especially jumping to architecture before requirements — is the #1 failure signal at Staff level.
| Step | What to Do | What to Say |
|---|---|---|
| 1 · Clarify | Separate functional from non-functional requirements | "What scale are we targeting? Any latency SLA? Consistency requirements?" |
| 2 · Scale | State DAU, concurrent users, p99 target, data volume | "50M DAU, 500K concurrent, <500ms p99, messages stored 90 days." |
| 3 · Architecture | Name the pattern and justify the choice | "I'd use Clean Architecture with offline-first and WebSocket for real-time." |
| 4 · Components | Walk each layer: network, state, DB, sync, background work | Draw the dependency graph. Explain each boundary and why it exists. |
| 5 · Failures | Cover 3–4 failure scenarios without being prompted | "On network loss I queue in outbox. On 5xx I use exponential backoff." |
| 6 · Tradeoffs | Justify every major decision with explicit tradeoffs | "I chose SSE over WebSocket because this stream is read-only, simpler infra." |
Worked Example — Design a Mobile Chat / AI App
Walk through the 6-step framework applied to a real question.
Step 1 — Requirements (say these out loud)
Functional: send/receive messages, stream AI responses, offline read
Non-functional: <500ms message delivery p99, 99.9% uptime, E2E encrypted
Scale: 10M DAU, 5 sessions/day, 10 msg/session = 500M msgs/day ~ 6k RPS
Consistency: causal (messages in order per conversation)
Step 2 — Architecture Decision
Real-time: WebSocket for bidirectional chat + SSE for AI token streaming
Persistence: Room (messages) + outbox table (unsent) + DataStore (prefs)
Offline: write-local-first, WorkManager drains outbox on reconnect
Sync: delta sync on foreground (lastSyncedAt cursor per conversation)
Step 3 — Component Walkthrough
UI → ChatViewModel (StateFlow)
→ SendMessageUseCase
├── MessageRepository.saveLocal() [Room tx: message + outbox]
└── MessageRepository.streamAI() [OkHttp SSE → callbackFlow]
└── tokens emitted to StateFlow → Compose recomposes
OutboxWorker (WorkManager) reads PENDING → POST /messages → mark SENT
WebSocketManager (singleton) pushes incoming msgs → Room → Flow re-emit
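The write-local-first path above can be sketched as a dependency-free model. This is illustrative, not the real implementation: `MessageStore`, `saveLocal`, and `drainOutbox` are hypothetical names, and in the real app the two inserts would be Room DAO calls inside a single Room transaction, with WorkManager (not a plain function) draining the outbox on reconnect.

```kotlin
enum class OutboxStatus { PENDING, SENT }

data class Message(val id: String, val text: String)
data class OutboxEntry(val messageId: String, var status: OutboxStatus)

class MessageStore {
    val messages = mutableListOf<Message>()
    val outbox = mutableListOf<OutboxEntry>()

    // Step 1: persist locally first — the UI renders from this immediately,
    // regardless of connectivity. In Room this is one transaction so the
    // message and its outbox row are atomic (survives process death together).
    fun saveLocal(message: Message) {
        messages += message
        outbox += OutboxEntry(message.id, OutboxStatus.PENDING)
    }

    // Step 2: a worker drains PENDING rows on reconnect; `send` stands in
    // for the POST /messages call and returns whether it succeeded.
    fun drainOutbox(send: (Message) -> Boolean) {
        outbox.filter { it.status == OutboxStatus.PENDING }.forEach { entry ->
            val msg = messages.first { it.id == entry.messageId }
            if (send(msg)) entry.status = OutboxStatus.SENT
        }
    }
}
```

A failed `send` leaves the row PENDING, so the next worker run retries it for free.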
Step 4 — Failure Scenarios (volunteer these)
Network loss mid-send → outbox PENDING → WorkManager retries on reconnect
SSE stream drops → callbackFlow close(e) → ViewModel shows retry UI
Server 5xx → exponential backoff + jitter, max 5 retries
Process death mid-send → outbox row survives in Room; worker resumes
Auth token expired → OkHttp Authenticator refreshes → replays request
Interview tip: Key tradeoffs to state: WebSocket over SSE (bidirectional needed for chat). Write-local-first over network-first (offline UX is non-negotiable). Causal consistency over strong (ordering per conversation is sufficient).
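"Exponential backoff + jitter" is worth being able to write on a whiteboard. Here is a minimal sketch of the "full jitter" variant: the delay before retry n is drawn uniformly from [0, min(cap, base · 2ⁿ)]. The function name and defaults are illustrative.

```kotlin
import kotlin.random.Random

// Delay (ms) before the given retry attempt (0-based). Full jitter:
// uniform in [0, min(capMs, baseMs * 2^attempt)] — spreads out retries
// so a fleet of clients doesn't hammer a recovering server in lockstep.
fun backoffDelayMs(
    attempt: Int,
    baseMs: Long = 500,
    capMs: Long = 30_000,
    random: Random = Random.Default,
): Long {
    val exp = minOf(capMs, baseMs shl minOf(attempt, 20)) // clamp shift to avoid overflow
    return random.nextLong(exp + 1)
}
```

In the chat app this feeds the outbox worker's retry loop; give up (or surface an error state) after a fixed attempt budget.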
Failure Scenarios — Have These Ready
Proactively addressing failure scenarios shows Staff-level thinking.
| Failure | Response Strategy |
|---|---|
| Network loss mid-request | NetworkCallback detects drop → queue in outbox (Room) → WorkManager retries with exponential backoff |
| Server 5xx errors | Exponential backoff + jitter → surface error state in UI after N retries → never hammer a failing server |
| App crash during write | Room transactions ensure atomicity → WorkManager resumes outbox on next launch automatically |
| Partial streaming response | Buffer tokens received → on disconnect, resume from last confirmed token position if API supports cursors |
| Stale cache served to user | ETag / Last-Modified headers → background refresh with stale-while-revalidate → never block UI on freshness |
| Auth token expired mid-request | OkHttp Authenticator intercepts 401 → refreshes access token → replays original request transparently |
| Out of memory / low memory | onTrimMemory() callback → release L1 cache → Coil/Glide handle bitmap eviction automatically |
| Push notification not received | FCM is not guaranteed → implement pull fallback (delta sync on app foreground) as safety net |
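The 401-refresh-replay row deserves one concrete detail: when several in-flight requests fail with 401 at once, only one refresh should happen. In the real app this logic lives in an `okhttp3.Authenticator`; the sketch below is a dependency-free model of just that deduplication rule, with illustrative names.

```kotlin
// Model of the refresh-once rule an okhttp3.Authenticator implements.
// `refresh` stands in for the token-refresh network call; it returns the
// new access token, or null if the refresh token itself is invalid.
class TokenStore(
    private var accessToken: String,
    private val refresh: () -> String?,
) {
    fun current(): String = accessToken

    // Called when a request came back 401. Returns the token to replay the
    // request with, or null to give up (typically: force re-login).
    @Synchronized
    fun onUnauthorized(failedWith: String): String? {
        // Another caller may already have refreshed; if so, reuse its result
        // instead of burning a second refresh.
        if (failedWith != accessToken) return accessToken
        val fresh = refresh() ?: return null
        accessToken = fresh
        return fresh
    }
}
```

Returning null from an OkHttp Authenticator is what stops the replay loop; the same "null means give up" shape is mirrored here.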
Tradeoffs Reference
Every decision needs a stated tradeoff. Say: 'I chose X over Y because... The cost is...'
| Decision | Option A | Option B |
|---|---|---|
| API protocol | REST — simple, cacheable, mature tooling | GraphQL — flexible, no over/under-fetching |
| Real-time transport | WebSocket — bidirectional, low latency | SSE — simpler, server-to-client only |
| Polling strategy | Long polling — near real-time, fewer requests | Short polling — simple, wastes bandwidth |
| Update delivery | Push (FCM) — battery efficient, real-time | Pull — simple, adds latency, drains battery |
| Data freshness | Cache-first — fast UX, may show stale data | Network-first — always fresh, needs connectivity |
| Offline support | Offline-first — great UX, complex sync logic | Online-only — simple to build, fails without network |
| Consistency | Strong — always correct, slower writes | Eventual — fast, temporary divergence acceptable |
| Pagination | Cursor-based — stable under inserts/deletes | Offset-based — simple, but skips/duplicates on changes |
| UI framework | Full native Compose — best performance, Android only | KMP — shared logic, native UI per platform |
| DI scope | Singleton — one instance app-wide, fast access | ViewModelScoped — fresh per screen, better isolation |
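The pagination row is the one interviewers most often probe, so it helps to show the failure mode concretely. A toy sketch (names invented): items are newest-first, and a cursor is "the id of the last item I saw", so inserts at the head cannot shift the next page — whereas an offset of 2 would re-serve an already-seen item after an insert.

```kotlin
// Cursor-based page fetch over an id list, newest first.
data class Page(val items: List<Int>, val nextCursor: Int?)

fun fetchAfter(all: List<Int>, cursor: Int?, limit: Int): Page {
    val start = if (cursor == null) 0 else all.indexOf(cursor) + 1
    val items = all.drop(start).take(limit)
    return Page(items, items.lastOrNull()) // cursor = last id served
}
```

Walking it: page 1 returns [5, 4]; a new item 6 arrives at the head; page 2 resumes after id 4 and returns [3, 2] — no duplicate, no skip.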
Worked Example — Design a Mobile Payment Flow
Payments interviews at Stripe, Ramp, PayPal, and Google Pay specifically test idempotency, security, and reliability.
Step 1 — Requirements
Functional: initiate payment, confirm status, handle failure, show receipt
Non-functional: exactly-once execution (no double charges), <1s p99 UX
Scale: 1M transactions/day ~ 12 TPS average, 100 TPS peak
Consistency: STRONG — every write to payment state must be durable
Step 2 — Architecture Decisions
Network: HTTPS/REST with TLS 1.3. No WebSocket — request/response is fine.
Idempotency: client generates UUID before sending. Server deduplicates on it.
Token security: card data → Stripe SDK → never touches our servers (PCI scope)
Status polling: POST /payments → 202 Accepted + paymentId → GET /payments/:id
Retry: exponential backoff on 5xx; DO NOT retry on 4xx (non-retryable)
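The retry rule above is easy to state as a predicate. One hedge on the blanket "no 4xx" rule: 429 (rate limited) is the common exception worth mentioning, since it is transient by definition. Function name is illustrative.

```kotlin
// Should a failed payment request be retried? Safe only because every
// retry carries the same idempotency key.
fun isRetryable(httpStatus: Int): Boolean = when (httpStatus) {
    429 -> true            // rate limited — back off, then retry
    in 500..599 -> true    // server fault — transient
    else -> false          // other 4xx: the request itself is wrong; retrying repeats the mistake
}
```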
Step 3 — Components
PaymentViewModel
└── InitiatePaymentUseCase
├── TokenizeUseCase [Stripe SDK — client-side only]
├── PaymentRepository.submit(token, idempotencyKey)
└── PaymentStatusPoller.poll(paymentId) [Flow, 1s intervals, max 30s]
PaymentStateMachine (sealed class):
IDLE → TOKENIZING → SUBMITTING → POLLING → SUCCESS | FAILED | TIMEOUT
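The state machine above can be sketched as a sealed class plus a transition guard. Because the state is one sealed value, SUCCESS and FAILED cannot coexist, and illegal transitions are rejected rather than silently applied. Member names follow the diagram; the guard function is an illustrative addition.

```kotlin
sealed class PaymentState {
    object Idle : PaymentState()
    object Tokenizing : PaymentState()
    object Submitting : PaymentState()
    data class Polling(val paymentId: String) : PaymentState()
    object Success : PaymentState()
    data class Failed(val reason: String) : PaymentState()
    object Timeout : PaymentState()
}

// Encodes the arrows in the diagram; everything not listed is illegal.
fun PaymentState.canTransitionTo(next: PaymentState): Boolean = when (this) {
    PaymentState.Idle -> next is PaymentState.Tokenizing
    PaymentState.Tokenizing -> next is PaymentState.Submitting || next is PaymentState.Failed
    PaymentState.Submitting -> next is PaymentState.Polling || next is PaymentState.Failed
    is PaymentState.Polling -> next is PaymentState.Success ||
        next is PaymentState.Failed || next is PaymentState.Timeout
    else -> false // Success, Failed, Timeout are terminal
}
```

In the ViewModel this becomes a single `StateFlow<PaymentState>`; the UI renders exactly one state at a time.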
Step 4 — Failure Scenarios
Double-tap submit → idempotencyKey UUID — server returns same result
Network drop mid-POST → retry with same idempotencyKey → safe to resend
Server 500 → backoff; after 3 attempts show "try again" UI
Timeout (30s poll) → TIMEOUT state; user contacts support; do NOT retry
App killed during pay → resume from POLLING state via SavedStateHandle
Interview tip: Key signals: (1) the idempotency key is generated client-side before ANY network call, (2) card data never touches your server — it goes through a tokenization SDK, (3) payment state is a sealed class — impossible to be in SUCCESS and FAILED simultaneously.
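The poller's terminal behaviour is the subtle part: it resolves to SUCCESS or FAILED if the server says so, and to TIMEOUT when the attempt budget runs out — and TIMEOUT must not auto-retry. A synchronous model of that rule (the real `PaymentStatusPoller` is a Flow with `delay(1_000)` between GET /payments/:id calls; `fetchStatus` stands in for that call):

```kotlin
// Poll until a terminal status or the budget is exhausted.
fun pollUntilTerminal(maxAttempts: Int, fetchStatus: () -> String): String {
    repeat(maxAttempts) {
        when (val s = fetchStatus()) {
            "SUCCESS", "FAILED" -> return s
        }
        // real implementation: delay(1_000) here
    }
    return "TIMEOUT" // surface to the user; do NOT retry automatically
}
```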
Worked Example — Design an Image Feed (Instagram / Pinterest)
Image feed interviews test cache architecture, progressive loading, scroll performance, and bandwidth optimisation.
Step 1 — Requirements
Functional: infinite scroll feed, images load fast, offline last-seen feed
Non-functional: <200ms image display p99, smooth 60fps scroll, minimal data
Scale: 50M DAU, ~200 image loads/user/day = 10B image loads/day
Step 2 — Architecture Decisions
Images: Coil with memory + disk cache (OkHttp disk cache)
Feed data: Paging 3 + Room (RemoteMediator for offline-first)
Prefetch: load next page when 3 items from end (PagingConfig.prefetchDistance)
Progressive: load thumbnail first (blurhash), then full-res on display
CDN: serve images from CDN; request correct size via URL params (?w=320&q=80)
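Paging 3 fires the next page load internally once you set `PagingConfig(prefetchDistance = 3)` — you configure the trigger, you don't write it. To make the behaviour concrete, the condition it evaluates is roughly the one below (a simplification, not the library's actual code):

```kotlin
// "Load the next page when the user is within prefetchDistance items
// of the end of what's already loaded."
fun shouldLoadNextPage(
    lastVisibleIndex: Int,
    loadedCount: Int,
    prefetchDistance: Int = 3,
): Boolean = lastVisibleIndex >= loadedCount - 1 - prefetchDistance
```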
Step 3 — Component Walkthrough
FeedViewModel → Pager(config, pagingSource = RoomFeedPagingSource)
RemoteMediator.load() → GET /feed?cursor=X → Room.insertAll() → DB emits
LazyColumn → items(pagingItems, key = { it.id })
└── AsyncImage(model = ImageRequest.Builder(context)
        .data(post.thumbnailUrl)       // load thumbnail first
        .placeholder(blurhashDrawable) // instant perceived load
        .crossfade(300)                // smooth swap when full-res arrives
        .build())                      // Compose AsyncImage infers the request size from layout constraints
Step 4 — Failure Scenarios
No network on launch → Paging 3 shows Room cache; RemoteMediator retries
Image load fails → Coil retry(2) + error placeholder
Scroll jank → @Stable item model + key= in LazyColumn + Macrobenchmark
Memory pressure → Coil evicts memory cache; disk cache still available
Interview tip: Mention: (1) request only the display size from the CDN — never download a 4K image for a 300px thumbnail, (2) blurhash/placeholder for perceived performance, (3) Paging 3 + Room RemoteMediator for an offline-first feed.
Worked Example — Design Real-Time Ride Tracking (Uber / DoorDash)
Real-time location interviews test background work, battery efficiency, WebSocket reliability, and map rendering.
Step 1 — Requirements
Functional: driver location updates to rider in real-time, ETA updates, route
Non-functional: <3s location update latency, battery-aware, accurate GPS
Scale: 1M active rides at peak, driver updates every 3s ~ 333K location events/s
Step 2 — Architecture Decisions
Driver → Server: WebSocket (bidirectional, persistent, low latency)
Server → Rider: WebSocket push (server fans out to all riders watching driver)
Location: FusedLocationProviderClient PRIORITY_HIGH_ACCURACY in ForegroundService
Interval: 3s during active ride; 30s during pickup/waiting (adaptive)
Battery: PRIORITY_BALANCED_POWER outside geofence radius of pickup
Step 3 — Components
LocationService (ForegroundService)
└── FusedLocationProviderClient.requestLocationUpdates(3s, 10m)
└── locationFlow (callbackFlow { } → trySend → awaitClose)
└── LocationBatcher.buffer(5 updates) → WebSocket.send(batch)
RideViewModel
└── WebSocketManager.observeDriverLocation() [SharedFlow, replay=1]
└── locationState: StateFlow<LatLng>
└── MapComposable renders GoogleMap with driver marker
ForegroundService notification: "Your driver is en route" (required API 26+)
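The LocationBatcher above is small enough to sketch in full. Buffering five updates per WebSocket frame amortises radio wakeups and framing overhead; `send` stands in for `WebSocket.send(batch)`, and the `LatLng` data class here is a stand-in for the real map type.

```kotlin
data class LatLng(val lat: Double, val lng: Double)

class LocationBatcher(
    private val batchSize: Int,
    private val send: (List<LatLng>) -> Unit, // e.g. serialize + WebSocket.send
) {
    private val buffer = mutableListOf<LatLng>()

    fun onLocation(point: LatLng) {
        buffer += point
        if (buffer.size >= batchSize) flush()
    }

    // Also call on ride end and on WebSocket reconnect so no points are stranded.
    fun flush() {
        if (buffer.isNotEmpty()) {
            send(buffer.toList())
            buffer.clear()
        }
    }
}
```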
Step 4 — Failure Scenarios
WebSocket drops → reconnect with 1s/2s/4s backoff; resend last cursor
GPS signal lost → fallback to NETWORK provider; show accuracy indicator
App backgrounded → ForegroundService keeps location running (required)
Battery saver mode → reduce update interval to 10s; notify user of reduced accuracy
Process death → ForegroundService auto-restarts via START_STICKY
Interview tip: Key signals: (1) a ForegroundService is required to continuously receive location updates while backgrounded on Android 10+; ACCESS_BACKGROUND_LOCATION is a separate permission requirement, (2) FusedLocationProviderClient, not raw GPS, (3) adaptive interval — fast during the ride, slow during the wait, (4) WebSocket, not SSE, because the driver also receives route updates (bidirectional).
Common Questions by Company
Know what to expect based on the company you're interviewing with.
| Company | Likely Topics |
|---|---|
| Google / Android team | Baseline Profiles, Compose stability, modularization at scale, Doze/battery, accessibility |
| Meta / Instagram | Feed rendering at scale, image pipeline, A/B testing architecture, multi-process, offline |
| Netflix | Video streaming buffering, DRM, download manager, adaptive bitrate, background playback |
| Airbnb | Deep links, offline maps/search, accessibility, i18n (RTL), design system, complex navigation |
| Uber / DoorDash | Real-time location tracking (WebSocket), offline-first, background location, push reliability |
| Stripe / Ramp | Payment reliability, idempotency, token security, SDK design, certificate pinning |
| OpenAI / Anthropic | LLM token streaming (SSE), reconnection resilience, incremental rendering, latency optimization |
| Palo Alto Networks | WebView security, certificate pinning, multi-process isolation, root detection, encrypted storage |