Chapter 11
Interview Execution
Answer Framework · Failure Scenarios · Tradeoffs Reference · Common Questions
The 6-Step Answer Framework
Use this structure on every system design question. Deviating — especially jumping to architecture before requirements — is the #1 failure signal at Staff level.
| Step | What to Do | What to Say |
|---|---|---|
| 1 · Clarify | Separate functional from non-functional requirements | "What scale are we targeting? Any latency SLA? Consistency requirements?" |
| 2 · Scale | State DAU, concurrent users, p99 target, data volume | "50M DAU, 500K concurrent, <500ms p99, messages stored 90 days." |
| 3 · Architecture | Name the pattern and justify the choice | "I'd use Clean Architecture with offline-first and WebSocket for real-time." |
| 4 · Components | Walk each layer: network, state, DB, sync, background work | Draw the dependency graph. Explain each boundary and why it exists. |
| 5 · Failures | Cover 3–4 failure scenarios without being prompted | "On network loss I queue in outbox. On 5xx I use exponential backoff." |
| 6 · Tradeoffs | Justify every major decision with explicit tradeoffs | "I chose SSE over WebSocket because this stream is read-only, simpler infra." |
Worked Example — Design a Mobile Chat / AI App
Walk through the 6-step framework applied to a real question.
Step 1 — Requirements (say these out loud)
Functional: send/receive messages, stream AI responses, offline read
Non-functional: <500ms message delivery p99, 99.9% uptime, E2E encrypted
Scale: 10M DAU, 5 sessions/day, 10 msg/session = 500M msgs/day ~ 6k RPS
Consistency: causal (messages in order per conversation)
Step 2 — Architecture Decision
Real-time: WebSocket for bidirectional chat + SSE for AI token streaming
Persistence: Room (messages) + outbox table (unsent) + DataStore (prefs)
Offline: write-local-first, WorkManager drains outbox on reconnect
Sync: delta sync on foreground (lastSyncedAt cursor per conversation)
Step 3 — Component Walkthrough
UI → ChatViewModel (StateFlow)
→ SendMessageUseCase
├── MessageRepository.saveLocal() [Room tx: message + outbox]
└── MessageRepository.streamAI() [OkHttp SSE → callbackFlow]
└── tokens emitted to StateFlow → Compose recomposes
OutboxWorker (WorkManager) reads PENDING → POST /messages → mark SENT
WebSocketManager (singleton) pushes incoming msgs → Room → Flow re-emit
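The write-local-first path above can be sketched as a dependency-free model. This is illustrative, not the real implementation: `MessageStore`, `saveLocal`, and `drainOutbox` are hypothetical names, and in the real app the two inserts would be Room DAO calls inside a single Room transaction, with WorkManager (not a plain function) draining the outbox on reconnect.

```kotlin
enum class OutboxStatus { PENDING, SENT }

data class Message(val id: String, val text: String)
data class OutboxEntry(val messageId: String, var status: OutboxStatus)

class MessageStore {
    val messages = mutableListOf<Message>()
    val outbox = mutableListOf<OutboxEntry>()

    // Step 1: persist locally first — the UI renders from this immediately,
    // regardless of connectivity. In Room this is one transaction so the
    // message and its outbox row are atomic (survives process death together).
    fun saveLocal(message: Message) {
        messages += message
        outbox += OutboxEntry(message.id, OutboxStatus.PENDING)
    }

    // Step 2: a worker drains PENDING rows on reconnect; `send` stands in
    // for the POST /messages call and returns whether it succeeded.
    fun drainOutbox(send: (Message) -> Boolean) {
        outbox.filter { it.status == OutboxStatus.PENDING }.forEach { entry ->
            val msg = messages.first { it.id == entry.messageId }
            if (send(msg)) entry.status = OutboxStatus.SENT
        }
    }
}
```

A failed `send` leaves the row PENDING, so the next worker run retries it for free.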
Step 4 — Failure Scenarios (volunteer these)
Network loss mid-send → outbox PENDING → WorkManager retries on reconnect
SSE stream drops → callbackFlow close(e) → ViewModel shows retry UI
Server 5xx → exponential backoff + jitter, max 5 retries
Process death mid-send → outbox row survives in Room; worker resumes
Auth token expired → OkHttp Authenticator refreshes → replays request
Interview tip: Key tradeoffs to state: WebSocket over SSE (bidirectional needed for chat). Write-local-first over network-first (offline UX is non-negotiable). Causal consistency over strong (ordering per conversation is sufficient).
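"Exponential backoff + jitter" is worth being able to write on a whiteboard. Here is a minimal sketch of the "full jitter" variant: the delay before retry n is drawn uniformly from [0, min(cap, base · 2ⁿ)]. The function name and defaults are illustrative.

```kotlin
import kotlin.random.Random

// Delay (ms) before the given retry attempt (0-based). Full jitter:
// uniform in [0, min(capMs, baseMs * 2^attempt)] — spreads out retries
// so a fleet of clients doesn't hammer a recovering server in lockstep.
fun backoffDelayMs(
    attempt: Int,
    baseMs: Long = 500,
    capMs: Long = 30_000,
    random: Random = Random.Default,
): Long {
    val exp = minOf(capMs, baseMs shl minOf(attempt, 20)) // clamp shift to avoid overflow
    return random.nextLong(exp + 1)
}
```

In the chat app this feeds the outbox worker's retry loop; give up (or surface an error state) after a fixed attempt budget.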
Failure Scenarios — Have These Ready
Proactively addressing failure scenarios shows Staff-level thinking.
| Failure | Response Strategy |
|---|---|
| Network loss mid-request | NetworkCallback detects drop → queue in outbox (Room) → WorkManager retries with exponential backoff |
| Server 5xx errors | Exponential backoff + jitter → surface error state in UI after N retries → never hammer a failing server |
| App crash during write | Room transactions ensure atomicity → WorkManager resumes outbox on next launch automatically |
| Partial streaming response | Buffer tokens received → on disconnect, resume from last confirmed token position if API supports cursors |
| Stale cache served to user | ETag / Last-Modified headers → background refresh with stale-while-revalidate → never block UI on freshness |
| Auth token expired mid-request | OkHttp Authenticator intercepts 401 → refreshes access token → replays original request transparently |
| Out of memory / low memory | onTrimMemory() callback → release L1 cache → Coil/Glide handle bitmap eviction automatically |
| Push notification not received | FCM is not guaranteed → implement pull fallback (delta sync on app foreground) as safety net |
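The 401-refresh-replay row deserves one concrete detail: when several in-flight requests fail with 401 at once, only one refresh should happen. In the real app this logic lives in an `okhttp3.Authenticator`; the sketch below is a dependency-free model of just that deduplication rule, with illustrative names.

```kotlin
// Model of the refresh-once rule an okhttp3.Authenticator implements.
// `refresh` stands in for the token-refresh network call; it returns the
// new access token, or null if the refresh token itself is invalid.
class TokenStore(
    private var accessToken: String,
    private val refresh: () -> String?,
) {
    fun current(): String = accessToken

    // Called when a request came back 401. Returns the token to replay the
    // request with, or null to give up (typically: force re-login).
    @Synchronized
    fun onUnauthorized(failedWith: String): String? {
        // Another caller may already have refreshed; if so, reuse its result
        // instead of burning a second refresh.
        if (failedWith != accessToken) return accessToken
        val fresh = refresh() ?: return null
        accessToken = fresh
        return fresh
    }
}
```

Returning null from an OkHttp Authenticator is what stops the replay loop; the same "null means give up" shape is mirrored here.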
Tradeoffs Reference
Every decision needs a stated tradeoff. Say: 'I chose X over Y because... The cost is...'
| Decision | Option A | Option B |
|---|---|---|
| API protocol | REST — simple, cacheable, mature tooling | GraphQL — flexible, no over/under-fetching |
| Real-time transport | WebSocket — bidirectional, low latency | SSE — simpler, server-to-client only |
| Polling strategy | Long polling — near real-time, fewer requests | Short polling — simple, wastes bandwidth |
| Update delivery | Push (FCM) — battery efficient, real-time | Pull — simple, adds latency, drains battery |
| Data freshness | Cache-first — fast UX, may show stale data | Network-first — always fresh, needs connectivity |
| Offline support | Offline-first — great UX, complex sync logic | Online-only — simple to build, fails without network |
| Consistency | Strong — always correct, slower writes | Eventual — fast, temporary divergence acceptable |
| Pagination | Cursor-based — stable under inserts/deletes | Offset-based — simple, but skips/duplicates on changes |
| UI framework | Full native Compose — best performance, Android only | KMP — shared logic, native UI per platform |
| DI scope | Singleton — one instance app-wide, fast access | ViewModelScoped — fresh per screen, better isolation |
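The pagination row is the one interviewers most often probe, so it helps to show the failure mode concretely. A toy sketch (names invented): items are newest-first, and a cursor is "the id of the last item I saw", so inserts at the head cannot shift the next page — whereas an offset of 2 would re-serve an already-seen item after an insert.

```kotlin
// Cursor-based page fetch over an id list, newest first.
data class Page(val items: List<Int>, val nextCursor: Int?)

fun fetchAfter(all: List<Int>, cursor: Int?, limit: Int): Page {
    val start = if (cursor == null) 0 else all.indexOf(cursor) + 1
    val items = all.drop(start).take(limit)
    return Page(items, items.lastOrNull()) // cursor = last id served
}
```

Walking it: page 1 returns [5, 4]; a new item 6 arrives at the head; page 2 resumes after id 4 and returns [3, 2] — no duplicate, no skip.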
Worked Example — Design a Mobile Payment Flow
Payments interviews at Stripe, Ramp, PayPal, and Google Pay specifically test idempotency, security, and reliability.
Step 1 — Requirements
Functional: initiate payment, confirm status, handle failure, show receipt
Non-functional: exactly-once execution (no double charges), <1s p99 UX
Scale: 1M transactions/day ~ 12 TPS average, 100 TPS peak
Consistency: STRONG — every write to payment state must be durable
Step 2 — Architecture Decisions
Network: HTTPS/REST with TLS 1.3. No WebSocket — request/response is fine.
Idempotency: client generates UUID before sending. Server deduplicates on it.
Token security: card data → Stripe SDK → never touches our servers (PCI scope)
Status polling: POST /payments → 202 Accepted + paymentId → GET /payments/:id
Retry: exponential backoff on 5xx; DO NOT retry on 4xx (non-retryable)
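The retry rule above is easy to state as a predicate. One hedge on the blanket "no 4xx" rule: 429 (rate limited) is the common exception worth mentioning, since it is transient by definition. Function name is illustrative.

```kotlin
// Should a failed payment request be retried? Safe only because every
// retry carries the same idempotency key.
fun isRetryable(httpStatus: Int): Boolean = when (httpStatus) {
    429 -> true            // rate limited — back off, then retry
    in 500..599 -> true    // server fault — transient
    else -> false          // other 4xx: the request itself is wrong; retrying repeats the mistake
}
```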
Step 3 — Components
PaymentViewModel
└── InitiatePaymentUseCase
├── TokenizeUseCase [Stripe SDK — client-side only]
├── PaymentRepository.submit(token, idempotencyKey)
└── PaymentStatusPoller.poll(paymentId) [Flow, 1s intervals, max 30s]
PaymentStateMachine (sealed class):
IDLE → TOKENIZING → SUBMITTING → POLLING → SUCCESS | FAILED | TIMEOUT
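The state machine above can be sketched as a sealed class plus a transition guard. Because the state is one sealed value, SUCCESS and FAILED cannot coexist, and illegal transitions are rejected rather than silently applied. Member names follow the diagram; the guard function is an illustrative addition.

```kotlin
sealed class PaymentState {
    object Idle : PaymentState()
    object Tokenizing : PaymentState()
    object Submitting : PaymentState()
    data class Polling(val paymentId: String) : PaymentState()
    object Success : PaymentState()
    data class Failed(val reason: String) : PaymentState()
    object Timeout : PaymentState()
}

// Encodes the arrows in the diagram; everything not listed is illegal.
fun PaymentState.canTransitionTo(next: PaymentState): Boolean = when (this) {
    PaymentState.Idle -> next is PaymentState.Tokenizing
    PaymentState.Tokenizing -> next is PaymentState.Submitting || next is PaymentState.Failed
    PaymentState.Submitting -> next is PaymentState.Polling || next is PaymentState.Failed
    is PaymentState.Polling -> next is PaymentState.Success ||
        next is PaymentState.Failed || next is PaymentState.Timeout
    else -> false // Success, Failed, Timeout are terminal
}
```

In the ViewModel this becomes a single `StateFlow<PaymentState>`; the UI renders exactly one state at a time.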
Step 4 — Failure Scenarios
Double-tap submit → idempotencyKey UUID — server returns same result
Network drop mid-POST → retry with same idempotencyKey → safe to resend
Server 500 → backoff; after 3 attempts show "try again" UI
Timeout (30s poll) → TIMEOUT state; user contacts support; do NOT retry
App killed during pay → resume from POLLING state via SavedStateHandle
Interview tip: Key signals: (1) the idempotency key is generated client-side before ANY network call, (2) card data never touches your server — it goes through a tokenization SDK, (3) payment state is a sealed class — impossible to be in SUCCESS and FAILED simultaneously.
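The poller's terminal behaviour is the subtle part: it resolves to SUCCESS or FAILED if the server says so, and to TIMEOUT when the attempt budget runs out — and TIMEOUT must not auto-retry. A synchronous model of that rule (the real `PaymentStatusPoller` is a Flow with `delay(1_000)` between GET /payments/:id calls; `fetchStatus` stands in for that call):

```kotlin
// Poll until a terminal status or the budget is exhausted.
fun pollUntilTerminal(maxAttempts: Int, fetchStatus: () -> String): String {
    repeat(maxAttempts) {
        when (val s = fetchStatus()) {
            "SUCCESS", "FAILED" -> return s
        }
        // real implementation: delay(1_000) here
    }
    return "TIMEOUT" // surface to the user; do NOT retry automatically
}
```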
Worked Example — Design an Image Feed (Instagram / Pinterest)
Image feed interviews test cache architecture, progressive loading, scroll performance, and bandwidth optimisation.
Step 1 — Requirements
Functional: infinite scroll feed, images load fast, offline last-seen feed
Non-functional: <200ms image display p99, smooth 60fps scroll, minimal data
Scale: 50M DAU, ~200 image loads/user/day = 10B image loads/day
Step 2 — Architecture Decisions
Images: Coil with memory + disk cache (OkHttp disk cache)
Feed data: Paging 3 + Room (RemoteMediator for offline-first)
Prefetch: load next page when 3 items from end (PagingConfig.prefetchDistance)
Progressive: load thumbnail first (blurhash), then full-res on display
CDN: serve images from CDN; request correct size via URL params (?w=320&q=80)
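Paging 3 fires the next page load internally once you set `PagingConfig(prefetchDistance = 3)` — you configure the trigger, you don't write it. To make the behaviour concrete, the condition it evaluates is roughly the one below (a simplification, not the library's actual code):

```kotlin
// "Load the next page when the user is within prefetchDistance items
// of the end of what's already loaded."
fun shouldLoadNextPage(
    lastVisibleIndex: Int,
    loadedCount: Int,
    prefetchDistance: Int = 3,
): Boolean = lastVisibleIndex >= loadedCount - 1 - prefetchDistance
```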
Step 3 — Component Walkthrough
FeedViewModel → Pager(config, pagingSource = RoomFeedPagingSource)
RemoteMediator.load() → GET /feed?cursor=X → Room.insertAll() → DB emits
LazyColumn → items(pagingItems, key = { it.id })
└── AsyncImage(model = ImageRequest.Builder(context)
        .data(post.thumbnailUrl)       // load thumbnail first
        .placeholder(blurhashDrawable) // instant perceived load
        .crossfade(300)                // smooth swap when full-res arrives
        .build())                      // Compose AsyncImage infers the request size from layout constraints
Step 4 — Failure Scenarios
No network on launch → Paging 3 shows Room cache; RemoteMediator retries
Image load fails → Coil retry(2) + error placeholder
Scroll jank → @Stable item model + key= in LazyColumn + Macrobenchmark
Memory pressure → Coil evicts memory cache; disk cache still available
Interview tip: Mention: (1) request only the display size from the CDN — never download a 4K image for a 300px thumbnail, (2) blurhash/placeholder for perceived performance, (3) Paging 3 + Room RemoteMediator for an offline-first feed.
Worked Example — Design Real-Time Ride Tracking (Uber / DoorDash)
Real-time location interviews test background work, battery efficiency, WebSocket reliability, and map rendering.
Step 1 — Requirements
Functional: driver location updates to rider in real-time, ETA updates, route
Non-functional: <3s location update latency, battery-aware, accurate GPS
Scale: 1M active rides at peak, driver updates every 3s ~ 333K location events/s
Step 2 — Architecture Decisions
Driver → Server: WebSocket (bidirectional, persistent, low latency)
Server → Rider: WebSocket push (server fans out to all riders watching driver)
Location: FusedLocationProviderClient PRIORITY_HIGH_ACCURACY in ForegroundService
Interval: 3s during active ride; 30s during pickup/waiting (adaptive)
Battery: PRIORITY_BALANCED_POWER outside geofence radius of pickup
Step 3 — Components
LocationService (ForegroundService)
└── FusedLocationProviderClient.requestLocationUpdates(3s, 10m)
└── locationFlow (callbackFlow { } → trySend → awaitClose)
└── LocationBatcher.buffer(5 updates) → WebSocket.send(batch)
RideViewModel
└── WebSocketManager.observeDriverLocation() [SharedFlow, replay=1]
└── locationState: StateFlow<LatLng>
└── MapComposable renders GoogleMap with driver marker
ForegroundService notification: "Your driver is en route" (required API 26+)
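The LocationBatcher above is small enough to sketch in full. Buffering five updates per WebSocket frame amortises radio wakeups and framing overhead; `send` stands in for `WebSocket.send(batch)`, and the `LatLng` data class here is a stand-in for the real map type.

```kotlin
data class LatLng(val lat: Double, val lng: Double)

class LocationBatcher(
    private val batchSize: Int,
    private val send: (List<LatLng>) -> Unit, // e.g. serialize + WebSocket.send
) {
    private val buffer = mutableListOf<LatLng>()

    fun onLocation(point: LatLng) {
        buffer += point
        if (buffer.size >= batchSize) flush()
    }

    // Also call on ride end and on WebSocket reconnect so no points are stranded.
    fun flush() {
        if (buffer.isNotEmpty()) {
            send(buffer.toList())
            buffer.clear()
        }
    }
}
```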
Step 4 — Failure Scenarios
WebSocket drops → reconnect with 1s/2s/4s backoff; resend last cursor
GPS signal lost → fallback to NETWORK provider; show accuracy indicator
App backgrounded → ForegroundService keeps location running (required)
Battery saver mode → reduce update interval to 10s; notify user of reduced accuracy
Process death → ForegroundService auto-restarts via START_STICKY
Interview tip: Key signals: (1) a ForegroundService is required to continuously receive location updates while backgrounded on Android 10+; ACCESS_BACKGROUND_LOCATION is a separate permission requirement, (2) FusedLocationProviderClient, not raw GPS, (3) adaptive interval — fast during the ride, slow during the wait, (4) WebSocket, not SSE, because the driver also receives route updates (bidirectional).
Common Questions by Company
Know what to expect based on the company you're interviewing with.
| Company | Likely Topics |
|---|---|
| Google / Android team | Baseline Profiles, Compose stability, modularization at scale, Doze/battery, accessibility |
| Meta / Instagram | Feed rendering at scale, image pipeline, A/B testing architecture, multi-process, offline |
| Netflix | Video streaming buffering, DRM, download manager, adaptive bitrate, background playback |
| Airbnb | Deep links, offline maps/search, accessibility, i18n (RTL), design system, complex navigation |
| Uber / DoorDash | Real-time location tracking (WebSocket), offline-first, background location, push reliability |
| Stripe / Ramp | Payment reliability, idempotency, token security, SDK design, certificate pinning |
| OpenAI / Anthropic | LLM token streaming (SSE), reconnection resilience, incremental rendering, latency optimization |
| Palo Alto Networks | WebView security, certificate pinning, multi-process isolation, root detection, encrypted storage |