Chapter 11

Interview Execution

Answer Framework · Failure Scenarios · Tradeoffs Reference · Common Questions


The 6-Step Answer Framework

Use this structure on every system design question. Deviating — especially jumping to architecture before requirements — is the #1 failure signal at Staff level.

Step | What to Do | What to Say
1 · Clarify | Separate functional from non-functional requirements | "What scale are we targeting? Any latency SLA? Consistency requirements?"
2 · Scale | State DAU, concurrent users, p99 target, data volume | "50M DAU, 500K concurrent, <500ms p99, messages stored 90 days."
3 · Architecture | Name the pattern and justify the choice | "I'd use Clean Architecture with offline-first and WebSocket for real-time."
4 · Components | Walk each layer: network, state, DB, sync, background work | Draw the dependency graph; explain each boundary and why it exists.
5 · Failures | Cover 3–4 failure scenarios without being prompted | "On network loss I queue in an outbox. On 5xx I use exponential backoff."
6 · Tradeoffs | Justify every major decision with explicit tradeoffs | "I chose SSE over WebSocket because this stream is read-only and the infra is simpler."

Worked Example — Design a Mobile Chat / AI App

Walk through the 6-step framework applied to a real question.

Step 1 — Requirements (say these out loud)
Functional: send/receive messages, stream AI responses, offline read
Non-functional: <500ms message delivery p99, 99.9% uptime, E2E encrypted
Scale: 10M DAU, 5 sessions/day, 10 msg/session = 500M msgs/day ~ 6k RPS
Consistency: causal (messages in order per conversation)

Step 2 — Architecture Decision
Real-time: WebSocket for bidirectional chat + SSE for AI token streaming
Persistence: Room (messages) + outbox table (unsent) + DataStore (prefs)
Offline: write-local-first, WorkManager drains outbox on reconnect
Sync: delta sync on foreground (lastSyncedAt cursor per conversation)

Step 3 — Component Walkthrough
UI  →  ChatViewModel (StateFlow)
     →  SendMessageUseCase
        ├── MessageRepository.saveLocal()  [Room tx: message + outbox]
        └── MessageRepository.streamAI()   [OkHttp SSE → callbackFlow]
             └── tokens emitted to StateFlow → Compose recomposes
OutboxWorker (WorkManager) reads PENDING → POST /messages → mark SENT
WebSocketManager (singleton) pushes incoming msgs → Room → Flow re-emit
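
To make the send path concrete, here is a minimal Kotlin sketch of the write-local-first flow above, assuming a Room database with message and outbox tables. MessageEntity, OutboxEntity, the DAOs, and OutboxWorker are illustrative names, not a specific library API.

import androidx.room.withTransaction
import androidx.work.Constraints
import androidx.work.ExistingWorkPolicy
import androidx.work.NetworkType
import androidx.work.OneTimeWorkRequestBuilder
import androidx.work.WorkManager
import java.util.UUID

class SendMessageUseCase(
    private val db: AppDatabase,            // Room database (assumed schema)
    private val workManager: WorkManager,
) {
    suspend fun send(conversationId: String, text: String) {
        val message = MessageEntity(
            id = UUID.randomUUID().toString(),
            conversationId = conversationId,
            text = text,
        )
        // One Room transaction: the message row and its outbox row commit
        // atomically, so a crash mid-send can never lose a queued message.
        db.withTransaction {
            db.messageDao().insert(message)
            db.outboxDao().insert(OutboxEntity(messageId = message.id, status = "PENDING"))
        }
        // WorkManager persists this request across process death and reboot,
        // and only runs the worker once the device is back online.
        workManager.enqueueUniqueWork(
            "drain-outbox",
            ExistingWorkPolicy.KEEP,
            OneTimeWorkRequestBuilder<OutboxWorker>()
                .setConstraints(
                    Constraints.Builder()
                        .setRequiredNetworkType(NetworkType.CONNECTED)
                        .build()
                )
                .build()
        )
    }
}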

Step 4 — Failure Scenarios (volunteer these)
Network loss mid-send  → outbox PENDING → WorkManager retries on reconnect
SSE stream drops       → callbackFlow close(e) → ViewModel shows retry UI
Server 5xx             → exponential backoff + jitter, max 5 retries
Process death mid-send → outbox row survives in Room; worker resumes
Auth token expired     → OkHttp Authenticator refreshes → replays request
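
The SSE leg of the walkthrough can be sketched with OkHttp's okhttp-sse module bridged into a cold Flow. The endpoint URL and request body are assumptions for this sketch; the EventSourceListener callbacks are the module's actual API.

import java.io.IOException
import kotlinx.coroutines.channels.awaitClose
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.callbackFlow
import okhttp3.MediaType.Companion.toMediaType
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.RequestBody.Companion.toRequestBody
import okhttp3.Response
import okhttp3.sse.EventSource
import okhttp3.sse.EventSourceListener
import okhttp3.sse.EventSources

fun OkHttpClient.streamAiTokens(promptJson: String): Flow<String> = callbackFlow {
    val request = Request.Builder()
        .url("https://api.example.com/ai/stream")        // assumed endpoint
        .header("Accept", "text/event-stream")
        .post(promptJson.toRequestBody("application/json".toMediaType()))
        .build()
    val source = EventSources.createFactory(this@streamAiTokens)
        .newEventSource(request, object : EventSourceListener() {
            override fun onEvent(eventSource: EventSource, id: String?, type: String?, data: String) {
                trySend(data)                            // one token/chunk per SSE event
            }
            override fun onClosed(eventSource: EventSource) {
                close()                                  // clean end of stream
            }
            override fun onFailure(eventSource: EventSource, t: Throwable?, response: Response?) {
                close(t ?: IOException("SSE stream dropped"))  // collector sees the error → retry UI
            }
        })
    awaitClose { source.cancel() }                       // stop streaming when the collector cancels
}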

Interview tip — key tradeoffs to state: WebSocket over SSE (chat needs bidirectional traffic); write-local-first over network-first (offline UX is non-negotiable); causal consistency over strong (per-conversation ordering is sufficient).

Failure Scenarios — Have These Ready

Proactively addressing failure scenarios shows Staff-level thinking.

Failure | Response Strategy
Network loss mid-request | NetworkCallback detects the drop → queue in the outbox (Room) → WorkManager retries with exponential backoff
Server 5xx errors | Exponential backoff + jitter → surface an error state in the UI after N retries → never hammer a failing server
App crash during write | Room transactions ensure atomicity → WorkManager resumes the outbox on next launch automatically
Partial streaming response | Buffer the tokens received → on disconnect, resume from the last confirmed token position if the API supports cursors
Stale cache served to user | ETag / Last-Modified headers → background refresh with stale-while-revalidate → never block the UI on freshness
Auth token expired mid-request | OkHttp Authenticator intercepts the 401 → refreshes the access token → replays the original request transparently
Out of memory / low memory | onTrimMemory() callback → release the L1 cache → Coil/Glide handle bitmap eviction automatically
Push notification not received | FCM delivery is not guaranteed → implement a pull fallback (delta sync on app foreground) as a safety net
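
A sketch of the token-refresh row using OkHttp's real Authenticator hook. TokenStore and refreshBlocking() are illustrative stand-ins for your auth layer, not library APIs.

import okhttp3.Authenticator
import okhttp3.Request
import okhttp3.Response
import okhttp3.Route

// OkHttp invokes an Authenticator only after a 401; returning a new Request
// makes OkHttp replay the call transparently, returning null surfaces the 401.
class TokenAuthenticator(private val tokens: TokenStore) : Authenticator {
    override fun authenticate(route: Route?, response: Response): Request? {
        if (responseCount(response) >= 2) return null    // avoid an infinite refresh loop
        val refreshed = tokens.refreshBlocking() ?: return null
        return response.request.newBuilder()
            .header("Authorization", "Bearer $refreshed")
            .build()
    }
    private fun responseCount(response: Response): Int {
        var count = 1
        var prior = response.priorResponse
        while (prior != null) { count++; prior = prior.priorResponse }
        return count
    }
}

// Installed once on the client:
// OkHttpClient.Builder().authenticator(TokenAuthenticator(tokens)).build()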

Tradeoffs Reference

Every decision needs a stated tradeoff. Say: 'I chose X over Y because... The cost is...'

Decision | Option A | Option B
API protocol | REST — simple, cacheable, mature tooling | GraphQL — flexible, no over/under-fetching
Real-time transport | WebSocket — bidirectional, low latency | SSE — simpler, server-to-client only
Polling strategy | Long polling — near real-time, fewer requests | Short polling — simple, wastes bandwidth
Update delivery | Push (FCM) — battery efficient, real-time | Pull — simple, adds latency, drains battery
Data freshness | Cache-first — fast UX, may show stale data | Network-first — always fresh, needs connectivity
Offline support | Offline-first — great UX, complex sync logic | Online-only — simple to build, fails without network
Consistency | Strong — always correct, slower writes | Eventual — fast, temporary divergence acceptable
Pagination | Cursor-based — stable under inserts/deletes | Offset-based — simple, but skips/duplicates on changes
UI framework | Full native Compose — best performance, Android only | KMP — shared logic, native UI per platform
DI scope | Singleton — one instance app-wide, fast access | ViewModelScoped — fresh per screen, better isolation

Worked Example — Design a Mobile Payment Flow

Payments interviews at Stripe, Ramp, PayPal, and Google Pay specifically test idempotency, security, and reliability.

Step 1 — Requirements
Functional: initiate payment, confirm status, handle failure, show receipt
Non-functional: exactly-once execution (no double charges), <1s p99 UX
Scale: 1M transactions/day ~ 12 TPS average, 100 TPS peak
Consistency: STRONG — every write to payment state must be durable

Step 2 — Architecture Decisions
Network: HTTPS/REST with TLS 1.3. No WebSocket — request/response is fine.
Idempotency: client generates UUID before sending. Server deduplicates on it.
Token security: card data → Stripe SDK → never touches our servers (PCI scope)
Status polling: POST /payments → 202 Accepted + paymentId → GET /payments/:id
Retry: exponential backoff on 5xx; DO NOT retry on 4xx (client errors are non-retryable; 429 rate limits are the usual exception)

Step 3 — Components
PaymentViewModel
  └── InitiatePaymentUseCase
      ├── TokenizeUseCase  [Stripe SDK — client-side only]
      ├── PaymentRepository.submit(token, idempotencyKey)
      └── PaymentStatusPoller.poll(paymentId) [Flow, 1s intervals, max 30s]

PaymentStateMachine (sealed class):
  IDLE → TOKENIZING → SUBMITTING → POLLING → SUCCESS | FAILED | TIMEOUT
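
A sketch of two of the key signals named in this example: a sealed state machine and a client-generated idempotency key. PaymentApi and its response shape are assumptions for illustration.

import java.util.UUID

// Sealed hierarchy: the UI can be in exactly one payment state at a time,
// so SUCCESS-and-FAILED-simultaneously is unrepresentable by construction.
sealed class PaymentState {
    object Idle : PaymentState()
    object Tokenizing : PaymentState()
    data class Submitting(val idempotencyKey: String) : PaymentState()
    data class Polling(val paymentId: String) : PaymentState()
    data class Success(val paymentId: String) : PaymentState()
    data class Failed(val reason: String) : PaymentState()
    object Timeout : PaymentState()
}

class PaymentRepository(private val api: PaymentApi) {   // PaymentApi is assumed
    suspend fun submit(cardToken: String): String {
        // Generate the key BEFORE any network call and reuse it on every
        // retry, so the server deduplicates and can never double-charge.
        val idempotencyKey = UUID.randomUUID().toString()
        val response = api.createPayment(
            token = cardToken,
            idempotencyKey = idempotencyKey,             // typically an Idempotency-Key header
        )
        return response.paymentId                        // 202 Accepted + paymentId
    }
}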

Step 4 — Failure Scenarios
Double-tap submit     → idempotencyKey UUID — server returns same result
Network drop mid-POST → retry with same idempotencyKey → safe to resend
Server 500            → backoff; after 3 attempts show "try again" UI
Timeout (30s poll)    → TIMEOUT state; user contacts support; do NOT retry
App killed during pay → resume from POLLING state via SavedStateHandle
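
Continuing the sketch, the status poller from step 3 as a Flow: 1s intervals, a hard 30s budget, and no automatic retry after timeout, since the charge may already have gone through server-side. PaymentApi and PaymentStatus remain assumed shapes.

import kotlinx.coroutines.delay
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.flow

fun pollPaymentStatus(api: PaymentApi, paymentId: String): Flow<PaymentState> = flow {
    repeat(30) {                                         // 30 × 1s = the 30s budget
        when (api.getPayment(paymentId).status) {
            PaymentStatus.SUCCEEDED -> { emit(PaymentState.Success(paymentId)); return@flow }
            PaymentStatus.FAILED    -> { emit(PaymentState.Failed("declined")); return@flow }
            else                    -> emit(PaymentState.Polling(paymentId))  // still processing
        }
        delay(1_000)
    }
    emit(PaymentState.Timeout)   // terminal: hand off to support, do NOT auto-retry
}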
Level | Signal
Senior | Implements a basic payment flow
Staff | Designs idempotency, retry safety, the state machine, and token security scope
Principal | Defines the payment platform contract — idempotency policy, PCI scope boundaries, fraud signal integration

Interview tip — key signals: (1) the idempotency key is generated client-side before ANY network call, (2) card data never touches your server because a tokenization SDK handles it, (3) payment state is a sealed class, so being in SUCCESS and FAILED simultaneously is impossible.

Worked Example — Design an Image Feed (Instagram / Pinterest)

Image feed interviews test cache architecture, progressive loading, scroll performance, and bandwidth optimisation.

Step 1 — Requirements
Functional: infinite scroll feed, images load fast, offline last-seen feed
Non-functional: <200ms image display p99, smooth 60fps scroll, minimal data
Scale: 50M DAU, 200 images/session = 10B image loads/day

Step 2 — Architecture Decisions
Images: Coil with memory + disk cache (OkHttp disk cache)
Feed data: Paging 3 + Room (RemoteMediator for offline-first)
Prefetch: load next page when 3 items from end (PagingConfig.prefetchDistance)
Progressive: load thumbnail first (blurhash), then full-res on display
CDN: serve images from CDN; request correct size via URL params (?w=320&q=80)

Step 3 — Component Walkthrough
FeedViewModel → Pager(config, pagingSource = RoomFeedPagingSource)
RemoteMediator.load() → GET /feed?cursor=X → Room.insertAll() → DB emits
LazyColumn → items(pagingItems, key = { it.id })
  └── AsyncImage(model = ImageRequest.Builder(LocalContext.current)
        .data(post.thumbnailUrl)        // load thumbnail first
        .placeholder(blurhashDrawable)  // instant perceived load
        .crossfade(300)                 // smooth swap when full-res arrives
        .build())
      // Compose AsyncImage measures its layout size and requests exactly
      // that size from Coil; no 4K download for a 300px cell
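
A trimmed sketch of the RemoteMediator.load() named above, using the real Paging 3 signature. FeedApi, PostEntity, the DAO methods, and cursor storage are simplified assumptions.

import androidx.paging.ExperimentalPagingApi
import androidx.paging.LoadType
import androidx.paging.PagingState
import androidx.paging.RemoteMediator
import androidx.room.withTransaction
import java.io.IOException

@OptIn(ExperimentalPagingApi::class)
class FeedRemoteMediator(
    private val api: FeedApi,                            // assumed: GET /feed?cursor=X
    private val db: AppDatabase,
) : RemoteMediator<Int, PostEntity>() {
    override suspend fun load(
        loadType: LoadType,
        state: PagingState<Int, PostEntity>,
    ): MediatorResult = try {
        val cursor = when (loadType) {
            LoadType.REFRESH -> null                     // start from the top
            LoadType.PREPEND -> return MediatorResult.Success(endOfPaginationReached = true)
            LoadType.APPEND  -> db.feedDao().lastCursor() // cursor persisted with the page
        }
        val page = api.getFeed(cursor = cursor, limit = state.config.pageSize)
        db.withTransaction {                             // atomic: clear + insert
            if (loadType == LoadType.REFRESH) db.feedDao().clearAll()
            db.feedDao().insertAll(page.posts)           // Room emits → Paging → UI updates
        }
        MediatorResult.Success(endOfPaginationReached = page.nextCursor == null)
    } catch (e: IOException) {
        MediatorResult.Error(e)                          // surfaces as LoadState.Error → retry UI
    }
}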

Step 4 — Failure Scenarios
No network on launch    → Paging 3 shows Room cache; RemoteMediator retries
Image load fails        → Coil retry(2) + error placeholder
Scroll jank             → @Stable item model + key= in LazyColumn + Macrobenchmark
Memory pressure         → Coil evicts memory cache; disk cache still available
Level | Signal
Senior | Integrates Coil and loads images in a list
Staff | Designs the full cache pipeline, CDN size optimization, prefetch strategy, and Paging 3 offline-first
Principal | Owns the image pipeline platform — CDN strategy, adaptive quality, progressive loading standard, bandwidth budget per user tier

Interview tip — mention: (1) request only the display size from the CDN (never download a 4K image for a 300px thumbnail), (2) blurhash/placeholder for perceived performance, (3) Paging 3 + Room RemoteMediator for an offline-first feed.

Worked Example — Design Real-Time Ride Tracking (Uber / DoorDash)

Real-time location interviews test background work, battery efficiency, WebSocket reliability, and map rendering.

Step 1 — Requirements
Functional: driver location updates to rider in real-time, ETA updates, route
Non-functional: <3s location update latency, battery-aware, accurate GPS
Scale: 1M active rides at peak, driver updates every 3s ~ 333K location events/s

Step 2 — Architecture Decisions
Driver → Server: WebSocket (bidirectional, persistent, low latency)
Server → Rider: WebSocket push (server fans out to all riders watching driver)
Location: FusedLocationProviderClient PRIORITY_HIGH_ACCURACY in ForegroundService
Interval: 3s during active ride; 30s during pickup/waiting (adaptive)
Battery: PRIORITY_BALANCED_POWER_ACCURACY outside the geofence radius of the pickup

Step 3 — Components
LocationService (ForegroundService)
  └── FusedLocationProviderClient.requestLocationUpdates(3s, 10m)
  └── locationFlow (callbackFlow { } → trySend → awaitClose)
  └── LocationBatcher.buffer(5 updates) → WebSocket.send(batch)

RideViewModel
  └── WebSocketManager.observeDriverLocation()  [SharedFlow, replay=1]
  └── locationState: StateFlow<LatLng>
  └── MapComposable renders GoogleMap with driver marker

ForegroundService notification: "Your driver is en route" (a persistent notification is required on API 26+)
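
The locationFlow line above, expanded into a sketch using the real FusedLocationProviderClient callback API; permission handling is assumed to happen elsewhere.

import android.annotation.SuppressLint
import android.location.Location
import android.os.Looper
import com.google.android.gms.location.FusedLocationProviderClient
import com.google.android.gms.location.LocationCallback
import com.google.android.gms.location.LocationRequest
import com.google.android.gms.location.LocationResult
import com.google.android.gms.location.Priority
import kotlinx.coroutines.channels.awaitClose
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.callbackFlow

@SuppressLint("MissingPermission")                       // permission granted before this runs
fun FusedLocationProviderClient.locationFlow(): Flow<Location> = callbackFlow {
    val request = LocationRequest.Builder(Priority.PRIORITY_HIGH_ACCURACY, 3_000L)
        .setMinUpdateDistanceMeters(10f)                 // the "3s, 10m" policy above
        .build()
    val callback = object : LocationCallback() {
        override fun onLocationResult(result: LocationResult) {
            result.lastLocation?.let { trySend(it) }     // emit each fix into the Flow
        }
    }
    requestLocationUpdates(request, callback, Looper.getMainLooper())
    awaitClose { removeLocationUpdates(callback) }       // unregister when the collector cancels
}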

Step 4 — Failure Scenarios
WebSocket drops        → reconnect with 1s/2s/4s backoff; resend last cursor
GPS signal lost        → fallback to NETWORK provider; show accuracy indicator
App backgrounded       → ForegroundService keeps location running (required)
Battery saver mode     → reduce update interval to 10s; notify user of reduced accuracy
Process death          → ForegroundService auto-restarts via START_STICKY
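
A sketch of the 1s/2s/4s reconnect policy from the first scenario above. connect() stands in for whatever suspend call opens the socket and suspends until it closes; cursor replay is assumed to happen inside it.

import java.io.IOException
import kotlinx.coroutines.delay

suspend fun maintainWebSocket(connect: suspend () -> Unit) {
    var backoffMs = 1_000L
    while (true) {
        try {
            connect()                                    // suspends for the life of the connection
            backoffMs = 1_000L                           // clean close → reset backoff
        } catch (e: IOException) {
            // dropped mid-ride: fall through, back off, reconnect, resend last cursor
        }
        delay(backoffMs)
        backoffMs = (backoffMs * 2).coerceAtMost(30_000L) // 1s → 2s → 4s … capped at 30s
    }
}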
Level | Signal
Senior | Implements basic location tracking
Staff | Designs the ForegroundService lifecycle, adaptive intervals, WebSocket reliability, and battery optimisation
Principal | Defines the location platform strategy — accuracy tiers, battery budget, geofencing, multi-modal transport detection

Interview tip — key signals: (1) a ForegroundService is required to keep receiving location updates while backgrounded on Android 10+, and ACCESS_BACKGROUND_LOCATION is a separate permission, (2) FusedLocationProviderClient, not raw GPS, (3) adaptive interval (fast during the ride, slow while waiting), (4) WebSocket rather than SSE, because the driver also receives route updates (bidirectional).

Common Questions by Company

Know what to expect based on the company you're interviewing with.

Company | Likely Topics
Google / Android team | Baseline Profiles, Compose stability, modularization at scale, Doze/battery, accessibility
Meta / Instagram | Feed rendering at scale, image pipeline, A/B testing architecture, multi-process, offline
Netflix | Video streaming buffering, DRM, download manager, adaptive bitrate, background playback
Airbnb | Deep links, offline maps/search, accessibility, i18n (RTL), design system, complex navigation
Uber / DoorDash | Real-time location tracking (WebSocket), offline-first, background location, push reliability
Stripe / Ramp | Payment reliability, idempotency, token security, SDK design, certificate pinning
OpenAI / Anthropic | LLM token streaming (SSE), reconnection resilience, incremental rendering, latency optimization
Palo Alto Networks | WebView security, certificate pinning, multi-process isolation, root detection, encrypted storage
