Chapter 10
Scale & Cross-Platform
SDK Design · Design Systems · Accessibility · i18n · KMP · Image Pipelines
SDK / Library Design
Expected at Stripe, Palo Alto Networks, Anthropic, and any company shipping a developer-facing SDK.
- Minimal API surface — expose only what is necessary; every public API is a contract you must maintain
- Backward compatibility — use @Deprecated with replacement; never remove public APIs in minor versions
- Semantic versioning — MAJOR.MINOR.PATCH; breaking changes = major bump
- Binary compatibility — use Binary Compatibility Validator (Kotlin) to catch ABI breaks in CI
- ProGuard consumer rules — ship consumer-rules.pro so consumers don't need to add keep rules
- Initialization — support both manual init and auto-init via ContentProvider (like Firebase); see the sketch after this list
- Avoid leaking internal types — internal classes must not appear in public API signatures
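A minimal sketch of the auto-init pattern, assuming a hypothetical MySdk entry point: the library declares the provider in its own manifest, the manifest merger adds it to the host app, and onCreate() runs before Application.onCreate().

```kotlin
import android.content.ContentProvider
import android.content.ContentValues
import android.database.Cursor
import android.net.Uri

// Hypothetical SDK init provider. ContentProviders are created before
// Application.onCreate(), so this gives zero-config initialization.
class MySdkInitProvider : ContentProvider() {
    override fun onCreate(): Boolean {
        context?.let { MySdk.init(it.applicationContext) } // MySdk is a stand-in
        return true
    }

    // A pure init provider never serves data; the rest are stubs.
    override fun query(uri: Uri, projection: Array<String>?, selection: String?,
                       selectionArgs: Array<String>?, sortOrder: String?): Cursor? = null
    override fun getType(uri: Uri): String? = null
    override fun insert(uri: Uri, values: ContentValues?): Uri? = null
    override fun delete(uri: Uri, selection: String?, selectionArgs: Array<String>?): Int = 0
    override fun update(uri: Uri, values: ContentValues?, selection: String?,
                        selectionArgs: Array<String>?): Int = 0
}
```

Newer libraries often route this through androidx.startup instead, which shares one provider across many initializers and keeps cold-start cost visible.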
Design System Architecture
L8+ engineers are expected to think about design systems, not just use them.
- Token-based theming — define semantic tokens (colorPrimary, spacingMd, typographyHeadline) not raw values; map tokens to platform values per theme
- MaterialTheme extension — extend Compose MaterialTheme with custom tokens via CompositionLocal; see the sketch after this list
- Component versioning — breaking design changes = new component (Button vs ButtonV2) until migration is complete
- Multi-theme support — light, dark, high-contrast; tokens map differently per theme
- Shared across platforms — with KMP, design tokens can be shared; platform renders natively
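A minimal sketch of the CompositionLocal pattern, assuming a hypothetical AppTokens token set; real design systems typically generate these from a shared token source.

```kotlin
import androidx.compose.foundation.isSystemInDarkTheme
import androidx.compose.material3.MaterialTheme
import androidx.compose.runtime.Composable
import androidx.compose.runtime.CompositionLocalProvider
import androidx.compose.runtime.staticCompositionLocalOf
import androidx.compose.ui.graphics.Color
import androidx.compose.ui.unit.Dp
import androidx.compose.ui.unit.dp

// Hypothetical semantic token set; tokens map differently per theme
data class AppTokens(val colorPrimary: Color, val spacingMd: Dp)

private val LightTokens = AppTokens(colorPrimary = Color(0xFF1B6EF3), spacingMd = 16.dp)
private val DarkTokens = AppTokens(colorPrimary = Color(0xFF8AB4F8), spacingMd = 16.dp)

val LocalAppTokens = staticCompositionLocalOf { LightTokens }

@Composable
fun AppTheme(darkTheme: Boolean = isSystemInDarkTheme(), content: @Composable () -> Unit) {
    CompositionLocalProvider(LocalAppTokens provides if (darkTheme) DarkTokens else LightTokens) {
        MaterialTheme(content = content) // custom tokens ride alongside MaterialTheme
    }
}
```

Components then read LocalAppTokens.current.colorPrimary instead of raw hex values, so retheming is a token-table change rather than a code change.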
Accessibility (a11y)
Google tests this at every level. Airbnb has one of the strongest a11y cultures in the industry.
- Content descriptions — every icon, image, and non-text element needs a meaningful description
- Semantic properties in Compose — use Modifier.semantics { } to provide role, state, and actions to TalkBack; see the sketch after this list
- Touch target size — minimum 48x48dp; use Modifier.minimumInteractiveComponentSize()
- Color contrast — 4.5:1 for normal text; 3:1 for large text (WCAG AA standard)
- Focus order — ensure TalkBack traversal order matches visual order; use isTraversalGroup and traversalIndex to correct
- Screen reader testing — test with TalkBack on a real device before shipping
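A sketch of these properties on a hypothetical FavoriteToggle composable: role, state description, and minimum touch target are all declared explicitly.

```kotlin
import androidx.compose.foundation.clickable
import androidx.compose.material.icons.Icons
import androidx.compose.material.icons.filled.Favorite
import androidx.compose.material.icons.outlined.FavoriteBorder
import androidx.compose.material3.Icon
import androidx.compose.material3.minimumInteractiveComponentSize
import androidx.compose.runtime.Composable
import androidx.compose.ui.Modifier
import androidx.compose.ui.semantics.Role
import androidx.compose.ui.semantics.contentDescription
import androidx.compose.ui.semantics.semantics
import androidx.compose.ui.semantics.stateDescription

@Composable
fun FavoriteToggle(favorited: Boolean, onToggle: () -> Unit) {
    Icon(
        imageVector = if (favorited) Icons.Filled.Favorite else Icons.Outlined.FavoriteBorder,
        contentDescription = null, // description is supplied at the semantics level below
        modifier = Modifier
            .minimumInteractiveComponentSize()                   // enforce the 48dp touch target
            .clickable(role = Role.Checkbox, onClick = onToggle) // TalkBack announces "checkbox"
            .semantics {
                contentDescription = "Favorite"
                stateDescription = if (favorited) "Favorited" else "Not favorited"
            }
    )
}
```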
Internationalization (i18n) & Localization
Building apps for a global audience.
- RTL layout support — use start/end not left/right; test with Arabic or Hebrew locale; Compose handles RTL automatically
- Plurals — use the plurals resource type; never concatenate strings for counts; see the sketch after this list
- String formatting — use getString(R.string.x, arg); never concatenate translated strings with hardcoded text
- Locale-aware formatting — dates, times, currencies must use system locale; never hardcode format strings
- Pseudo-localization — enable pseudolocales (en-XA / ar-XB) in the debug build and set one as the device language to catch layout truncation and hardcoded strings early
- Font scaling — test at 200% font size; use sp for text; ensure layouts don't break
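A sketch of locale-safe formatting, assuming a hypothetical R.plurals.photo_count resource with one/other quantities.

```kotlin
import android.content.Context
import java.text.DateFormat
import java.text.NumberFormat
import java.util.Date

// Assumes a resource like:
// <plurals name="photo_count">
//   <item quantity="one">%d photo</item>
//   <item quantity="other">%d photos</item>
// </plurals>
fun formatPhotoLabel(context: Context, count: Int): String =
    context.resources.getQuantityString(R.plurals.photo_count, count, count)

// Locale-aware values: never hand-roll these format strings
fun formatPrice(amount: Double): String =
    NumberFormat.getCurrencyInstance().format(amount) // "$9.99" in en-US, "9,99 €" in fr-FR

fun formatToday(): String =
    DateFormat.getDateInstance(DateFormat.MEDIUM).format(Date())
```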
Scalable Image Pipeline
Beyond 'use Coil'. Relevant at Instagram, Airbnb, Netflix — any image-heavy app.
- Memory cache (L1) — in-memory LruCache keyed by URL + size; bounded by available RAM
- Disk cache (L2) — DiskLruCache; keyed by URL hash; bounded by configured disk quota; see the two-tier lookup sketch after this list
- Transformations — resize, crop, circle-crop applied before caching; cache stores transformed result not original
- Priority queuing — visible items load first; prefetch off-screen items at lower priority
- Animated images — WebP preferred over GIF (smaller); with Coil, register an animated decoder on the ImageLoader (ImageDecoderDecoder on API 28+, GifDecoder below)
- Placeholder strategy — dominant color placeholder vs BlurHash vs skeleton loader; all better than blank space
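A sketch of the L1/L2 lookup order; DiskCache and fetchAndDecode are hypothetical stand-ins for a real disk LRU wrapper and the network/decode stage.

```kotlin
import android.graphics.Bitmap
import android.util.LruCache

interface DiskCache { // stand-in for a real DiskLruCache wrapper
    suspend fun get(key: String): Bitmap?
    suspend fun put(key: String, bitmap: Bitmap)
}

class ImagePipeline(private val disk: DiskCache) {
    // L1: bounded by a fraction of the app heap, measured in KB
    private val memory = object : LruCache<String, Bitmap>(
        (Runtime.getRuntime().maxMemory() / 1024 / 8).toInt()
    ) {
        override fun sizeOf(key: String, value: Bitmap) = value.byteCount / 1024
    }

    suspend fun load(url: String, width: Int, height: Int): Bitmap {
        val key = "$url-${width}x$height" // key includes size, not just URL
        memory.get(key)?.let { return it }                    // L1 hit
        disk.get(key)?.let { memory.put(key, it); return it } // L2 hit, promote to L1
        return fetchAndDecode(url, width, height).also {      // miss: network + decode
            memory.put(key, it) // cache the transformed result, not the original
            disk.put(key, it)
        }
    }

    private suspend fun fetchAndDecode(url: String, w: Int, h: Int): Bitmap =
        TODO("download, downsample to w x h, apply transformations")
}
```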
Kotlin Multiplatform (KMP) — Staff-Level Deep Dive
KMP shares business logic across Android, iOS, desktop, and web while keeping UI fully native. In 2025–2026 this is increasingly in scope for Staff-level platform strategy discussions at companies like Netflix, among Touchlab's clients, and on any team that ships on both mobile platforms.
- Shared code targets: business logic, domain UseCases, data models, repository interfaces, validation, network client (Ktor), local storage (SQLDelight), and analytics events
- Native-only code: UI layer (Compose Multiplatform on Android, SwiftUI on iOS), platform APIs (camera, biometrics, BLE, push, file system), and navigation stacks
- expect / actual — declare an API in commonMain with expect; each platform provides an actual implementation; used for platform-specific clocks, UUID generation, crypto, file IO (see the sketch after this list)
- Ktor for KMP networking — multiplatform HTTP client; uses OkHttp engine on Android, Darwin (NSURLSession) on iOS; same coroutine-based API across platforms; serialization via kotlinx.serialization (see the client sketch after the table below)
- SQLDelight for KMP persistence — generates type-safe Kotlin APIs from SQL; SQLite on Android/iOS/JVM; multiplatform transactions, migrations, reactive queries via coroutines
- Compose Multiplatform (CMP) — JetBrains extension of Jetpack Compose that targets Android, iOS (Beta), Desktop, and Web; shares UI code beyond just logic; appropriate when the team wants near-100% code share
- Module structure — :shared (commonMain + androidMain + iosMain) produces an Android AAR and an iOS Framework (XCFramework); iOS team consumes via CocoaPods or Swift Package Manager
- KMP vs Flutter — KMP: native rendering, native feel, existing team skills; Flutter: single codebase including UI, own rendering engine (non-native feel), strong for new products
- KMP vs React Native — KMP keeps native UI, RN uses JS bridge or JSI; KMP is better for performance-critical paths; RN is better for web-to-mobile teams
- When to recommend KMP — existing Android/iOS teams, complex business logic that must stay in sync (e.g. pricing rules, validation), gradual adoption possible (start with one UseCase)
- When NOT to recommend KMP — small team with only Android engineers, timeline pressure, UI-heavy app where shared logic savings are minimal, or CMP iOS Beta stability is unacceptable
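A minimal expect/actual sketch for UUID generation; the three declarations live in separate source-set files, as the comments indicate.

```kotlin
// commonMain/Uuid.kt: the contract shared code compiles against
expect fun randomUuid(): String

// androidMain/Uuid.android.kt: JVM-backed implementation
actual fun randomUuid(): String = java.util.UUID.randomUUID().toString()

// iosMain/Uuid.ios.kt: Foundation-backed implementation
actual fun randomUuid(): String = platform.Foundation.NSUUID().UUIDString
```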
| Layer | Shared (commonMain) | Native (androidMain / iosMain) |
|---|---|---|
| Network client | Ktor HttpClient (platform engine injected) | OkHttp engine (Android), Darwin engine (iOS) |
| Persistence | SQLDelight queries & migrations | SQLiteDriver (Android), NativeSqliteDriver (iOS) |
| Business logic | UseCases, domain models, validation | — |
| Platform APIs | expect declarations | actual: Camera, Biometrics, Push, BLE |
| UI | Compose Multiplatform (optional) | Compose (Android), SwiftUI (iOS) |
| DI | Koin (multiplatform) | Hilt (Android only, if not using Koin) |
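A sketch of a shared Ktor client in commonMain, assuming a hypothetical Profile DTO and endpoint; with no engine named explicitly, HttpClient() resolves whichever engine artifact the platform source set depends on (ktor-client-okhttp on Android, ktor-client-darwin on iOS).

```kotlin
import io.ktor.client.HttpClient
import io.ktor.client.call.body
import io.ktor.client.plugins.contentnegotiation.ContentNegotiation
import io.ktor.client.request.get
import io.ktor.serialization.kotlinx.json.json
import kotlinx.serialization.Serializable

@Serializable
data class Profile(val id: String, val name: String) // hypothetical DTO

// One client definition shared by all platforms
val client = HttpClient {
    install(ContentNegotiation) { json() } // kotlinx.serialization for JSON bodies
}

suspend fun fetchProfile(id: String): Profile =
    client.get("https://api.example.com/profiles/$id").body()
```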
Recommended Libraries
- Ktor — Multiplatform async HTTP client. OkHttp engine on Android, Darwin on iOS. Coroutine-based. Best for KMP networking.
- SQLDelight — Generates type-safe Kotlin from SQL. Multiplatform SQLite. Reactive queries via coroutines. Standard for KMP persistence.
- Koin — Lightweight DI framework with multiplatform support. Works in commonMain. Alternative to Hilt for KMP projects.
- Compose Multiplatform — JetBrains extension of Compose for Android, iOS (Beta), Desktop, Web. Shares UI across platforms.
- kotlinx.serialization — Multiplatform JSON/Protobuf serialization. No reflection. Works in commonMain alongside Ktor.
ML/AI on Android — On-Device Inference
TFLite, ML Kit, and on-device LLM inference are Staff-level topics at most major Android shops in 2025. You are expected to reason about when to run inference on-device vs server-side, and what the performance and privacy trade-offs are.
- TensorFlow Lite (TFLite) — run quantized ML models on-device; no network required; model bundled in assets or downloaded via Firebase Model Delivery (see the interpreter sketch after this list)
- ML Kit — Google's on-device ML SDK; pre-built models for text recognition, face detection, barcode scanning, translation; wraps TFLite; zero ML expertise required
- NNAPI (Neural Networks API) — Android hardware abstraction layer for ML; routes inference to GPU, DSP, or NPU when available; TFLite and ML Kit use it automatically
- Model quantization — INT8 quantization reduces model size 4x and speeds up inference 2–4x; quality loss is typically <1% for vision models; required for mobile deployment
- INT8 vs FP16 — INT8 is faster on NNAPI/NPU and uses less memory; FP16 retains more precision; FP32 is the full-precision training format — never ship FP32 to mobile
- On-device vs server inference — on-device: no latency, no cost per call, privacy-preserving, works offline, but limited model size; server: larger models, always up to date, but adds RTT and cost
- MediaPipe — Google's on-device ML framework for real-time pipelines (pose estimation, hand landmarks, face mesh); hardware accelerated; multiplatform
- Firebase ML — model hosting with versioning; A/B test model versions; deliver model updates OTA without app release
- On-device LLM — LiteRT (formerly TensorFlow Lite) under Google AI Edge runs Gemma 2B/7B on the Pixel 8+ NPU; MediaPipe LLM Inference API; typical token throughput: 20–40 tok/s on Pixel 8 Pro
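A minimal sketch of custom TFLite inference, assuming a bundled model.tflite with a float input and a 1000-class output; the asset name and tensor shapes are illustrative.

```kotlin
import android.content.Context
import org.tensorflow.lite.Interpreter
import java.io.FileInputStream
import java.nio.MappedByteBuffer
import java.nio.channels.FileChannel

// Memory-map the model bundled in assets (no copy onto the heap)
fun loadModel(context: Context, assetName: String = "model.tflite"): MappedByteBuffer {
    val fd = context.assets.openFd(assetName)
    return FileInputStream(fd.fileDescriptor).channel
        .map(FileChannel.MapMode.READ_ONLY, fd.startOffset, fd.declaredLength)
}

fun classify(context: Context, input: FloatArray): FloatArray {
    // GPU/NNAPI delegates would be added via Interpreter.Options in production
    val interpreter = Interpreter(loadModel(context))
    val output = Array(1) { FloatArray(1000) } // assumed 1000-class output tensor
    interpreter.run(arrayOf(input), output)    // single synchronous inference
    interpreter.close()
    return output[0]
}
```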
| Approach | Model Size | Latency | Privacy | Use When |
|---|---|---|---|---|
| ML Kit (pre-built) | Built-in / ~10MB | <10ms most tasks | On-device, no data leaves | Barcode, face, OCR, translation — standard tasks |
| TFLite custom model | 0.5MB–50MB quantized | 10–200ms | On-device | Custom classification, NLP, anomaly detection |
| MediaPipe | Varies | Real-time (camera) | On-device | Pose, hand, face tracking in live video |
| On-device LLM (Gemma 2B) | ~1.5GB INT4 | 20–40 tok/s on Pixel 8 Pro | On-device | Chat, summarization without server cost |
| Server inference (Gemini API) | Unlimited | 100–300ms + RTT | Data sent to server | Complex reasoning, large context, latest model |
Recommended Libraries
- ML Kit — Google's on-device ML SDK. Pre-built models for text, face, barcode, translation. No ML expertise needed.
- TensorFlow Lite — Run quantized TF models on-device. NNAPI/GPU delegate for acceleration. Flexible for custom models.
- MediaPipe — Real-time on-device ML pipelines. Pose, hand, face, object detection. Multiplatform, hardware accelerated.
- Google AI Edge (LiteRT) — On-device LLM inference. Runs Gemma 2B/7B on NPU. MediaPipe LLM Inference API.
- Firebase ML — Host, version, and A/B test TFLite models. Deliver model updates OTA without app release.
Interview tip: When asked about ML features, immediately frame it as a make-vs-buy and on-device-vs-server decision. ML Kit for standard tasks (barcode, OCR) is almost always right — built-in, maintained by Google, zero cost per call. Custom TFLite is warranted when no pre-built model covers your use case. Server inference is warranted when model quality matters more than latency and privacy. Saying 'it depends on privacy requirements and offline needs' scores Staff-level points.
GenAI Integration Patterns for Android
In 2025, integrating LLMs into Android apps is a Staff-level expectation at FAANG and most tier-1 shops. The patterns differ significantly from standard API calls — streaming responses, token budgets, on-device vs server routing, and prompt security all apply.
- Streaming responses — LLMs emit tokens, not complete responses; use SSE or streaming HTTP to render progressively; prevents 5–30 second blank-screen wait
- Token streaming to Android — Gemini API supports streamGenerateContent; parse Server-Sent Events; append each token to a StateFlow&lt;String&gt;; Compose LazyColumn auto-scrolls as the text grows (see the ViewModel sketch after this list)
- Prompt injection — user input that attempts to override system prompt instructions; mitigate by never concatenating user content directly into system prompts; use role-separated message format
- Context window budget — LLMs have token limits (Gemini Flash: 1M tokens, but cost scales); send only relevant context; summarize conversation history beyond N turns
- On-device LLM routing — use Gemma 2B on-device for short, privacy-sensitive tasks; route to Gemini server API for complex reasoning; routing decision can be heuristic (message length, topic classification)
- Grounding — LLMs hallucinate; ground responses with retrieved context (RAG); for Android: fetch user's relevant data before prompt; include it explicitly in the prompt as context
- Function calling — Gemini/Claude support structured function call responses; parse the JSON response to trigger native Android actions (open camera, make payment) from LLM output
- Rate limiting and cost control — LLM API calls cost money per token; implement per-user rate limits; debounce streaming calls; cache identical prompts (semantic caching if needed)
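A sketch of the streaming pattern, assuming a hypothetical streamCompletion source that wraps the provider's SSE or chunked-HTTP endpoint and emits token chunks.

```kotlin
import androidx.lifecycle.ViewModel
import androidx.lifecycle.viewModelScope
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.MutableStateFlow
import kotlinx.coroutines.flow.StateFlow
import kotlinx.coroutines.launch

class ChatViewModel(
    private val streamCompletion: (prompt: String) -> Flow<String> // emits token chunks
) : ViewModel() {
    private val _reply = MutableStateFlow("")
    val reply: StateFlow<String> = _reply // Compose collects this and recomposes per chunk

    fun send(prompt: String) {
        _reply.value = ""
        viewModelScope.launch {
            streamCompletion(prompt).collect { token ->
                _reply.value += token // text grows progressively; no blank-screen wait
            }
        }
    }
}
```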
Interview tip: For any AI feature question, structure your answer around: (1) on-device vs server — privacy and latency, (2) streaming vs batch — user experience, (3) prompt security — injection prevention, (4) cost control — token budget and caching. These four dimensions show Staff-level thinking.
API Gateway & Edge Architecture
Staff engineers reason about the full request path, not just what happens inside the app.
- CDN — serve static assets and cacheable API responses from edge nodes; reduces origin load and latency globally
- BFF (Backend For Frontend) — a gateway layer tailored to mobile; aggregates multiple service calls into one mobile-optimised response; reduces round trips and over-fetching
- Rate limiting at gateway — protects origin from DDoS and runaway clients; return 429 with a Retry-After header; the client should respect it (see the interceptor sketch after this list)
- Edge auth — validate JWT at edge before request reaches origin; fail fast, save origin compute
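A sketch of a client honoring the 429 + Retry-After contract, as a hypothetical OkHttp interceptor; a production version would cap retries and add jitter.

```kotlin
import okhttp3.Interceptor
import okhttp3.Response

class RetryAfterInterceptor : Interceptor {
    override fun intercept(chain: Interceptor.Chain): Response {
        val response = chain.proceed(chain.request())
        if (response.code != 429) return response

        // Retry-After is in seconds; fall back conservatively if absent
        val delaySec = response.header("Retry-After")?.toLongOrNull() ?: 5L
        response.close() // must close before re-proceeding on the same chain
        Thread.sleep(delaySec * 1000) // interceptors run on OkHttp's background threads
        return chain.proceed(chain.request())
    }
}
```

The diagram below traces the full request path from the app to the data stores.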
Mobile App
│ HTTPS
▼
CDN / Edge Cache (CloudFront, Fastly)
├── cache static assets, API responses with Cache-Control
├── edge auth, rate limiting, geo-routing
│
▼
API Gateway (Kong, AWS API GW, custom)
├── auth token validation (JWT verify)
├── rate limiting per user/IP
├── request aggregation / BFF (Backend For Frontend)
├── protocol translation (REST → gRPC to internal services)
│
▼
Service Layer (microservices / monolith)
│
▼
Data Stores (DB, cache, object store)
Cost Awareness
Principal engineers think in cost. Every architectural choice has a dollar cost at scale.
| Decision | Cheaper Option | More Expensive Option | Cost Driver |
|---|---|---|---|
| Real-time transport | SSE — stateless HTTP, scales with standard infra | WebSocket — requires sticky sessions or connection broker | Server connection state |
| Data format | Protobuf — 3–10x smaller payload | JSON — verbose | Egress bandwidth at 50M DAU |
| Update delivery | Push (FCM) — server pushes only on change | Polling — client hits server every N seconds regardless | Origin server compute + DB reads |
| Caching | CDN edge cache — serve from edge, zero origin cost | No cache — every request hits origin | Origin compute + DB cost |
| Image storage | WebP at CDN — compressed, edge-served | Original PNG served from origin | Storage + egress |
| Search | Client-side filter on cached list (see the sketch below the table) | Server search on every keystroke | Server compute + DB query cost |
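A sketch of the cheaper search option, assuming a hypothetical SearchState holding an already-cached item list; debouncing keeps even local filtering cheap, and no request leaves the device.

```kotlin
import kotlinx.coroutines.FlowPreview
import kotlinx.coroutines.flow.MutableStateFlow
import kotlinx.coroutines.flow.combine
import kotlinx.coroutines.flow.debounce

@OptIn(FlowPreview::class)
class SearchState(items: MutableStateFlow<List<String>>) {
    val query = MutableStateFlow("")

    // Debounced, in-memory filtering: zero server compute per keystroke
    val results = query
        .debounce(150)
        .combine(items) { q, list ->
            if (q.isBlank()) list else list.filter { it.contains(q, ignoreCase = true) }
        }
}
```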