Chapter 13
Engineering with AI Agents
Specification · Trust · Review · Agent-Safe Architecture · Team Norms
The Specification Discipline
In an AI-augmented workflow, the spec is the primary engineering output — not the code. The quality of what agents produce is strictly bounded by the quality of what you describe. Vague specs produce plausible-looking code that fails in production. Precise specs produce code that is correct by construction within the described constraints.

A production-grade spec has five components:

1. Clear inputs and their constraints — what types, what ranges, what formats, what invariants hold before this code runs.
2. Explicit invariants — what must always be true at the end of every execution path, including error paths.
3. Failure modes — what happens when inputs are invalid, the network is down, the database is locked, the disk is full, or the user cancels mid-operation.
4. Performance envelope — what latency, throughput, and memory consumption are acceptable under normal load and under peak load.
5. Integration contracts — what this code calls, what calls this code, and what guarantees it provides to its callers.

For Android specifically: specs for WorkManager tasks must include the retry policy (linear vs exponential, max attempts), the constraint requirements (network type, charging state, storage not low), whether the task is idempotent (what happens if it runs twice?), and whether it uses setExpedited or runs as a regular task. Specs for Compose components must include what state they own vs what they observe from outside, what user events trigger state changes, what side effects they produce (LaunchedEffect, SideEffect, DisposableEffect), and what triggers recomposition.
- The spec is the primary engineering artifact in an AI workflow — the code is a downstream output of the spec
- A good spec covers inputs and constraints, invariants, failure modes, performance envelope, and integration contracts
- Vague specs are the root cause of plausible-but-wrong AI output — agents can only work with what's written
- Android WorkManager specs must include retry policy, constraint requirements, and idempotency guarantees
- Compose component specs must define owned vs observed state, recomposition triggers, and side effects produced (a spec-first Compose sketch follows the WorkManager example below)
// BAD SPEC — vague, describes what to build, not what it must guarantee
/**
 * Syncs user data with the server.
 * Retries on failure.
 */
class SyncWorker(context: Context, params: WorkerParameters) : CoroutineWorker(context, params) {
    override suspend fun doWork(): Result {
        // Agent generates: catches Exception, retries blindly, no idempotency
        return try {
            api.syncUser()
            Result.success()
        } catch (e: Exception) {
            Result.retry()
        }
    }
}
// GOOD SPEC — precise invariants, failure modes, idempotency, performance envelope
/**
 * Syncs the authenticated user's profile with the remote server.
 *
 * PRECONDITIONS:
 * - Network is available (enforced by WorkManager NETWORK constraint)
 * - User is authenticated (userId stored in DataStore is non-null)
 *
 * INVARIANTS:
 * - Must be idempotent: running twice must produce the same server state as running once
 * - Must propagate CancellationException — never catch it or return Result.retry() on cancellation
 * - Must not retain the WakeLock beyond the doWork() call — WorkManager manages it
 *
 * FAILURE MODES:
 * - HttpException 4xx: not retryable — return Result.failure() with error in outputData
 * - HttpException 5xx: retryable — return Result.retry() (WorkManager uses exponential backoff)
 * - IOException (network): retryable — return Result.retry()
 * - AuthException: not retryable — clear token, return Result.failure(), emit logout event
 * - User cancellation: propagate CancellationException — let WorkManager handle cleanup
 *
 * PERFORMANCE:
 * - Must complete within 10 minutes (WorkManager hard limit)
 * - Must not block the calling coroutine longer than 30s waiting for network response
 * - Must not allocate more than 4 MB for the sync payload
 *
 * INTEGRATION:
 * - Calls: UserApi.syncProfile(userId, lastSyncTimestamp)
 * - Reads: DataStore for userId and lastSyncTimestamp
 * - Writes: DataStore with new lastSyncTimestamp on success
 * - Callers: Scheduled via WorkManager from UserRepository.schedulePeriodicSync()
 */
class SyncWorker(context: Context, params: WorkerParameters) : CoroutineWorker(context, params) {
    override suspend fun doWork(): Result {
        // Agent generates correct implementation bounded by the spec above
        val userId = dataStore.data.first().userId ?: return Result.failure()
        return try {
            val response = api.syncProfile(userId, dataStore.data.first().lastSyncTimestamp)
            dataStore.updateData { it.copy(lastSyncTimestamp = response.syncedAt) }
            Result.success()
        } catch (e: HttpException) {
            // 4xx is a client error — not retryable, so surface it in outputData per the spec
            if (e.code() in 400..499) Result.failure(workDataOf("error" to "http_${e.code()}")) else Result.retry()
        } catch (e: AuthException) {
            dataStore.updateData { it.copy(userId = null, authToken = null) }
            Result.failure(workDataOf("error" to "auth_expired"))
        } catch (e: IOException) {
            Result.retry()
        }
    }
}

Interview tip: Interviewers at AI-era companies are increasingly asking candidates to write a spec before writing code. Practice describing invariants, failure modes, and integration contracts before your hands touch the keyboard.
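The same discipline applies to Compose. Below is a minimal sketch of what a spec-first Compose component might look like; the component and type names (SyncStatusBanner, SyncUiState, onRetry) are hypothetical, and the point is the contract block, not the specific UI.

// Hypothetical example — names are illustrative, not from a real codebase.
/**
 * Displays the current sync status and a retry affordance.
 *
 * STATE OWNERSHIP:
 * - Owns: nothing — stateless; all state is hoisted to the caller
 * - Observes: SyncUiState (Idle, Syncing, Failed) passed as a parameter
 *
 * EVENTS:
 * - onRetry invoked when the user taps "Retry" — the caller decides what retry means
 *
 * SIDE EFFECTS:
 * - None. No LaunchedEffect, no SideEffect — rendering is a pure function of state
 *
 * RECOMPOSITION:
 * - Recomposes only when state changes; SyncUiState is stable (sealed hierarchy of
 *   data objects / data classes)
 */
@Composable
fun SyncStatusBanner(
    state: SyncUiState,
    onRetry: () -> Unit,
    modifier: Modifier = Modifier
) {
    when (state) {
        SyncUiState.Idle -> Unit // render nothing
        SyncUiState.Syncing -> Text("Syncing…", modifier = modifier)
        is SyncUiState.Failed -> Row(modifier = modifier) {
            Text("Sync failed: ${state.reason}")
            TextButton(onClick = onRetry) { Text("Retry") }
        }
    }
}

sealed interface SyncUiState {
    data object Idle : SyncUiState
    data object Syncing : SyncUiState
    data class Failed(val reason: String) : SyncUiState
}

Hoisting all state is a deliberate design choice here: a stateless component is trivially verifiable by visual review or a screenshot test, which keeps it firmly in safe-delegation territory for the trust matrix discussed next.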
The Trust Matrix
Not all tasks are equal candidates for AI delegation. The right mental model is a two-dimensional trust matrix built on two questions: (1) How verifiable is the output — can you tell, quickly and reliably, whether the generated code is correct? (2) What is the blast radius if it is wrong — is the failure recoverable without user harm, or does it cause data loss, security breach, or financial damage?

High verifiability combined with low blast radius is the safe delegation zone. You can verify the output is correct (tests pass, visual review is sufficient, behavior is deterministic) and if something slips through, the failure is caught quickly and fixed cheaply. Low verifiability combined with high blast radius is the human-only zone. You cannot easily tell if the logic is subtly wrong, and when the failure surfaces it is catastrophic or irreversible.

The matrix creates four quadrants: Delegate freely (high verifiability + low blast radius), Delegate with adversarial review (high verifiability + high blast radius), Delegate with skepticism (low verifiability + low blast radius), and Human only (low verifiability + high blast radius).

For Android teams, the trust matrix helps set concrete policies: test scaffolding is delegate-freely because tests either pass or fail and a bad test causes no production impact. Retry logic for payment API calls is human-only because subtle off-by-one errors in backoff calculations don't surface in mocked tests and the blast radius is duplicate charges to real users.
- The trust matrix has two axes: verifiability (can you tell if it's wrong?) and blast radius (what happens if it is?)
- High verifiability + low blast radius = safe to delegate; low verifiability + high blast radius = human only
- Verifiability is not the same as test coverage — it's whether tests can actually surface the failure mode of interest
- Blast radius determines the cost of a mistake reaching production — recoverable cosmetic vs irreversible financial damage
| Task | Verifiability | Blast Radius | Trust Level |
|---|---|---|---|
| Test scaffolding / characterization tests | High — tests pass or fail | Low — caught before production | Delegate freely |
| Boilerplate (adapters, data classes, mappers) | High — compile-time + unit test | Low — no runtime behavior | Delegate freely |
| Compose UI layout / styling | Medium — visual review required | Low — cosmetic only | Delegate with review |
| Room DAO methods for non-critical data | High — instrumented tests cover it | Low — user can retry | Delegate with review |
| WorkManager retry/backoff for analytics | Medium — hard to test all retry paths | Low — analytics not user-critical | Delegate with skepticism |
| Retry logic for payment API calls | Low — edge cases don't surface in mocked tests | High — duplicate charges to users | Human only |
| Auth token refresh flow | Low — race conditions rare in tests | High — security breach or account lockout | Human only |
| Sync conflict resolution for offline-first data | Low — correctness requires deep invariant review | High — data loss or corruption | Human only |
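One way to keep these calls consistent across a team is to make the quadrant explicit rather than implicit in each PR discussion. A minimal sketch, assuming a team wants to encode the two questions as a pure function — all names are hypothetical, and the medium verifiability band from the table above is collapsed into the four quadrants from the text:

// Hypothetical sketch — not a real library. Encodes the trust matrix as a pure
// function so delegation decisions are explicit, testable, and reviewable.
enum class Verifiability { HIGH, LOW }  // can you tell, quickly and reliably, if the output is wrong?
enum class BlastRadius { LOW, HIGH }    // what does a mistake reaching production cost?

enum class TrustLevel {
    DELEGATE_FREELY,
    DELEGATE_WITH_ADVERSARIAL_REVIEW,
    DELEGATE_WITH_SKEPTICISM,
    HUMAN_ONLY
}

fun trustLevel(verifiability: Verifiability, blastRadius: BlastRadius): TrustLevel = when {
    verifiability == Verifiability.HIGH && blastRadius == BlastRadius.LOW -> TrustLevel.DELEGATE_FREELY
    verifiability == Verifiability.HIGH && blastRadius == BlastRadius.HIGH -> TrustLevel.DELEGATE_WITH_ADVERSARIAL_REVIEW
    verifiability == Verifiability.LOW && blastRadius == BlastRadius.LOW -> TrustLevel.DELEGATE_WITH_SKEPTICISM
    else -> TrustLevel.HUMAN_ONLY // low verifiability + high blast radius
}

// Example: payment retry logic has low verifiability and a high blast radius
// check(trustLevel(Verifiability.LOW, BlastRadius.HIGH) == TrustLevel.HUMAN_ONLY)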
AI Code Review Heuristics
AI-generated code has a specific failure profile: it is plausible, it compiles, it passes obvious tests, and it is wrong in non-obvious ways. The failure modes cluster into five categories that experienced reviewers should actively check for.

1. Error handling — AI commonly catches Exception or Throwable at too broad a level, swallows CancellationException (which must always propagate in coroutines), or silently drops errors that should propagate to the caller. The code looks like it handles errors because it has try/catch blocks everywhere, but the handling is wrong.
2. Concurrency — AI generates race conditions that pass single-threaded tests. Common patterns: shared mutable state (a var or MutableList) accessed from multiple coroutines without synchronization, a StateFlow updated from a coroutine scope that can outlive the ViewModel, a LaunchedEffect with a wrong or missing key that causes it to relaunch on every recomposition.
3. Resource management — AI forgets to close resources: streams that are opened but never closed, WakeLocks that are acquired but never released, Jobs that are started but never cancelled when the scope ends, BroadcastReceivers registered in onStart but not unregistered in onStop.
4. Security — AI uses insecure defaults: cleartext HTTP when HTTPS is available, SharedPreferences for sensitive data instead of EncryptedSharedPreferences, missing input validation before database queries, missing permission checks before sensitive operations.
5. Edge cases — AI handles the happy path correctly but not: empty collections where the code assumes at least one element, null in a non-null context that the type system doesn't catch, a network timeout mid-operation that leaves the app in a partial state, a locked database during a write where the operation silently fails.
- Error handling: check for overly broad catches, swallowed CancellationException, and silently dropped errors
- Concurrency: check for shared mutable state without synchronization, StateFlow updated from wrong scope, LaunchedEffect with wrong key
- Resource management: check for unclosed streams, unreleased WakeLocks, uncancelled Jobs, unregistered receivers
- Security: check for cleartext HTTP, sensitive data in SharedPreferences, missing permission checks, missing input validation
- Edge cases: check for assumptions about non-empty collections, missing null handling, partial state after timeout, silent DB failures
// FAILURE MODE 1: Error handling — swallowed CancellationException
suspend fun fetchUser(id: String): User? {
    return try {
        api.getUser(id)
    } catch (e: Exception) { // BUG: catches CancellationException — coroutine won't cancel
        null
    }
}

// FIX: catch specific exceptions and re-throw CancellationException
suspend fun fetchUser(id: String): User? {
    return try {
        api.getUser(id)
    } catch (e: CancellationException) {
        throw e // always propagate
    } catch (e: IOException) {
        null
    }
}

// FAILURE MODE 2: Concurrency — shared mutable state without synchronization
class UserCache {
    private val cache = mutableMapOf<String, User>() // BUG: not thread-safe
    suspend fun get(id: String) = cache[id]
    suspend fun put(id: String, user: User) { cache[id] = user }
}

// FIX: use a Mutex or a concurrent data structure
class UserCache {
    private val mutex = Mutex()
    private val cache = mutableMapOf<String, User>()
    suspend fun get(id: String) = mutex.withLock { cache[id] }
    suspend fun put(id: String, user: User) = mutex.withLock { cache[id] = user }
}

// FAILURE MODE 3: Resource management — stream never closed
suspend fun readFile(uri: Uri): String {
    val stream = contentResolver.openInputStream(uri) // BUG: never closed on exception
    return stream!!.bufferedReader().readText()
}

// FIX: use use() to close on all exit paths
suspend fun readFile(uri: Uri): String {
    return contentResolver.openInputStream(uri)?.use { stream ->
        stream.bufferedReader().readText()
    } ?: ""
}

// FAILURE MODE 4: Security — sensitive data in SharedPreferences
fun saveToken(token: String) {
    prefs.edit().putString("auth_token", token).apply() // BUG: unencrypted storage
}

// FIX: use EncryptedSharedPreferences
fun saveToken(token: String) {
    encryptedPrefs.edit().putString("auth_token", token).apply()
}

// FAILURE MODE 5: Edge case — assumes non-empty list
fun getLatestMessage(messages: List<Message>): Message {
    return messages.last() // BUG: throws NoSuchElementException on empty list
}

// FIX: handle the empty case explicitly
fun getLatestMessage(messages: List<Message>): Message? {
    return messages.lastOrNull()
}

Interview tip: Staff-level candidates are expected to review code with an adversarial mindset — assume the code is plausible but wrong in edge cases and prove it correct, rather than assuming it's correct and looking for obvious bugs.
Agent-Safe Architecture
Some codebases are significantly easier for agents to work within than others. Agent-safe architecture shares characteristics with good architecture in general — but with a different emphasis, driven by how agents work: from what is written, not from what is understood by the team. The five key properties of agent-safe architecture (a code sketch of properties 1 and 3 follows the summary list below):

1. Explicit contracts — interfaces with documented preconditions and postconditions, not implicit conventions. When a human engineer understands an implicit convention (e.g., 'you always call init() before use()'), they apply it consistently. Agents work from what is written in the code and comments. If the contract is not documented, agents violate it.
2. Small, composable components — agents produce significantly better output within bounded scope. A 500-line class with 12 dependencies provides too much context to navigate correctly; agents produce incoherent output that satisfies some dependencies while silently violating others. A 50-line class with 2 dependencies produces correct output reliably. Small components are not just good engineering — they are a prerequisite for effective agent delegation.
3. Strong types — sealed classes, value objects, and type aliases that make invalid states unrepresentable. Agents cannot violate type constraints enforced by the compiler. If your domain model uses String for userId, agents will pass userNames where userIds are required. If your domain model uses a value class UserId, agents cannot make that mistake.
4. Comprehensive tests that cover invariants — agents write code that passes the tests they are shown. If tests only cover happy paths, agents produce code that passes happy paths. Tests that cover invariants — 'the result must always satisfy X regardless of input' — keep agents honest about maintaining those invariants.
5. Architectural Decision Records — when agents write code, the 'why' is not captured anywhere. Agent-safe codebases use ADRs to record key decisions so that the next engineer (and the next agent session) can understand the system without reconstructing the reasoning from the code.
- Explicit contracts (documented preconditions and postconditions) prevent agents from violating implicit conventions they cannot read
- Small, composable components (50 lines, 2 dependencies) produce reliable agent output; large classes produce incoherent output
- Strong types (sealed classes, value objects, UserId not String) let the compiler prevent agent type errors at compile time
- Invariant-covering tests produce agents that maintain invariants; happy-path-only tests produce agents that pass happy paths only
- Architectural Decision Records preserve the 'why' behind decisions that agents cannot capture
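Properties (1) and (3) are the easiest to make concrete in code. A minimal sketch, with hypothetical names (UserId, TransferResult, TransferService) — the point is that both the contract and the type constraints live in the code an agent actually reads:

// Hypothetical sketch — names are illustrative. Property (3): a value class makes
// it impossible to pass a user name where a user id is required.
@JvmInline
value class UserId(val raw: String) {
    init { require(raw.isNotBlank()) { "UserId must be non-blank" } }
}

// Property (3) continued: a sealed result type makes every outcome explicit —
// neither an agent nor a human can forget a case without a compiler error.
sealed interface TransferResult {
    data class Success(val newBalanceCents: Long) : TransferResult
    data class InsufficientFunds(val shortfallCents: Long) : TransferResult
    data object AccountLocked : TransferResult
}

// Property (1): the contract is written down, not implied. An agent editing an
// implementation can read the preconditions and postconditions directly.
interface TransferService {
    /**
     * Transfers [amountCents] from [from] to [to].
     *
     * PRECONDITIONS: amountCents > 0; from != to.
     * POSTCONDITIONS: on Success, the sum of both balances is unchanged (money is
     * conserved); on any other result, neither balance is modified.
     * IDEMPOTENCY: callers supply a unique requestId; repeated calls with the same
     * requestId return the original result without moving money again.
     */
    suspend fun transfer(requestId: String, from: UserId, to: UserId, amountCents: Long): TransferResult
}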
Team Norms for AI-Augmented Teams
When AI coding tools enter a team without deliberate norms, three predictable failure modes emerge. Staff+ engineers are responsible for preventing all three before they appear.

Failure mode 1: The velocity trap. Output volume triples but understanding doesn't keep pace. Engineers ship code they cannot explain during an incident. When production breaks at 2am, nobody can reason about what the code is supposed to do or why it does what it does. The team has traded sustainable velocity for a fragile productivity spike.

Failure mode 2: The adoption divide. Some engineers adopt AI tools enthusiastically and deliver dramatically more output. Others refuse and maintain their previous pace. The team fragments: the adopters feel the refusers are holding the team back; the refusers feel the adopters are shipping code nobody understands. Resentment builds and the team loses coherence.

Failure mode 3: The ownership gap. AI-generated features emerge that nobody fully owns. When something breaks, engineers point to 'the AI wrote it' as a reason not to engage. Ownership is a social contract, not a technical fact — but it breaks down when engineers feel no authorship.

The Staff move is to establish four norms before the failure modes appear:

Norm 1: AI-generated code must be understood by a human before it merges. The author cannot merge code they cannot explain. This prevents the velocity trap.

Norm 2: The author is responsible for explaining every non-trivial decision in the PR, regardless of how the code was generated. This norm preserves ownership — 'the AI wrote it' is not a valid PR description.

Norm 3: High-risk paths (payments, authentication, sync, privacy) require human authorship or adversarial review by a domain expert. This is the trust matrix operationalized as a team norm.

Norm 4: AI-generated features must ship with observability. If you cannot tell when a feature fails in production, it is not ready to ship. Log the error paths, add metrics on retry counts, alert on the failure rate (a sketch follows the summary list below).
- The velocity trap: output triples but understanding doesn't — engineers ship code they cannot debug or explain under pressure
- The adoption divide: early adopters and refusers fragment the team — resentment builds without deliberate norms to bridge the gap
- The ownership gap: AI-generated code with no human owner — 'the AI wrote it' becomes a reason not to engage during incidents
- Norm 1: AI code must be understood by the author before it merges — prevents the velocity trap
- Norm 2: The author explains every non-trivial decision in the PR regardless of generation method — preserves ownership
- Norm 3: High-risk paths require human authorship or adversarial review — operationalizes the trust matrix
- Norm 4: AI-generated features ship with observability — you must be able to tell when they fail in production
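Norm 4 is the most concrete of the four, so a sketch helps. Assuming a hypothetical team-provided Metrics interface (the names below are illustrative, not a real library), making an AI-generated worker observable before it ships might look like:

// Hypothetical sketch — Metrics is an assumed team-provided abstraction, not a
// real library API. The point: every outcome path is counted before the feature ships.
interface Metrics {
    fun increment(name: String, tags: Map<String, String> = emptyMap())
}

// Wraps any worker body so outcomes are visible to on-call engineers.
// Note: constructor injection here requires a custom WorkerFactory.
class ObservedSyncWorker(
    context: Context,
    params: WorkerParameters,
    private val metrics: Metrics,
    private val body: suspend () -> Result
) : CoroutineWorker(context, params) {

    override suspend fun doWork(): Result {
        val result = body()
        // Norm 4: count every outcome, including retries, so failure rates are
        // visible on a dashboard without reading the code.
        when (result) {
            is Result.Success -> metrics.increment("sync.success")
            is Result.Retry -> metrics.increment("sync.retry", mapOf("attempt" to runAttemptCount.toString()))
            is Result.Failure -> metrics.increment("sync.failure")
            else -> metrics.increment("sync.unknown")
        }
        return result
    }
}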
Interview tip: Principal candidates are increasingly being asked: 'How would you set AI tooling standards for your team?' Prepare a concrete answer grounded in risk categories, not tool preferences.
The Interview Dimension
AI tools have created an entirely new category of interview question that no existing preparation resource covers well. Companies interviewing Staff and Principal candidates are now asking directly about AI tooling philosophy — and the answers reveal judgment more clearly than many traditional technical questions.

The specific questions appearing at Staff+ interviews: 'How do you think about AI tooling for your team?' — this is a Staff-level norms question. 'What would you delegate to agents and what would you keep human?' — this is the trust matrix question. 'Tell me about a time AI-generated code caused a production issue on your team or a team you know of, and what you learned from it' — this is a behavioral question testing for calibrated skepticism and learning.

The wrong answers are clear in both directions. 'I don't use AI tools — I prefer to write everything myself for quality control' signals that you are behind the curve on tooling the industry has adopted at scale. Interviewers interpret this as inflexibility or as a candidate who doesn't update their practices. Equally wrong: 'AI handles most of my implementation and I review the output' signals that you have no calibrated judgment about when AI delegation is safe and when it is dangerous. Both answers fail the same test: they are unqualified, undifferentiated positions.

The right answer demonstrates calibrated judgment: specific tasks you delegate (test scaffolding, boilerplate, characterization tests for legacy code), specific tasks you don't delegate (payment logic, auth flows, sync conflict resolution), and a clear principle that separates them — verifiability and blast radius. The answer should be concrete enough that the interviewer can ask 'what about X?' and get a reasoned response, not a blanket rule.

For presenting AI-assisted work on a resume and in interviews: be transparent about it. Claiming in a technical interview that you wrote code an agent generated creates integrity risks. The right framing is: 'I used AI tooling to generate the initial implementation, wrote the spec that bounded what it produced, and adversarially reviewed and corrected the output.' This is honest, demonstrates a complete workflow, and signals exactly the judgment interviewers are looking for.
- Staff+ interviews now include AI tooling philosophy questions — prepare a concrete, calibrated answer before your next interview
- The wrong answer is unqualified in either direction: refusing AI tools signals inflexibility; uncritical adoption signals no judgment
- The right answer names specific delegation decisions, the principle that separates them (verifiability + blast radius), and lessons from when it went wrong
- Be transparent about AI-assisted work in interviews — describe your workflow (spec → delegation → adversarial review) honestly
- Behavioral AI questions ('tell me about a time AI-generated code caused an issue') test for calibrated skepticism and learning from failure
Interview tip: When asked about AI tooling, demonstrate calibrated judgment: name specific tasks you delegate (test scaffolding, boilerplate), specific tasks you don't (payment logic, auth), and the principle that separates them (verifiability + blast radius). Generic answers in either direction signal shallow thinking.