major · staff · 2020

Dropbox Android

The Sync Race

Offline edits silently disappeared when users reconnected. Sync reported success. No crash. No error. Just missing work.

The Incident

Dropbox's offline sync capability — a core product promise — developed an intermittent but reproducible bug in 2020. Users who edited files while offline, reconnected to a network, and briefly lost connectivity again before the upload completed would sometimes find their local edits silently replaced by the server version. The sync engine used optimistic locking with a server-side revision counter, but a race between the job that marked local changes as dirty and the job that fetched remote state created a window in which local edits could be discarded without any error being raised.
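
The interleaving described above can be modeled in a few lines. This is a minimal sketch, not Dropbox's code: `LocalFile`, `edit_offline`, `fetch_remote_steps`, and the `sync_lock` variant are hypothetical names, and the fetch job is written as a generator only so the bad ordering can be replayed deterministically. The first half reproduces the silent overwrite with a "success" status. The second half shows how putting both jobs behind one lock closes the window.

```python
import threading
from dataclasses import dataclass


@dataclass
class LocalFile:
    content: str
    dirty: bool = False  # True while local edits still need uploading


def edit_offline(local: LocalFile, new_content: str) -> None:
    """Mark-dirty job: record a local edit."""
    local.content = new_content
    local.dirty = True


def fetch_remote_steps(local: LocalFile, server_content: str):
    """Unserialized fetch job, written as a generator so the caller can
    pause it between its check and its write."""
    is_clean = not local.dirty           # step 1: look for pending edits
    yield "checked"
    if is_clean:                         # step 2: act on a possibly stale check
        local.content = server_content
        local.dirty = False              # file now looks clean again
    yield "success"


# Reproduce the bad interleaving deterministically.
racy = LocalFile(content="v1")
job = fetch_remote_steps(racy, server_content="v1")
next(job)                                # fetch job sees dirty == False
edit_offline(racy, "v2 (offline edit)")  # edit lands inside the window
racy_status = next(job)                  # overwrite runs on the stale check
print(racy_status, repr(racy.content))   # success 'v1'

sync_lock = threading.Lock()             # one lock shared by both jobs


def edit_offline_serialized(local: LocalFile, new_content: str) -> None:
    with sync_lock:
        edit_offline(local, new_content)


def fetch_remote_serialized(local: LocalFile, server_content: str) -> str:
    with sync_lock:                      # check and write happen as one unit
        if local.dirty:
            return "deferred"            # the upload must run before any overwrite
        local.content = server_content
        return "success"


safe = LocalFile(content="v1")
edit_offline_serialized(safe, "v2 (offline edit)")
safe_status = fetch_remote_serialized(safe, server_content="v1")
print(safe_status, repr(safe.content))   # deferred 'v2 (offline edit)'
```

In the racy run nothing raises and the job reports success, which matches the symptom in the logs. The edit is simply gone.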

Evidence from the Scene

  • Offline edits were occasionally replaced by the older server version after reconnecting
  • The bug only reproduced when connectivity changed twice within a short window — e.g., Wi-Fi → mobile data → Wi-Fi
  • Sync logs showed the operation as 'success' — no error was recorded on the device
  • Adding artificial network delays in internal testing made the bug reproduce on every attempt
  • The bug was confirmed on both Android and iOS, ruling out an Android-specific implementation issue

The Suspects

Three of these are the real root causes. The others are plausible-sounding distractors.

Conflict resolution using server wall-clock timestamp instead of a logical sequence number or vector clock

Race between the 'mark local changes dirty' job and the 'fetch remote state' job running concurrently without a serialization lock

The 'dirty' flag marking unsynchronized local changes stored only in memory — lost on process restart or connectivity event

Room database writes not wrapped in transactions, causing partial updates visible mid-write

Android killing the sync WorkManager worker before the upload network request completed

Upload retry logic using fixed intervals instead of exponential backoff, causing network congestion on reconnect

The Verdict

Real Root Causes

  • Conflict resolution using server wall-clock timestamp instead of a logical sequence number or vector clock

    Wall-clock timestamps cannot reliably establish causal order for offline edits. If the server's clock drifts, or two edits from different devices arrive at similar times, the timestamp comparison can resolve a conflict in favor of the older version. Logical clocks avoid this: a Lamport timestamp yields an ordering consistent with causality, and a vector clock can additionally detect that two edits were concurrent, all without reference to wall-clock time.

  • Race between the 'mark local changes dirty' job and the 'fetch remote state' job running concurrently without a serialization lock

    If the fetch-remote-state job checks for pending local changes just before the mark-dirty job records them, it schedules an overwrite based on a stale view. The overwrite then runs after the edit has landed, with no knowledge of the pending local changes. A serialized job queue or mutex prevents this specific interleaving.

  • The 'dirty' flag marking unsynchronized local changes stored only in memory — lost on process restart or connectivity event

    If the dirty flag is held only in memory, a process restart or aggressive background kill can clear it before the upload completes. The sync engine then treats a locally modified file as clean and eligible for server-side replacement. Persisting the dirty flag atomically with the edit, for example in a Room transaction, prevents this data loss.
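
The third fix, persisting the dirty marker in the same transaction as the edit, can be sketched with Python's built-in `sqlite3` module standing in for Room. The schema, the table names, and the helpers `open_store`, `save_local_edit`, and `pending_uploads` are illustrative assumptions, not Dropbox's actual storage layer. Closing and reopening the database stands in for the process being killed and restarted.

```python
import os
import sqlite3
import tempfile


def open_store(path: str) -> sqlite3.Connection:
    """Open (or create) the on-disk store. Hypothetical two-table schema."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, content TEXT NOT NULL)"
    )
    conn.execute("CREATE TABLE IF NOT EXISTS pending_uploads (path TEXT PRIMARY KEY)")
    conn.commit()
    return conn


def save_local_edit(conn: sqlite3.Connection, path: str, content: str) -> None:
    """Write the edit and its dirty marker in one transaction."""
    with conn:  # commits both rows together, or rolls both back on an exception
        conn.execute(
            "INSERT OR REPLACE INTO files (path, content) VALUES (?, ?)",
            (path, content),
        )
        conn.execute(
            "INSERT OR REPLACE INTO pending_uploads (path) VALUES (?)",
            (path,),
        )


def pending_uploads(conn: sqlite3.Connection) -> list:
    """Paths whose local edits have not been uploaded yet."""
    rows = conn.execute("SELECT path FROM pending_uploads ORDER BY path")
    return [row[0] for row in rows]


db_path = os.path.join(tempfile.mkdtemp(), "sync.db")

conn = open_store(db_path)
save_local_edit(conn, "notes.txt", "offline edit")
conn.close()                     # stand-in for the process being killed

conn = open_store(db_path)       # "restart": all in-memory state is gone
print(pending_uploads(conn))     # ['notes.txt']
```

Because the marker lives on disk next to the content, a restarted sync engine can still tell that `notes.txt` has unsent changes and must not be replaced by the server copy.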

Plausible But Wrong

  • Room database writes not wrapped in transactions, causing partial updates visible mid-write

    Non-transactional Room writes can cause partial visibility, but the symptom described is a complete, silent replacement of local edits — not partial corruption. The root cause is in the sync protocol logic, not Room write semantics.

  • Android killing the sync WorkManager worker before the upload network request completed

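    Killing the worker mid-upload would surface as a failed upload followed by a retry, not as a silent, successful overwrite. The clue that 'sync reported success' rules this out as the direct cause. A process kill matters only indirectly, through the in-memory dirty flag listed among the real root causes.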

  • Upload retry logic using fixed intervals instead of exponential backoff, causing network congestion on reconnect

    Retry strategy affects upload reliability under congestion — not which version semantically wins a conflict. The overwrite happens even on a clean first upload attempt, ruling out retry-related contention.

Summary

Dropbox's sync race was a distributed systems problem manifesting in a mobile client. The sync engine treated the server as the authoritative source of truth without a mechanism to distinguish 'local changes the server hasn't seen yet' from 'no local changes' during a conflict. The fix required three coordinated changes: (1) persisting the dirty flag atomically with the edit in a Room transaction, (2) serializing fetch-remote and mark-dirty jobs through a single WorkManager queue, and (3) replacing server timestamps with a logical sequence number for conflict detection. Dropbox's 'Nucleus' sync architecture, documented in their engineering blog, was built around these guarantees.
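
Change (3) can be illustrated by contrasting the two comparison rules. This is a sketch under stated assumptions: `ServerVersion`, `resolve_by_timestamp`, and `resolve_by_revision` are hypothetical names, and the scenario assumes a device-recorded edit time is compared against a server-recorded time, which is one plausible way a timestamp rule can go wrong. The revision rule asks only whether the server has moved past the revision the local edit was based on.

```python
from dataclasses import dataclass


@dataclass
class ServerVersion:
    content: str
    rev: int          # logical sequence number, bumped on each accepted write
    stored_at: float  # server wall-clock time, kept only for the contrast


def resolve_by_timestamp(local_content: str, local_edited_at: float,
                         server: ServerVersion) -> str:
    """Last-writer-wins on wall-clock time."""
    if local_edited_at > server.stored_at:
        return local_content
    return server.content


def resolve_by_revision(local_content: str, base_rev: int, dirty: bool,
                        server: ServerVersion) -> tuple:
    """Decide using only the revision the edit was based on and the dirty flag."""
    if not dirty:
        return ("take_server", server.content)
    if base_rev == server.rev:
        return ("upload_local", local_content)  # server unchanged since our base
    return ("conflict", local_content)          # both sides changed: keep local, flag it


server = ServerVersion(content="server copy", rev=7, stored_at=1_000.0)

# The device synced rev 7 and then edited offline, but its clock runs behind
# the server's, so the newer edit carries an older-looking timestamp.
lost = resolve_by_timestamp("offline edit", local_edited_at=950.0, server=server)
kept = resolve_by_revision("offline edit", base_rev=7, dirty=True, server=server)
print(lost)   # server copy
print(kept)   # ('upload_local', 'offline edit')

# Another device pushed rev 8 in the meantime: a visible conflict, not a silent overwrite.
newer = ServerVersion(content="other device's edit", rev=8, stored_at=1_100.0)
diverged = resolve_by_revision("offline edit", base_rev=7, dirty=True, server=newer)
print(diverged)  # ('conflict', 'offline edit')
```

The revision rule is only as reliable as the dirty flag it reads, which is why the three changes had to ship together.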

The Real Decision That Caused This

Designing the sync protocol with in-memory dirty state and server timestamps for conflict resolution — creating a race condition window that was invisible in standard testing where network transitions are clean and instantaneous.

Lesson Hint

Chapter 5 (Data & Persistence) covers offline-first sync architecture, conflict resolution strategies, and why logical clocks matter for multi-device consistency. Chapter 6 (Concurrency) covers race conditions, mutex patterns, and why shared mutable state must be protected even in background workers.
