critical · senior · 2018

Slack Android

The ANR Epidemic

Slack's Android app generated thousands of ANR reports daily. Years of patches failed to stop them. The team concluded that the only viable fix was a rewrite from scratch.

The Incident

Slack's Android app had accumulated significant technical debt by 2018. The app frequently showed Android's 'Application Not Responding' (ANR) dialog, which appears when the main thread is blocked long enough that the app fails to respond to input for about 5 seconds. Engineers capturing ANR traces found a recurring pattern: the main thread was blocked on synchronous I/O operations that had been written during early development and never moved off the main thread as the codebase grew. The team ultimately concluded that incremental fixes were not viable at the scale of violations found and executed a full native Kotlin rewrite over 18 months.

Evidence from the Scene

  • ANR dialogs appeared multiple times per day for active users
  • ANR traces consistently showed the main thread blocked on SharedPreferences reads
  • Opening a channel for the first time triggered a synchronous SQLite query on the main thread
  • The app was written in Java with no systematic async I/O patterns across the codebase
  • Enabling StrictMode internally revealed main-thread I/O violations at hundreds of call sites
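
The last clue is easier to weigh with a picture of what such a check does. StrictMode flags disk and network access that happens on the main thread. The class below is only a miniature plain-JDK analogue of that idea, not Android's StrictMode API and not Slack's code: the name MainThreadGuard and its methods are hypothetical, and the "main thread" is simply whichever thread is passed to the constructor.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical, simplified analogue of a main-thread I/O detector. */
final class MainThreadGuard {
    private final Thread mainThread;
    private final AtomicInteger violations = new AtomicInteger();

    MainThreadGuard(Thread mainThread) {
        this.mainThread = mainThread;
    }

    /**
     * Call at the top of any blocking operation (disk, database, network).
     * Records a violation only when the caller is the designated main thread.
     */
    void noteBlockingCall(String what) {
        if (Thread.currentThread() == mainThread) {
            violations.incrementAndGet();
            System.err.println("main-thread I/O: " + what);
        }
    }

    int violationCount() {
        return violations.get();
    }
}
```

Under this model, every blocking call site reached from the main thread adds one to the count, which is how a codebase with years of inline I/O can surface hundreds of violations as soon as the check is switched on.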

The Suspects

Three of these six are the real root causes. The other three are plausible-sounding distractors.

SharedPreferences.getString() called synchronously on the main thread for workspace and notification settings

SQLite queries executed synchronously on the main thread for channel history on every navigation event

No async I/O architecture — all network and disk operations written directly in Activity and Fragment lifecycle methods

Image loading library performing full-resolution network fetches on the main thread

BroadcastReceiver performing long database operations in onReceive() without a background Worker

Deeply nested XML layouts causing slow inflation on every channel open

The Verdict

Real Root Causes

  • SharedPreferences.getString() called synchronously on the main thread for workspace and notification settings

    SharedPreferences reads block the calling thread if the preference file has not yet been loaded into memory: the read waits until the whole file has been read from disk and parsed. In a large app with dozens of preference keys, the first read for any file therefore waits on disk I/O. Multiplied across channel opens, message sends, and notification handling, these waits added up to main-thread stalls long enough to trigger ANRs.

  • SQLite queries executed synchronously on the main thread for channel history on every navigation event

    Fetching message history for a channel requires a non-trivial SQLite scan. Running this on the main thread blocks UI rendering for the entire query duration. On large workspaces with thousands of messages, this produced consistent ANRs on every channel switch.

  • No async I/O architecture — all network and disk operations written directly in Activity and Fragment lifecycle methods

    The original Slack Android codebase was written in Java without systematic background threading for I/O. Synchronous Retrofit calls, raw SQLite access, and SharedPreferences reads had been written directly in lifecycle methods across years of feature development. When StrictMode was enabled, the violations numbered in the hundreds, too many to fix incrementally without introducing regressions.
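
The first root cause can be modelled outside Android. The sketch below is a plain-JDK approximation of the load-then-wait behaviour described for SharedPreferences, not the real implementation: LazyPrefs and its diskLoader parameter are hypothetical names, and the "file" is whatever map the loader returns.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

/** Hypothetical model: a preference file that loads in the background. */
final class LazyPrefs {
    private final CompletableFuture<Map<String, String>> loaded;

    /** Starts reading the "file" on a background thread immediately. */
    LazyPrefs(Supplier<Map<String, String>> diskLoader) {
        this.loaded = CompletableFuture.supplyAsync(diskLoader);
    }

    /** True once the background load has finished. */
    boolean isLoaded() {
        return loaded.isDone();
    }

    /**
     * Blocks the calling thread until the whole file is in memory,
     * then answers from the in-memory map.
     */
    String getString(String key, String defaultValue) {
        return loaded.join().getOrDefault(key, defaultValue);
    }
}
```

In this model a read issued right after construction pays the full load time on the calling thread, while a read issued after the load has finished returns from memory. If the calling thread is the main thread, the first case is exactly the stall the ANR traces recorded.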

Plausible But Wrong

  • Image loading library performing full-resolution network fetches on the main thread

    Modern image loading libraries (Glide, Picasso, Coil) are async by default and do not block the main thread. The ANR traces in the clues specifically show SharedPreferences and SQLite as the blocking operations, not image loading.

  • BroadcastReceiver performing long database operations in onReceive() without a background Worker

    Long BroadcastReceiver operations can cause ANRs — but the ANR traces in the clues specifically show SharedPreferences and SQLite blocking the main thread during user-initiated navigation, not background broadcasts.

  • Deeply nested XML layouts causing slow inflation on every channel open

    Complex layouts add tens of milliseconds to inflation time, not the 5+ second main-thread blocks needed to trigger ANR dialogs. The ANR traces in the clues point to I/O, not rendering.

Summary

Slack's ANR problem was a symptom of building an app before Android's modern async ecosystem existed and then never migrating off the original patterns as the codebase grew. SharedPreferences reads, SQLite queries, and network calls ran directly on the main thread across hundreds of call sites. When StrictMode was enabled internally, the violation count was too high to address incrementally without risking regressions. The team executed a full Kotlin rewrite with structured concurrency (coroutines for all I/O), Room for async database access, and DataStore as a SharedPreferences replacement. The rewrite eliminated ANRs as a category of bug.
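
The rewrite described here used Kotlin coroutines. The same thread discipline can be sketched in plain Java with CompletableFuture, which is the technique substituted in this example. It is a minimal illustration under stated assumptions, not Slack's code: ChannelHistoryLoader, blockingQuery, and the two executors are hypothetical names, and the "main" executor stands in for whatever mechanism delivers work to the UI thread.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.Consumer;
import java.util.function.Function;

/** Hypothetical loader that keeps a blocking query off the main thread. */
final class ChannelHistoryLoader {
    private final Executor ioExecutor;    // runs blocking disk/database work
    private final Executor mainExecutor;  // delivers results to the UI side
    private final Function<String, List<String>> blockingQuery;

    ChannelHistoryLoader(Executor ioExecutor,
                         Executor mainExecutor,
                         Function<String, List<String>> blockingQuery) {
        this.ioExecutor = ioExecutor;
        this.mainExecutor = mainExecutor;
        this.blockingQuery = blockingQuery;
    }

    /**
     * Runs the query on the I/O executor and hands the rows to onResult
     * on the main executor. The caller returns immediately.
     */
    CompletableFuture<Void> loadHistory(String channelId,
                                        Consumer<List<String>> onResult) {
        return CompletableFuture
                .supplyAsync(() -> blockingQuery.apply(channelId), ioExecutor)
                .thenAcceptAsync(onResult, mainExecutor);
    }
}
```

The design point is that callers cannot run the query inline: the only entry point is asynchronous, so the thread choice is made once in the loader rather than at each of hundreds of call sites.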

The Real Decision That Caused This

Writing synchronous I/O operations on the main thread during early development and never systematically migrating them as the codebase scaled — until the violation count made incremental repair infeasible.

Lesson Hint

Chapter 6 (Concurrency) covers coroutines, structured concurrency, and why moving I/O off the main thread is an architectural commitment, not a tactical fix. Chapter 2 (App Architecture) covers how the ViewModel/Repository pattern enforces thread discipline that prevents this anti-pattern.
