Minding your Handlers and Queues

Flaky tests are the absolute worst; getting rid of them is my favorite.

Jesse Wilson

It’s true! No one likes a flaky test. They slow down development effort and reduce confidence in your test suite and testing tools. Given that we’re the primary maintainers of Paparazzi, when one of our Paparazzi test shards became flaky on CI, it was time to dig in and figure out why.

Background

Layoutlib is a library bundled with Android Studio which powers the XML layout and compose preview panes.

Paparazzi uses Layoutlib to emulate the Android runtime environment, thus eliminating the need for an emulator or real device when generating screenshots.

Generating screenshots without an emulator or device is great! The tests are fast and you get to avoid much of the Android toolchain in the build process, but as we’ll see with our flaky test, using Layoutlib outside of the context of Android Studio can lead to some interesting problems.

The stack trace

This is the stack trace our failed CI shard would generate intermittently:

java.util.ConcurrentModificationException
  at java.base/java.util.WeakHashMap.forEach(WeakHashMap.java:1031)
  at com.android.layoutlib.bridge.util.HandlerMessageQueue.extractFirst(HandlerMessageQueue.java:73)
  at android.os.Handler_Delegate.executeCallbacks(Handler_Delegate.java:95)
  at app.cash.paparazzi.Paparazzi.withTime(Paparazzi.kt:337)
  at app.cash.paparazzi.Paparazzi.takeSnapshots(Paparazzi.kt:298)
  at app.cash.paparazzi.Paparazzi.snapshot(Paparazzi.kt:214)
  at app.cash.paparazzi.Paparazzi.snapshot$default(Paparazzi.kt:213)

From the ConcurrentModificationException JavaDoc:

This exception may be thrown by methods that have detected concurrent modification of an object when such modification is not permissible.

For example, it is not generally permissible for one thread to modify a Collection while another thread is iterating over it.

Looking at the stack trace, it’s easy enough to see where the first condition for this exception occurs, but where is the second? Let’s take a closer look at HandlerMessageQueue:

public class HandlerMessageQueue {
  private final WeakHashMap<...> runnablesMap = new WeakHashMap<>();

  public void add(Handler h, long uptimeMillis, Runnable r) {
    LinkedList<...> runnables = runnablesMap.computeIfAbsent(h, k -> new LinkedList<>());
    ...
  }

  public Runnable extractFirst(long uptimeMillis) {
    ...
    runnablesMap.forEach((h, l) -> {
      ... 
    });
    ... 
  }

  public void clear() {
    runnablesMap.clear();
  }

We can see that runnablesMap is responsible for the exception and can now point to two places where our map might be getting altered: HandlerMessageQueue.clear and HandlerMessageQueue.add.

Pursuing our leads

HandlerMessageQueue.clear()

If we follow the call hierarchy up from this method we see that, eventually, Paparazzi does make a call that will result in this method being executed, but it’s unlikely that this is the culprit; JUnit tests are single-threaded by default and this method is only called at the end of a given test run.

HandlerMessageQueue.add(Handler, long, Runnable)

This method is a little more complicated but right off the bat we have an interesting line to look at:

public void add(Handler h, long uptimeMillis, Runnable r) {
  LinkedList<...> runnables = runnablesMap.computeIfAbsent(h, k -> new LinkedList<>());
  ...
}

runnablesMap.computeIfAbsent will search for the Handler and if it’s not present in the map, it will be associated with a new, empty LinkedList.

At this point, I should mention that Layoutlib’s APIs are generally called from a single thread since it’s dealing with rendering views which typically occur on Android’s main thread. However, apps are rarely single-threaded.

Handlers can be created from any thread and posted to any thread. Layoutlib will intercept calls to Handler.post and call HandlerMessageQueue.add instead, meaning it can run from any thread, too. If that occurs while HandlerMessageQueue.extractFirst is running, it could result in a ConcurrentModificationException.

To test this out we can use the following test case:

@Test 
fun test() {
  // 1
  thread {
    while (true) {
      Handler(Looper.getMainLooper()).post{}
    }
  }
  // 2
  for (i in 0..10) {
    Handler(Looper.getMainLooper()).post {}
  }
  // 3
  paparazzi.snapshot {}
}
  1. Start a background thread which creates new Handlers. This will result in calls to HandlerMessageQueue.add.
  2. Give the main thread some scheduled work in order to increase the likelihood of HandlerMessageQueue.add and HandlerMessageQueue.extractFirst running at the same time.
  3. Call paparazzi.snapshot in order to cause HandlerMessageQueue.executeFirst to run and clear the main thread work.

After running this test we will see the exact same ConcurrentModificationException get thrown. This means somewhere in our code we are doing something that has a similar effect.

Zeroing in on the cause

It turns out that one of our views transitively uses LottieAnimationView. A quick search of the Lottie GitHub repository for Handler yields a hit on a class called LottieTask. Could we be doing something that leads to the creation of this LottieTask? Let’s see!

One of the first methods we call on LottieAnimationView is setAnimation which calls fromRawRes in turn:

// LottieAnimationView
public void setAnimation(final int rawRes) {
  ...
  setCompositionTask(fromRawRes(rawRes));
}

private LottieTask<...> fromRawRes(final int rawRes) {
  ...
  return cacheComposition ? LottieCompositionFactory.fromRawRes(getContext(), rawRes) : LottieCompositionFactory.fromRawRes(getContext(), rawRes, null);
  ...
}

LottieCompositionFactory.fromRawRes() is the method that actually creates our LottieTask and here’s where things start to get interesting.

public class LottieTask<T> {
  public static Executor EXECUTOR = Executors.newCachedThreadPool();

  private final Handler handler = new Handler(Looper.getMainLooper());

  LottieTask(Callable<LottieResult<T>> runnable, boolean runNow) {
    if (runNow) {
      try {
        setResult(runnable.call());
      } catch (Throwable e) {
        setResult(new LottieResult<>(e));
      }
    } else {
      EXECUTOR.execute(new LottieFutureTask(runnable));
    }
  }

  private void notifyListeners() {
    // Listeners should be called on the main thread.
    handler.post(() -> {
      ... 
    });
  }

LottieCompositionFactory.fromRawRes() will eventually call the constructor above with runNow set to false and the Runnable will be executed on a background thread via the EXECUTOR. When the task is done, it will notifyListeners() which will cause the new Handler to be added to the HandlerMessageQueue.

With a probable culprit identified, it’s time to apply a fix.

Fixing the flake with a workaround

Lottie made a conscious decision to make LottieTask.EXECUTOR public and static for the benefit of its consumers and their tests, so the fastest way to fix our flaky behavior is to overwrite the value of LottieTask.EXECUTOR in your Paparazzi test, like so:

@Before
fun setup() {
  LottieTask.EXECUTOR = Executor(Runnable::run)
}

Fixing the underlying problem

Layoutlib needs to account for views which create Handlers from non-main threads. This map and the lists it contains need to be safe for concurrent access to correctly emulate what the real Android framework supports. We’ve filed an issue with the Android Studio team to address this problem. One solution would be to have each method synchronize on HandlerMessageQueue.runnablesMap. Doing so will protect both the map and its lists from concurrent access.

Here’s an example:

  public void add(Handler h, long uptimeMillis, Runnable r) {
    synchronized (runnablesMap) {
      LinkedList<Pair<Long, Runnable>> runnables = runnablesMap.computeIfAbsent(h, k -> new LinkedList<>());

      int idx = 0;
      while (idx < runnables.size()) {
        if (runnables.get(idx).first <= uptimeMillis) {
          idx++;
        } else {
          break;
        }
      }
      runnables.add(idx, Pair.create(uptimeMillis, r));
    }
  }

No more flakes

In future versions of Paparazzi, when Layoutlib is updated to a version with the proper fix, workarounds like the one above won’t be necessary. Until then, give this one a shot!

If you’d like to try out Paparazzi, get it on GitHub.