Random testing has proven to be an effective way to catch bugs in concurrent and distributed
systems. This is often surprising, as the space of executions is enormous and conventional formal
methods intuition would suggest that bad behaviors would only be found by extremely
unlikely coincidences.
Empirically, many bugs in distributed systems can be explained by interactions among
only a small number of features. Thus, one can attempt to explain the effectiveness of
random testing under various “small depth” hypotheses. In particular, it may be possible to
test all interactions of k features for a small constant k by executing a family of tests that is
exponentially or even doubly-exponentially smaller than the family of all tests. Moreover,
under certain conditions, a randomly chosen small set of tests is sufficient to cover all k-wise
interactions with high probability.
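To make the union-bound calculation behind this claim concrete, here is a small back-of-the-envelope sketch in Python; the parameters (50 binary features, interaction size k = 3, and the assumption that a random test covers any fixed interaction with probability 1/2^k) are hypothetical.

    import math

    def tests_needed(num_interactions, cover_prob, failure_prob=0.01):
        # Union bound: m random tests all miss a fixed interaction with
        # probability (1 - p)^m; require N * (1 - p)^m <= failure_prob.
        return math.ceil(math.log(num_interactions / failure_prob)
                         / -math.log(1.0 - cover_prob))

    n, k = 50, 3                         # assumed: 50 binary features, depth 3
    num_interactions = math.comb(n, k)   # C(50, 3) = 19600 triple interactions
    p = 0.5 ** k                         # a random test sets all 3 features "right" w.p. 1/2 each
    print(tests_needed(num_interactions, p))   # about 109 tests, versus 2^50 exhaustive ones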
I will describe two concrete scenarios. First, I will discuss bugs in distributed systems
caused by network partition faults. In many practical instances, these bugs occur because two
or three key nodes, such as leaders or replicas, are unable to communicate, or because
the leader node finds itself in a block of the partition without a quorum. In this case, I will
show, using the probabilistic method, that a small set of randomly chosen tests will cover all
“small partition” scenarios with high probability.
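As a toy illustration of this argument, the following Python sketch samples random two-block partitions of a hypothetical ten-node cluster and checks that every pair of nodes is separated by some test; each test separates a fixed pair with probability 1/2, so a union bound over the 45 pairs shows that a dozen tests suffice with high probability. The cluster size and the two-block fault model are illustrative assumptions.

    import itertools, random

    NODES = list(range(10))            # hypothetical 10-node cluster

    def random_partition(nodes, rng):
        # Each node joins block 0 or block 1 independently: a random 2-way partition.
        return {v: rng.randrange(2) for v in nodes}

    def covers_all_pairs(tests, nodes):
        # Is every "a and b cannot communicate" scenario hit by some test?
        return all(any(t[a] != t[b] for t in tests)
                   for a, b in itertools.combinations(nodes, 2))

    rng = random.Random(0)
    tests = [random_partition(NODES, rng) for _ in range(12)]
    # A fixed pair survives all 12 tests with probability 2**-12; a union
    # bound over the 45 pairs gives failure probability about 1.1%.
    print(covers_all_pairs(tests, NODES))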
Second, I will consider bugs that arise due to unexpected schedules (interleavings) of
concurrent events. Again, many bugs depend only on the relative ordering of a small number
of events (the “bug depth”). In this case, I will show a testing algorithm that
prioritizes low-depth interleavings, and a randomized testing algorithm that guarantees a
lower bound on the probability of sampling any behavior of bug depth k, for a fixed k. The testing algorithm
is based on combinatorial insights from the theory of partial orders, such as the notion
of dimension and its generalization to d-hitting families, as well as results on online chain
partitioning.
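One well-known randomized algorithm with a guarantee of this form is the PCT scheduler of Burckhardt et al.; the Python sketch below follows its priority-based scheme under a simplified event model (threads as counters of abstract steps), and is meant as an illustration rather than as the exact algorithm of the talk.

    import random

    def pct_interleaving(num_threads, steps_per_thread, depth, rng):
        # PCT-style scheduler sketch: distinct random initial priorities, plus
        # depth-1 random change points that demote the running thread to a
        # fresh lowest priority. Any fixed depth-d ordering is sampled with
        # probability >= 1/(n * k**(d-1)) for n threads and k total steps.
        total = num_threads * steps_per_thread
        order = list(range(num_threads))
        rng.shuffle(order)
        prio = {t: depth + rank for rank, t in enumerate(order)}  # above demotion levels
        change_points = set(rng.sample(range(1, total), depth - 1))
        remaining = {t: steps_per_thread for t in range(num_threads)}
        next_low = depth - 1          # demotion levels depth-1, depth-2, ..., 1
        schedule = []
        for step in range(1, total + 1):
            t = max((u for u in remaining if remaining[u] > 0),
                    key=lambda u: prio[u])        # run highest-priority enabled thread
            schedule.append(t)
            remaining[t] -= 1
            if step in change_points:             # priority change point
                prio[t] = next_low
                next_low -= 1
        return schedule

    print(pct_interleaving(num_threads=3, steps_per_thread=4, depth=2,
                           rng=random.Random(0)))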