Proving that Android’s, Java’s and Python’s sorting algorithm is broken (and showing how to fix it)

Tim Peters developed the TimSort hybrid sorting algorithm in 2002. It is a clever combination of ideas from merge sort and insertion sort, designed to perform well on real-world data. TimSort was first developed for Python, but later ported to Java (where it appears as java.util.Collections.sort and java.util.Arrays.sort) by Joshua Bloch (the designer of Java Collections who also pointed out that most binary search algorithms were broken). TimSort is today used as the default sorting algorithm for Android SDK, Sun’s JDK and OpenJDK. Given the popularity of these platforms, this means that the number of computers, cloud services and mobile phones that use TimSort for sorting is well into the billions.

Fast forward to 2015. After we had successfully verified Counting and Radix sort implementations in Java (J. Autom. Reasoning 53(2), 129-139) with a formal verification tool called KeY, we were looking for a new challenge.  TimSort seemed to fit the bill, as it is rather complex and widely used. Unfortunately, we weren’t able to prove its correctness. A closer analysis showed that this was, quite simply, because TimSort was broken and our theoretical considerations finally led us to a path towards finding the bug (interestingly, that bug appears already in the Python implementation). This blog post shows how we did it.

The paper with the complete analysis and several test programs are available on our website.

Update (August 2017): The KIT group used KeY to verify the JDK’s dual pivot quicksort implementation used as default sorting algorithm for integer or long arrays. TimSort is the default for arrays of reference type.

Structure of this blog post

  1. The TimSort bug in Android, Java and Python
    1.1 Reproduce TimSort bug in Java
    1.2 How does TimSort work (in principle)?
    1.3 Walkthrough of TimSort bug
  2. Proving the (in)correctness of TimSort
    2.1 The verification system KeY
    2.2 The fix and its formal specification
    2.3 Analysing the output of KeY
  3. Suggested fixes to the Python and Android/Java Timsort bugs
    3.1 Incorrect Python merge_collapse function
    3.2 Corrected Python merge_collapse function
    3.3 Incorrect Java/Android merge_collapse function
    3.4 Corrected Java/Android merge_collapse function
  4. Conclusion – What can we learn?

1. The TimSort bug in Android, Java and Python

So what’s the bug? Why don’t you try to reproduce it yourself first?

1.1 Reproduce TimSort bug in Java

git clone https://github.com/abstools/java-timsort-bug.git
cd java-timsort-bug
javac *.java
java TestTimSort 67108864

Expected output

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 40
at java.util.TimSort.pushRun(TimSort.java:413)
at java.util.TimSort.sort(TimSort.java:240)
at java.util.Arrays.sort(Arrays.java:1438)
at TestTimSort.main(TestTimSort.java:18)

Video of walkthrough

1.2 How does TimSort work (in principle)?

TimSort is a hybrid sorting algorithm that uses insertion sort and merge sort.

The algorithm reorders the input array from left to right by finding consecutive (disjoint) sorted segments (called “runs” from here on). If a run is too short, it is extended using insertion sort. The lengths of the generated runs are added to an array named runLen. Whenever a new run is added to runLen, a method named mergeCollapse merges runs until the last 3 elements in runLen satisfy the following two conditions (the “invariant”):

  1. runLen[n-2] > runLen[n-1] + runLen[n]
  2. runLen[n-1] > runLen[n]

Here n is the index of the last run in runLen.  The intention is that checking this invariant on the top 3 runs in runLen in fact guarantees that all runs satisfy it. At the very end, all runs are merged, yielding a sorted version of the input array.

For performance reasons, it is crucial to allocate as little memory as possible for runLen, but still enough to store all the runs. If the invariant is satisfied by all runs, the length of each run grows exponentially (even faster than Fibonacci: the length of the current run must be strictly bigger than the sum of the next two runs’ lengths). Since runs do not overlap, only a small number of runs would then be needed to cover even very big input arrays completely.
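To get a feeling for how few runs are needed, the following Java sketch computes the largest possible number of runs when every run satisfies the invariant. It is only an illustration: the class RunStackBound, the method maxStackDepth and the minimum run length of 16 used in main are assumptions made for this example and do not appear in TimSort itself.

```java
final class RunStackBound {
    /**
     * Largest number of runs that fit into an array of the given length
     * if all runs satisfy the invariant and no run is shorter than minRun.
     * The smallest such runs are built from the last run backwards:
     * each earlier run is one longer than the sum of the next two.
     */
    static int maxStackDepth(long arrayLength, long minRun) {
        long top = minRun;        // shortest possible last run
        long below = minRun + 1;  // must be strictly bigger than the last run
        long total = top;
        if (total > arrayLength) {
            return 0;             // not even one run of minimum length fits
        }
        int depth = 1;
        while (total + below <= arrayLength) {
            total += below;
            depth++;
            long next = below + top + 1; // strictly bigger than the sum of the next two
            top = below;
            below = next;
        }
        return depth;
    }

    public static void main(String[] args) {
        // Assumed minimum run length of 16, largest possible Java array length.
        System.out.println(maxStackDepth(Integer.MAX_VALUE, 16));
    }
}
```

The depth grows only logarithmically with the array length, which is why a small, fixed-size runLen would suffice if the invariant really held for all runs.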

1.3 Walkthrough of TimSort bug

The code snippet below shows that the implementation of mergeCollapse checks the invariant for the last 3 runs in runLen.

private void mergeCollapse() {
  while (stackSize > 1) {
    int n = stackSize - 2;
    if (n > 0 && runLen[n-1] <= runLen[n] + runLen[n+1]) {
      if (runLen[n - 1] < runLen[n + 1])
         n--;
      mergeAt(n);
    } else if (runLen[n] <= runLen[n + 1]) {
      mergeAt(n);
    } else {
      break; // Invariant is established
    }
  }
}

Unfortunately, this is not sufficient to ensure that all runs satisfy the invariant. Suppose that runLen has the following content on entry to mergeCollapse:

120, 80, 25, 20, 30

In the first loop iteration, 25 and 20 are merged (since 25 < 20 + 30 and 25 < 30):

120, 80, 45, 30

In the second iteration (the stack now holds four runs, so n = 2 in the code), it is determined that the invariant is satisfied by the last 3 runs, since 80 > 45 + 30 and 45 > 30, thus mergeCollapse terminates. But mergeCollapse has not fully restored the invariant: it is broken at 120, since 120 < 80 + 45.
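The walkthrough can be replayed with a small Java sketch that tracks only the run lengths. The loop body is copied from mergeCollapse above; the class MergeCollapseDemo, the length-only mergeAt and the helper firstViolation are assumptions introduced for this illustration (the real mergeAt also merges the array contents).

```java
import java.util.Arrays;

final class MergeCollapseDemo {
    private final int[] runLen;
    private int stackSize;

    MergeCollapseDemo(int... lengths) {
        runLen = lengths.clone();
        stackSize = lengths.length;
    }

    // Length-only stand-in for mergeAt: runs i and i+1 become one run.
    private void mergeAt(int i) {
        runLen[i] += runLen[i + 1];
        if (i == stackSize - 3) {
            runLen[i + 1] = runLen[i + 2]; // slide the last run down
        }
        stackSize--;
    }

    // Same control flow as the original (broken) mergeCollapse.
    void mergeCollapse() {
        while (stackSize > 1) {
            int n = stackSize - 2;
            if (n > 0 && runLen[n - 1] <= runLen[n] + runLen[n + 1]) {
                if (runLen[n - 1] < runLen[n + 1])
                    n--;
                mergeAt(n);
            } else if (runLen[n] <= runLen[n + 1]) {
                mergeAt(n);
            } else {
                break;
            }
        }
    }

    // Index of the first run that breaks the invariant, or -1 if none does.
    int firstViolation() {
        for (int i = 0; i + 2 < stackSize; i++) {
            if (runLen[i] <= runLen[i + 1] + runLen[i + 2]) return i;
        }
        for (int i = 0; i + 1 < stackSize; i++) {
            if (runLen[i] <= runLen[i + 1]) return i;
        }
        return -1;
    }

    int[] runs() {
        return Arrays.copyOf(runLen, stackSize);
    }

    public static void main(String[] args) {
        MergeCollapseDemo demo = new MergeCollapseDemo(120, 80, 25, 20, 30);
        demo.mergeCollapse();
        System.out.println(Arrays.toString(demo.runs())); // [120, 80, 45, 30]
        System.out.println(demo.firstViolation());        // 0, the run of length 120
    }
}
```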

The test generator on our website exploits this problem. It generates an input array with many short runs – too short, in the sense that they do not satisfy the invariant – which eventually causes TimSort to crash. In particular, since the invariant is broken, the length of the runs can grow slower than expected, so more than runLen.length runs are needed to cover the entire input array, resulting in an ArrayIndexOutOfBoundsException on runLen.

2. Proving the (in)correctness of TimSort

We found out that the supposed invariant of mergeCollapse is broken during an attempt to formally verify TimSort. Luckily, we did not only find out that it is broken, but also how it can be fixed. In the end we even succeeded in verifying a corrected invariant that actually holds. But let’s take it step by step. First of all: what do we mean by formal verification and how is it done?

2.1 The verification system KeY

Due to many requests we have published a brief blog post about KeY.

KeY is a deductive verification platform for sequential Java and JavaCard applications. It allows one to prove the correctness of programs with respect to a given specification. Roughly speaking, a specification consists of a precondition, also called a requires clause, and a postcondition, also called an ensures clause. Specifications are attached to method implementations, such as mergeCollapse() above. The specification of a method is also called its contract.

In the case of a sorting program, the precondition might simply state that the input is a non-empty array and the postcondition that the returned array is a sorted permutation of the input. What KeY then typically proves is this: whenever the method under verification is called with an input that satisfies the precondition, then the method terminates normally and in the final state the postcondition is true. This is also known as total correctness, because termination is ensured. Obviously, OpenJDK’s java.util.Arrays.sort() does not adhere to this contract, because it terminates exceptionally for certain inputs.
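To make the contract concrete, here is a minimal Java sketch that checks it at run time for a single input. This is testing, not verification: KeY proves the property for all inputs, while the code below only inspects one. The class SortContractCheck and its methods are hypothetical names chosen for this example; an array of reference type is used so that Arrays.sort goes through TimSort.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class SortContractCheck {
    /** Postcondition, part 1: every element is <= its successor. */
    static boolean isSorted(Integer[] a) {
        for (int i = 0; i + 1 < a.length; i++) {
            if (a[i].compareTo(a[i + 1]) > 0) return false;
        }
        return true;
    }

    /** Postcondition, part 2: same elements with the same multiplicities. */
    static boolean isPermutation(Integer[] a, Integer[] b) {
        if (a.length != b.length) return false;
        Map<Integer, Integer> counts = new HashMap<>();
        for (Integer x : a) counts.merge(x, 1, Integer::sum);
        for (Integer x : b) {
            Integer c = counts.get(x);
            if (c == null) return false;
            if (c == 1) counts.remove(x); else counts.put(x, c - 1);
        }
        return counts.isEmpty();
    }

    /** True if sorting terminates normally and the postcondition holds. */
    static boolean satisfiesContract(Integer[] input) {
        if (input.length == 0) return true; // precondition not met, nothing to check
        Integer[] output = input.clone();
        try {
            Arrays.sort(output);
        } catch (RuntimeException e) {
            return false; // exceptional termination violates total correctness
        }
        return isSorted(output) && isPermutation(input, output);
    }

    public static void main(String[] args) {
        Integer[] sample = {5, 3, 9, 3, 1};
        System.out.println(satisfiesContract(sample)); // true
    }
}
```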

In addition, instance and class invariants are used to specify general constraints on the values of fields. Typical properties are concerned with data consistency or with boundary values:

/*@ private invariant
  @    runBase.length == runLen.length && runBase != runLen;
  @*/

This invariant says that the length of the arrays runBase and runLen must be equal and that both arrays must not point to the same array instance, i.e., they are not aliased. The semantics of invariants implies that each method must establish upon completion not only the postcondition of its contract, but also the invariant of its (“this”) object.

KeY uses the Java Modeling Language (JML) as its specification language. It contains pure Java expressions as a sublanguage and is therefore easy to learn for Java programmers. Its main extensions beyond Java are quantified expressions (\forall T x, \exists T x) and, of course, suitable keywords for contracts. JML specifications are attached to the Java declarations they belong to in .java files. Here is a simple example of a Java method, specified with JML:

/*@ private normal_behavior
  @ requires
  @   n >= MIN_MERGE;
  @ ensures
  @   \result >= MIN_MERGE/2;
  @*/

private static int /*@ pure @*/ minRunLength(int n) {
  assert n >= 0;
  int r = 0;      // Becomes 1 if any 1 bits are shifted off
  /*@ loop_invariant n >= MIN_MERGE/2 && r >=0 && r<=1;
    @ decreases n;
    @ assignable \nothing;
    @*/
  while (n >= MIN_MERGE) {
    r |= (n & 1);
    n >>= 1;
  }
  return n + r;
}

The contract of minRunLength() requires the caller to ensure that the method is only invoked with a value greater-or-equal to MIN_MERGE. In this case (and only in this) the method ensures that it will terminate normally (i.e., neither diverge nor throw an exception) and that the returned value is at least as big as MIN_MERGE/2. In addition, the method is marked as pure which implies that the method does not modify the heap.
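For comparison with a proof, the ensures clause can also be tested by brute force over a finite range. The following Java sketch does that; the class MinRunLengthCheck, the method firstCounterexample and the chosen limit are assumptions made for this illustration, and MIN_MERGE is set to 32, the value used in OpenJDK’s TimSort. Unlike KeY, such a test says nothing about inputs outside the range it tries.

```java
final class MinRunLengthCheck {
    static final int MIN_MERGE = 32;

    // Same body as minRunLength above, without the JML annotations.
    static int minRunLength(int n) {
        int r = 0; // becomes 1 if any 1 bits are shifted off
        while (n >= MIN_MERGE) {
            r |= (n & 1);
            n >>= 1;
        }
        return n + r;
    }

    /**
     * First n in [MIN_MERGE, limit] for which the ensures clause
     * \result >= MIN_MERGE/2 fails, or -1 if there is none.
     * The limit must be smaller than Integer.MAX_VALUE.
     */
    static int firstCounterexample(int limit) {
        for (int n = MIN_MERGE; n <= limit; n++) {
            if (minRunLength(n) < MIN_MERGE / 2) return n;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstCounterexample(1 << 20));
    }
}
```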

The crucial point is that KeY can statically prove such method contracts for any given input. How is this possible? KeY performs symbolic execution of the method under verification, that is, it executes it with symbolic values so that all possible execution paths are taken into account. But this is not enough, because symbolic execution of loops without a fixed bound (such as the one in mergeCollapse() where we don’t know the value of stackSize) will not terminate. To render symbolic execution of loops finite, invariant reasoning is used. For instance, method minRunLength() from above contains a loop which is specified using a loop invariant. The invariant ensures that after each loop iteration the condition n >= MIN_MERGE/2 && r >=0 && r<=1 holds and hence the method’s postcondition can be proven. The decreases annotation is used to prove termination of the loop by providing an expression whose value is non-negative and strictly decreasing. The assignable clause lists the heap locations that might possibly be modified by the loop. The keyword \nothing means that no heap locations are modified. Indeed: only the local variable r and the value argument n are changed.

In summary, for the purpose of formal verification, method contracts are not enough. It is necessary to supply a suitable loop invariant. It can be a tricky business to come up with an invariant that is strong enough to ensure the desired postcondition and that still holds. Without tool support and automated theorem proving technology it is hardly possible to come up with correct loop invariants for non-trivial programs. And in fact, it is exactly here that the designers of TimSort went wrong. The loop of mergeCollapse causes under certain circumstances the following part of TimSort’s class invariant to be violated (see section 1.3 Walkthrough of TimSort bug):

/*@ private invariant
  @   (\forall int i; 0<=i && i<stackSize-4;
  @                      runLen[i] > runLen[i+1] + runLen[i+2]);
  @*/

which states that runLen[i] must be greater than the sum of its two successor entries (for every index i from 0 (inclusive) to stackSize-4 (exclusive)). As the invariant is not restored later on either, method mergeCollapse does not preserve the class invariant. Hence, the loop invariant was not as strong as assumed by the developers. We found this out during our formal verification attempt, with the help of KeY. It is nearly impossible to do so without tool support.

Despite being very close to Java, JML is a full-fledged design-by-contract language, suitable for full functional verification of Java programs.

2.2  The fix and its formal specification

A simplified version of the contract that mergeCollapse is expected to satisfy is shown below.

/*@ requires
  @   stackSize > 0 &&
  @   runLen[stackSize-4] > runLen[stackSize-3]+runLen[stackSize-2]
  @   && runLen[stackSize-3] > runLen[stackSize-2];
  @ ensures
  @   (\forall int i; 0<=i && i<stackSize-2; 
  @                     runLen[i] > runLen[i+1] + runLen[i+2])
  @   && runLen[stackSize-2] > runLen[stackSize-1]
  @*/
private void mergeCollapse()

The two formulas in ensures imply that when mergeCollapse completes, then all runs satisfy the invariant given in section 1.2. We saw already that the above contract is not satisfied by the current implementation of mergeCollapse (in section 1.3), so we provide the following fixed version that respects the contract:

private void newMergeCollapse() {
  while (stackSize > 1) {
    int n = stackSize - 2;
    if (n > 0   && runLen[n-1] <= runLen[n] + runLen[n+1] || 
        n-1 > 0 && runLen[n-2] <= runLen[n] + runLen[n-1]) {
      if (runLen[n - 1] < runLen[n + 1])
        n--;
    } else if (n<0 || runLen[n] > runLen[n + 1]) {
      break; // Invariant is established
    }
    mergeAt(n);
  }
}

The main idea of this new version is to check that the invariant holds for the last 4 runs in runLen, instead of only the last 3.  We will see that this suffices to ensure that all runs satisfy the invariant upon completion of mergeCollapse.

The first step in proving the contract for our fixed version of mergeCollapse is to find a suitable loop invariant. The code snippet below shows a simplified version of the loop invariant.

/*@ loop_invariant
  @  (\forall int i; 0<=i && i<stackSize-4;
  @             runLen[i] > runLen[i+1] + runLen[i+2])
  @  && runLen[stackSize-4] > runLen[stackSize-3];
  @*/

Intuitively this loop invariant states that all runs, except possibly the last 4, satisfy the invariant.  Combining this with the observation that the new loop of mergeCollapse terminates (with the break statement) only if the last 4 runs also satisfy it, this guarantees that all runs satisfy the invariant.
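The claim can be sanity-checked by random testing before attempting a proof. The Java sketch below pushes random run lengths onto a length-only stack, calls the fixed loop after every push, and counts how often the invariant fails for any run. The class FixedMergeCollapseCheck, the length-only mergeAt, the stack capacity of 64, the seed and the run-length range are all assumptions made for this illustration; a test like this can only fail to find a counterexample, it cannot replace the proof.

```java
import java.util.Random;

final class FixedMergeCollapseCheck {
    private final int[] runLen = new int[64]; // generous for this small experiment
    private int stackSize = 0;

    // Length-only stand-in for mergeAt: runs i and i+1 become one run.
    private void mergeAt(int i) {
        runLen[i] += runLen[i + 1];
        if (i == stackSize - 3) {
            runLen[i + 1] = runLen[i + 2];
        }
        stackSize--;
    }

    void pushRun(int len) {
        runLen[stackSize++] = len;
    }

    // Same logic as the fixed version: inspects the last 4 runs.
    void newMergeCollapse() {
        while (stackSize > 1) {
            int n = stackSize - 2;
            if ((n > 0 && runLen[n - 1] <= runLen[n] + runLen[n + 1])
                    || (n > 1 && runLen[n - 2] <= runLen[n] + runLen[n - 1])) {
                if (runLen[n - 1] < runLen[n + 1])
                    n--;
            } else if (runLen[n] > runLen[n + 1]) {
                break; // invariant is established
            }
            mergeAt(n);
        }
    }

    // The invariant from section 1.2, checked for every run on the stack.
    boolean invariantHoldsForAllRuns() {
        for (int i = 0; i + 2 < stackSize; i++) {
            if (runLen[i] <= runLen[i + 1] + runLen[i + 2]) return false;
        }
        return stackSize < 2 || runLen[stackSize - 2] > runLen[stackSize - 1];
    }

    // Number of pushes after which some run violated the invariant.
    static int countViolations(long seed, int pushes) {
        Random random = new Random(seed);
        FixedMergeCollapseCheck stack = new FixedMergeCollapseCheck();
        int violations = 0;
        for (int k = 0; k < pushes; k++) {
            stack.pushRun(1 + random.nextInt(100));
            stack.newMergeCollapse();
            if (!stack.invariantHoldsForAllRuns()) violations++;
        }
        return violations;
    }

    public static void main(String[] args) {
        System.out.println(countViolations(42L, 10_000));
    }
}
```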

2.3 Analysing the output of KeY

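When the fixed version of mergeCollapse, its contract and the loop invariant are given as input to KeY, the system symbolically executes the loop and generates verification conditions: formulas whose truth implies that the mergeCollapse contract is satisfied. The following formula (simplified) shows the main proof obligation generated by KeY. Its premises are the loop invariant and, between brackets, the three conditions under which the break statement is reached (with n = stackSize-2); its conclusion is the postcondition of mergeCollapse:

   (\forall int i; 0<=i && i<stackSize-4; runLen[i] > runLen[i+1] + runLen[i+2])
&& runLen[stackSize-4] > runLen[stackSize-3]
&& [    (n>0  ==> runLen[n-1] > runLen[n] + runLen[n+1])
     && (n>1  ==> runLen[n-2] > runLen[n-1] + runLen[n])
     && (n>=0 ==> runLen[n] > runLen[n+1]) ]
==>
   (\forall int i; 0<=i && i<stackSize-2; runLen[i] > runLen[i+1] + runLen[i+2])
&& runLen[stackSize-2] > runLen[stackSize-1]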

This verification condition is generated to make sure that the postcondition of mergeCollapse  is satisfied when the loop terminates.  This explains the three formulas between the brackets: the break statement that terminates the loop is only executed if these are true. We proved this formula (and all other verification conditions) formally with KeY in a semi-automated manner. Here we sketch this proof:

Proof. The formula runLen[stackSize-2] > runLen[stackSize-1]  from the mergeCollapse postcondition follows directly from  n >= 0 ==> runLen[n] > runLen[n+1].

We prove the other formula,

\forall int i; 0<=i && i<stackSize-2; runLen[i] > runLen[i+1] + runLen[i+2],

by case distinction on the value of i:

  • i < stackSize-4: follows from the loop invariant
  • i = stackSize-4: follows from n>1 ==> runLen[n-2] > runLen[n-1] + runLen[n]
  • i = stackSize-3: from n>0 ==> runLen[n-1] > runLen[n] + runLen[n+1]
  • i = stackSize-2: from n>=0 ==> runLen[n] > runLen[n+1]

The above proof shows that the new version of mergeCollapse terminates only when all runs satisfy the invariant.

3. Suggested fixes to the Python and Android/Java Timsort bugs

Our analysis of the bug (which included the fix for mergeCollapse) was submitted, reviewed and accepted in the Java bug tracker  https://bugs.openjdk.java.net/browse/JDK-8072909.

The bug is present in at least the Android version of Java, OpenJDK and OracleJDK: all of these share the same source code for TimSort.  Furthermore it is also present in Python. The next two sections show (side-by-side) the original and fixed versions.

As explained in the previous section, the idea behind the fix is very simple: check that the invariant holds for the last 4 runs in runLen, instead of only the last 3.

3.1 Incorrect Python merge_collapse function

TimSort for Python (written in C with the Python API) is available in the subversion repository. The algorithm is also described in http://svn.python.org/projects/python/trunk/Objects/listsort.txt

The Java version of TimSort was ported from the original CPython version. That version also contains the bug and was intended to work for arrays with up to 2^64 elements. However, on current machines it is not possible to trigger an out-of-bounds error in the Python version: it allocates 85 elements for runLen, which suffices (following our analysis in the full paper) for arrays with fewer than 2^49 elements. For comparison, the currently most powerful supercomputer, Tianhe-2 (http://en.wikipedia.org/wiki/Tianhe-2), has about 2^50 bytes of memory in total.

/* The maximum number of entries in a MergeState's 
 * pending-runs stack.
 * This is enough to sort arrays of size up to about
 *     32 * phi ** MAX_MERGE_PENDING
 * where phi ~= 1.618.  85 is ridiculously large enough, 
 * good for an array with 2**64 elements.
 */
#define MAX_MERGE_PENDING 85

static int
merge_collapse(MergeState *ms)
{
    struct s_slice *p = ms->pending;

    assert(ms);
    while (ms->n > 1) {
        Py_ssize_t n = ms->n - 2;
        /* Only the last 3 runs are checked here. */
        if (n > 0 && p[n-1].len <= p[n].len + p[n+1].len) {
            if (p[n-1].len < p[n+1].len)
                --n;
            if (merge_at(ms, n) < 0)
                return -1;
        }
        else if (p[n].len <= p[n+1].len) {
            if (merge_at(ms, n) < 0)
                return -1;
        }
        else
            break;
    }
    return 0;
}

3.2 Corrected Python merge_collapse function

static int
merge_collapse(MergeState *ms)
{
    struct s_slice *p = ms->pending;

    assert(ms);
    while (ms->n > 1) {
        Py_ssize_t n = ms->n - 2;
        /* The second test extends the check to the last 4 runs. */
        if (   (n > 0 && p[n-1].len <= p[n].len + p[n+1].len)
            || (n > 1 && p[n-2].len <= p[n].len + p[n-1].len)) {
            if (p[n-1].len < p[n+1].len)
                --n;
            if (merge_at(ms, n) < 0)
                return -1;
        }
        else if (p[n].len <= p[n+1].len) {
            if (merge_at(ms, n) < 0)
                return -1;
        }
        else
            break;
    }
    return 0;
}

3.3 Incorrect Java/Android merge_collapse function

Same bug as for Python in section 3.1

    private void mergeCollapse() {
        while (stackSize > 1) {
            int n = stackSize - 2;
            if (n > 0 && runLen[n-1] <= runLen[n] + runLen[n+1]) {
                if (runLen[n - 1] < runLen[n + 1])
                    n--;
                mergeAt(n);
            } else if (runLen[n] <= runLen[n + 1]) {
                mergeAt(n);
            } else {
                break; // Invariant is established
            }
        }
    }

3.4 Corrected Java/Android merge_collapse function

Equivalent fix as for Python in section 3.2
[UPDATE 26/2: we updated the code below as it was from an earlier version of the paper. The old code was equivalent but contained a redundant test and different coding style. Several people noticed – thanks for the feedback!]

    private void newMergeCollapse() {
        while (stackSize > 1) {
            int n = stackSize - 2;
            if (   (n >= 1 && runLen[n-1] <= runLen[n] + runLen[n+1])
                || (n >= 2 && runLen[n-2] <= runLen[n] + runLen[n-1])) {
                if (runLen[n - 1] < runLen[n + 1])
                    n--;
            } else if (runLen[n] > runLen[n + 1]) {
                break; // Invariant is established
            }
            mergeAt(n);
        }
    }

4. Conclusion – What can we learn?

While attempting to verify TimSort, we failed to establish its instance invariant. Analysing the reason, we discovered a bug in TimSort’s implementation leading to an ArrayIndexOutOfBoundsException for certain inputs. We suggested a proper fix for the culprit method (without losing measurable performance) and we have formally proven that the fix actually is correct and that this bug no longer persists.

There are a few observations that can be drawn from this exercise beyond the immediate issue of the bug.

  1. Formal methods are often classified as irrelevant and/or impracticable by practitioners. This is not true: we found and fixed a bug in a piece of software that is used by billions of users every single day. Finding and fixing this bug without a formal analysis and the help of a verification tool is next to impossible, as our analysis showed. It has been around for years in a core library routine of Java and Python. Earlier occurrences of the underlying bug were supposedly fixed, but actually only made its occurrence less likely.
  2. Even though the bug itself is unlikely to occur, it is easy to see how it could be used in an attack. It is likely that more undetected bugs are in other parts of core libraries of mainstream programming languages. Shouldn’t we try to find them before they can do harm or they can be exploited?
  3. The reaction of the Java developer community to our report is somewhat disappointing: instead of using our fixed (and verified!) version of mergeCollapse(), they opted to increase the allocated runLen “sufficiently”. As we showed, this is not necessary. In consequence, whoever uses java.util.Collections.sort() is forced to over-allocate space. Given the astronomical number of program runs that such a central routine is used in, this leads to a considerable waste of energy. As to the reasons why our solution has not been adopted, we can only speculate: perhaps the JDK maintainers did not bother to read our report in detail, and therefore don’t trust and understand our fix. After all, OpenJDK is a community effort, largely driven by volunteers with limited time.

What can we learn from this? We would be happy if our work could be the starting point of a closer collaboration between the formal methods community and the developers of open language frameworks. Formal methods have already been adopted successfully by Amazon and Facebook. Modern formal specification languages and formal verification tools are not cryptic and super-hard to learn. Usability and automation are improving constantly. But we need more people to try, test and use our formal tools. Yes, it costs a little effort to start formally specifying and verifying stuff, but not more than, say, learning how to use a compiler framework or a build tool. We are talking days/weeks, not months/years. Will you take up the challenge?


Best regards,
Stijn de Gouw, Jurriaan Rot, Frank S. de Boer, Richard Bubel and Reiner Hähnle

Acknowledgements:

Partly funded by the EU project FP7-610582 ENVISAGE: Engineering Virtualized Services (http://www.envisage-project.eu).

This blog would never have gotten written without the enthusiastic support and gentle pushing of Amund Tveit! We would further like to thank Behrooz Nobakht for providing the video showcasing the bug.

