Instrumenting Quicksort

I’d like to begin the new year here in The Journeyman’s Shop with an examination of something fairly old, by computing standards, that you’re probably familiar with: the quicksort algorithm. It was first published by C.A.R. Hoare in 1962, and since then it has become one of the most widely used, most widely analyzed, and perhaps most widely misunderstood techniques for sorting data held in memory. As usual, you can find more information than you’ll ever need in Donald Knuth’s "The Art of Computer Programming," vol. 3, "Sorting and Searching."

The quicksort algorithm has undergone some significant improvements since Hoare’s initial version. These improvements have been driven by the need to avoid the quadratic time complexity and linear space requirement that quicksort can fall into. As the conditions that lead to these problems have been recognized, the details of the algorithm have been changed to avoid them. Thus, time and space requirements have been improved in the worst cases, usually at the cost of somewhat slower execution in the ordinary cases. In this column we’ll look at these improvements to quicksort, with the goal of better understanding how the time and space problems arise in quicksort, how these problems are solved, and what effect solving these problems has on performance. Although you probably won’t ever have to write an efficient version of quicksort, the techniques that we use in analyzing quicksort ought to be useful to you in tuning your own code.

The Problem

Given two iterators first and last that define a half-open range [first,last) and an operator < that can be used to compare two objects of the type that first and last point to, our task is to write an algorithm that arranges the elements in that range in increasing order. More formally, after we have applied the algorithm, it must be true that for every value of j in the half-open range [0,last - first)¹, !(*(first + j + 1) < *(first + j)².

The Preliminaries

In order to make sure that we don’t accidentally use an operation that isn’t necessarily available for the items that we’ll be sorting, we need to write a simple class that provides only the basic operations that are essential to our algorithm. We need to be able to compare two items with the operator <, and we need to be able to copy items. We also need to be able to create items that have distinct values, and it will be useful to be able to display the values of these items. That leads directly to the following class definition:


class item
{
public:
    // create:
    explicit item(int v) : val(v) {}
    // copy:
    item(const item& other) : val(other.val) {}
    item& operator = (const iterm& other)
        { val = other.val; return *this; }
    // compare:
    bool operator < (const item& other) const
        { return val < other.val; }
    // display:
    friend std::ostream& operator <<
        (std::ostream& os, const item& it)
        { return os << it.val; }
private:
    int val;
};

The Basic Quicksort Algorithm

The quicksort algorithm uses the divide and conquer approach: the range to be sorted is divided into two parts, such that each of the elements in the first part is less than each of the elements in the second part. The algorithm is then re-applied to each of the two parts. As long as we make sure that neither part is ever empty, each time that we partition a range we’ll end up with two parts that each contain fewer elements than the range that we just partitioned. Sooner or later we’ll get down to parts that contain only one element, and those parts don’t need any further work: they’re already in the correct order. So we have the two essential parts of a recursive algorithm: a recursion process that breaks large pieces into smaller ones, and a terminating condition so that we know when to stop the recursion.

The implementation of any algorithm begins by defining how it is invoked. In the case of quicksort, since we’re sorting the elements in a range, we need to specify the beginning and the end of the range to be sorted. And since we’re writing an algorithm that can handle almost arbitrary types we need to write a template:


template<class RanIt>
void sort(RanIt first, RanIt last);

I’ve chosen the name RanIt for the type of the iterator to suggest that the iterator should be a random access iterator. Some of the operations that we perform on these iterators are supported only by random access iterators.

As with most recursive algorithms, the first thing that the code should do is check whether the terminating condition is true. The terminating condition is that the range contains only one element. However, it costs nothing to also test for zero elements, which makes the algorithm safe when called with an empty range:


if (last - first <= 1)
    return;

Having survived that test, our code next needs to begin the partitioning process. In its simplest form, the quicksort algorithm uses the first element in the range as a pivot value, and separates the rest of the range into one piece that contains elements that are less than the pivot value and one piece that contains elements that are greater than or equal to the pivot value. The usual technique is to build up one piece starting from the left end of the range and to build up the other piece starting from the right end of the range. As we shuffle elements around the left-hand piece grows to the right, and the right-hand piece grows to the left. In order to talk about these two new ranges more easily we’ll introduce four new iterator variables, one pair for each of the two pieces that we’re creating, plus an iterator that points to the pivot element:


RanIt pivot = first;    // pivot element
RanIt first1, last1;    // left-hand piece
RanIt first2, last2;    // right-hand piece

Since we haven’t yet made any comparisons of elements, each of the two subranges must be empty. The left-hand subrange begins with the element immediately after the pivot element, so its iterators both get initialized to point to that element. The right-hand subrange ends at the end of the input range, so its iterators both get initialized with the value of the iterator for the end of the range:


first1 = last1 = first + 1;
first2 = last2 = last;

Now we’re ready to expand both pieces, maintaining the two invariants that all of the elements in the half-open range [first1,last1) are less than the pivot element and that all of the elements in the half-open range [first2,last2) are greater than or equal to the pivot element. Since both ranges are initially empty, these two invariants are both true when the algorithm begins. We start by examining the element just past the end of the first range. If its value is less than the value of the pivot element then it belongs to that range, and we can move the top end of that range up to include it, then go back and try the next element:


while (last1 < first2 && *last1 < *pivot)
    ++last1;

When this loop exits, last1 points to the beginning of the upper range, in which case we won’t find any more elements that belong in the lower range, or to an element whose value is greater than or equal to the value of the pivot element.

Similarly, we expand the upper range downward by looking at the element immediately below the lower end of the range, and adding it to the range if it is greater than or equal to the value of the pivot element:


while (last1 <= first2 - 1 && *pivot < *(first2 - 1))
    --first2;

When this loop exits, first2 points to the upper end of the lower range or to an element whose value is less than the value of the pivot element.

After executing both of these loops, if the two iterators haven’t met in the middle they point to elements that don’t belong in the subrange where they were found. We exchange these two elements, and resume the comparisons. If the iterators met then we’ve partitioned the input range into the two desired subranges, and its time to use quicksort on the two subranges. The initial version of quicksort, along with a driver program, is in Listing 1.

Performance

I mentioned earlier that the quicksort algorithm can fall into quadratic time complexity and linear space usage. In order to be able to investigate these performance problems I’ve added two variables to the implementation of the algorithm. The variable maxdepth keeps track of the deepest nesting of the recursive calls to sort. The variable ops keeps rough track of the number of operations that have been performed. With these two variables we can perform simple experiments to investigate the performance of the algorithm. We can change the value of the constant ELTS to change the number of elements that we’re sorting, and we can vary the patterns in the data that we sort. As we’ll see, there are some data patterns that lead to very bad performance with this version of the algorithm.

You probably know that the time to sort a range of values using quicksort should be proportional to NlogN, where N is the number of elements being sorted. To see this happen, begin by running the program in Listing 1 as is. It sorts 100,000 values, initially in random order. Then change the value of ELTS to 1000000 and run it again, and finally, change the value of ELTS to 10000000 and run it again. For one hundred thousand elements, NlogN is 500,000. For a million elements it’s 6,000,000, and for ten million it’s 70,000,000. So running this sort implementation with one million elements ought to require about twelve times as many operations as running it with one hundred thousand elements, and running it with ten million elements ought to require about 140 times as many. With the compiler I used I got the following values³:

Elements	Operations	Predicted Operations
100,000	2,139,243	---
1,000,000	25,146,540	25,670,916
10,000,000	304,778,130	299,494,020

As you can see, the measured values are within about two percent of the predicted values.

This version of quicksort falls apart when the data that it is given is already in order. To see this happen, comment out the call to std::random_shuffle in main and run the program again. You might want to reduce the value of ELTS before you do this: with ELTS set to 1,000 I got 500,499 operations. With ELTS set to 10,000, I got 50,004,999 operations. A little math shows that increasing the number of elements by a factor of ten increased the number of operations by a factor of one hundred. For data that is already in order, the time complexity of this version of quicksort is O(n^2).

There’s another problem with this version of quicksort: look at the call depth when it processes a range of sorted data. With 1,000 elements it reaches a depth of 998 nested calls; with 10,000 elements it reaches a depth of 9,998. Since each nested call requires a fixed amount of space for its local variables, the space usage of this version of quicksort is linear in the number of elements being sorted when it is applied to data that is already in order. That’s far too much space. In fact, when I tried to sort 100,000 elements that were already in order I got a stack overflow. A sort implementation that crashes when you give it 100,000 elements isn’t very useful.

Fixing the Space Problem

We can fix the space problem by eliminating some of the recursive calls. Notice that the last statement in the sort function is a recursive call to sort itself. This is known as "tail recursion." Tail recursion can usually be eliminated by writing a loop instead of the final call. Listing 2 has a new version of the sort template with the tail recursion removed. Replace the sort implementation in Listing 1 with this version, then repeat the tests that we just did to see what effect this has on memory usage. It may seem a bit strange to see a maximum depth of 0, but that’s just a side effect of the already-sorted input. Put the std::random_shuffle back in and you’ll see that we still get about the same stack usage as before for random data.

We’re not finished yet, though. Replace the call to std::random_shuffle with a call to std::reverse, so that the incoming data is sorted in reverse order. For this data the stack usage goes way up, although not as far as it did for already sorted data before we eliminated the tail recursion.

The reason that the stack usage was so high in the first place is that when we split the input range into two pieces, the algorithm chose the first value as the pivot value. When the data is already in order, this means that the pivot value is the lowest value in the range, so the first piece consists of just the pivot value, and the second piece consists of all the values except for the pivot. That is, each time we split the input range we create a subrange with one element and a subrange with N - 1 elements. Although we do make progress with each of these partitions, we don’t make much progress. We ended up calling sort recursively once for each element in the input sequence. Eliminating the tail recursion fixes this space problem when the input range is already sorted, but doesn’t help much when the input range is sorted in reverse order. In order to get rid of the problem here, too, we need to make the code that eliminates the tail recursion just a little smarter. If we call sort for the smaller of the two pieces, and loop on the larger, we’ll reduce stack usage significantly. This version of the algorithm is shown in Listing 3.

Fixing the Time Problem

None of these changes affected the number of operations needed to sort any particular input sequence. We’ve avoided stack overflows, but we haven’t increased the speed of the sort. For a sorted input sequence and for a reverse sorted input sequence the time complexity is still O(n^2). That’s because in each of these cases we’re choosing the worst possible pivot. Ideally we’d like to have a pivot value that will end up near the middle of the sorted sequence, so that the two subranges are about the same size. The usual solution is to look at several values and to pick the middle value of those examined. Most often this is done with three values, and this approach is known as "median of three."

The code in Listing 4 implements this approach. It uses a bubble sort to sort the elements pointed to by mid, first, and last - 1. The result is that the median of the three values pointed to by first, mid, and last - 1 ends up at the beginning of the range being sorted, so that’s the value that will be used as the pivot. If you rerun the previous tests with this version of the algorithm you’ll see that the number of operations needed to sort a range of already sorted values drops significantly⁴. For sorted and reverse-sorted data I get 119,535 operations to sort a range with 10,000 elements, and 1,534,481 operations to sort a range with 100,000 elements, giving a ratio of about 12.8. That’s what we’d expect if the time complexity is O(NlogN). So we’ve eliminated the quadratic behavior for sorted data, without affecting the time complexity for random data.

Next Time

More recent investigations of the quicksort algorithm have focused on another issue: how to make the algorithm run quickly for equal data. If I generate 10,000 equal elements and sort them the latest version of the algorithm performs 125,264 operations. It ought to be possible to recognize that all the elements are equal with only 9,999 operations. If we can modify the quicksort algorithm to deal quickly with equal data elements, without slowing it down too much for random data, we’ll have yet another significant improvement in performance. Next month we’ll look at Sedgwick’s version of the quicksort algorithm, which is probably the most commonly used technique today, and we’ll look at the approach that Dinkumware now uses in our libraries, which I have named the Inside Out sort.