Radix Sort Revisited

 

 

Pierre Terdiman

Last revision: 04.01.2000

 

 

In every decent programmer's toolbox lies a strange weapon called a Radix Sort. Where does it come from? Who invented it? I don't know. As far as I can remember it was there, fast, easy, effective. Really effective. So unbelievably useful I've never really understood why people would want to use something else. The reasons? Most of the time, they tell me about floats, negative values, and why their new quick-sort code rocks.

 

Enough, I'm tired. Although the standard Radix Sort doesn't work very well with floating-point values, this is actually very easy to fix. In this little article I will review the standard Radix Sort algorithm, and enhance it so that:

 

-         it sorts negative floats as well

-         it has reduced complexity for bytes and words

-         it uses temporal coherence

-         it supports sorting on multiple keys

 

 

 

Motivation

 

Is it worth writing anything in 2000 about a sort routine? Doesn't everyone already have one? Aren't those things already well known? Everyone knows how to sort negative floats with a Radix, don't you think?

 

Well, that's what I would've said some weeks ago. But I recently wandered onto Ming C. Lin's homepage [2], and began to read her course notes. Here's what I found:

 

« Counting sort and radix sort are good for integers. For floating point numbers, try bucket sort or other comparison-based methods » (21 sept. 1999) [3]

 

As you can see, this is not very old… I read more of the course notes, read some older papers about collision detection, and it appeared that my own way of dealing with a specific part of this problem was (at least from a theoretical point of view) faster than the official one. Hence, I think this article will be useful for beginners as well as for experienced programmers.

 

 

 

Review of the Radix Sort

 

A Radix Sort is an apparently bizarre sort routine which manages to sort values without actually performing any comparisons on the input data. That's why this sort routine breaks the theoretical O(N*log N) lower bound, which only applies to comparison-based sorts. Radix is O(k*N), with k = 4 most of the time, and although this is not an in-place sort (i.e. it uses extra storage) it is so much faster than other sorting methods that it has become a very popular way of sorting data.

 

 

The algorithm

 

What is a radix, anyway?

 

Roughly, we can say a radix is a position in a number. In the decimal system, a radix is just a digit of a decimal number: for example the number « 42 » has two digits, or two radices, which are 4 and 2. In the same way, the hexadecimal number 0xAB has two hexadecimal radices, A and B. The Radix Sort gets its name from those radices, because the method first sorts the input values according to their first radix, then according to the second one, and so on. The Radix Sort is thus a multipass sort, and the number of passes equals the number of radices in the input values. In this article the radix will be 8 bits wide, i.e. a full byte, so you'll need 4 passes to sort standard 32-bit integers. By the way, that's why this flavour of the Radix Sort is often called a Byte Sort.

 

How does it work? Say you want to sort some bytes, for example these:

 

54, 18, 2, 128, 3

 

The idea behind the Radix Sort is to read input values and immediately store them at the right place.

 

Have a look at that sample code:

 

   unsigned char InputValues[] = { 54, 18, 2, 128, 3 };

   int SortedBuffer[256];

   memset(SortedBuffer, -1, 256*sizeof(int));   // Fill with -1

 

   for(int i=0; i<5; i++){

     unsigned char c = InputValues[i];

     SortedBuffer[c] = c;

   }

   // Now you can read SortedBuffer and get values back in sorted order

 

 

You may think this example is stupid – it is! – but it nicely introduces the ideas we'll have to deal with. What do we need to change in that code, in order for it to become useful? First we need to get rid of the empty destination locations, so that the destination buffer (called SortedBuffer in our example) has the same size as the input buffer (i.e. enough for 5 values, no more). We also need a way to handle collisions – collisions in the hash-table sense of the word, i.e. we must be able to deal with two equal input values and know how to store both of them in the final buffer. Luckily enough, both problems are solved by the same solution: an offset table.

 

The offset table is a 256-entry table telling us, for each possible input byte, where we should store the result. It is usually built in two passes: one to compute the distribution of bytes in the input flow (i.e. histograms, or counters), another one to create the offset table according to this distribution.

 

This sample code creates the counters:

 

   int Counters[256];

   memset(Counters, 0, 256*sizeof(int));  // Set all counters to 0

   for(i=0; i<NbItems; i++){             // Loop over the input array

        unsigned char c = InputValues[i]; // Get current byte…

        Counters[c]++;                    // …and update counter

   }

 

We can now create the offset table:

 

   int OffsetTable[256];
   OffsetTable[0] = 0;

   for(i=1;i<256;i++){

        OffsetTable[i] = OffsetTable[i-1] + Counters[i-1];

   }

 

Now, for each input byte, we can get the right offset back thanks to this table, put the input byte at the right place in the destination buffer, increase the offset, and repeat the sequence for the next byte.

 

For example:

 

   unsigned char c = InputValues[i];

   DestinationBuffer[OffsetTable[c]++] = c;

 

The destination buffer won't have any empty locations: we have enough room in it for the same number of values as in the input buffer, no more. Thanks to the offsets we know exactly where we must store them, and wherever an empty location would have existed in our first example, none appears here because no offset actually maps to that empty location.

 

Collisions are not a problem either, because offsets are increased each time we use one of them. If we have two identical bytes in the input, the first one will be put in DestinationBuffer[Offset], Offset will be increased, and the next one will then fall in DestinationBuffer[Offset+1]. The Radix Sort is stable, by the way, which means two equal input values will appear in the same relative order in the output.

 

This last point is very important to understand the next step: how can we extend the process in order to sort not only bytes but also words, dwords, or even character strings? This is actually very simple: we just have to sort again according to the next radix. Recall that a radix is actually a byte of the values to sort. The first radix is the LSB (Least Significant Byte). The last one is the MSB (Most Significant Byte). Between those two, we perform all the necessary passes. The second pass doesn't destroy what has been done in the first one because the sorting method is stable.

 

Example: say we want to sort these hexadecimal values (for the sake of this small example, the radix is a single hexadecimal digit rather than a byte):

 

0xBC, 0xAB, 0xBA, 0xAC, 0xBB, 0xAA

 

The first pass leads to:

0xBA

0xAA

0xAB

0xBB

0xBC

0xAC

 

Note that the list is sorted according to the last column (i.e. the LSB, which is the first radix). The second pass will sort this list according to the first column (i.e. the MSB, our second radix), and since the sorting method is stable, the 3 numbers 0xAA, 0xAB and 0xAC, which share the same MSB, will be found in the output in the same order as in the input. And since that order was determined by the first sorting pass, they are already sorted according to the LSB, and here we are with the final sorted list:

 

0xAA

0xAB

0xAC

0xBA

0xBB

0xBC

 

Radix selection is performed by shifting and ANDing the input value according to the pass number:

 

unsigned char Radix = (InputValues[i]>>(Pass<<3)) & 0xFF;

 

…where Pass starts at 0 and is incremented for each new pass.
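
 

Putting all the pieces together, a complete byte-based Radix Sort for unsigned 32-bit integers could look like the following minimal sketch. This is only an illustration, not the code shipped with this article; the function name RadixSortUnsigned and the double-buffering between Input and Output are mine:

   #include <string.h>   // for memset

   // Minimal sketch of a byte-based Radix Sort for unsigned 32-bit values.
   // Input and Output are two caller-allocated buffers of NbItems values each.
   // After the 4 passes (an even number), the sorted values end up back in Input.
   void RadixSortUnsigned(unsigned int* Input, unsigned int* Output, int NbItems)
   {
        unsigned int* Source = Input;
        unsigned int* Dest   = Output;

        for(int Pass=0; Pass<4; Pass++)
        {
             // 1) Create the counters for this radix
             int Counters[256];
             memset(Counters, 0, 256*sizeof(int));
             for(int i=0; i<NbItems; i++)
                  Counters[(Source[i]>>(Pass<<3)) & 0xFF]++;

             // 2) Create the offset table from the counters
             int OffsetTable[256];
             OffsetTable[0] = 0;
             for(int j=1; j<256; j++)
                  OffsetTable[j] = OffsetTable[j-1] + Counters[j-1];

             // 3) Read the input values and store them directly at the right place
             for(int i=0; i<NbItems; i++)
             {
                  unsigned char Radix = (Source[i]>>(Pass<<3)) & 0xFF;
                  Dest[OffsetTable[Radix]++] = Source[i];
             }

             // 4) Swap the buffers for the next pass
             unsigned int* Tmp = Source;  Source = Dest;  Dest = Tmp;
        }
   }

In a real implementation the four counter tables are usually built in a single initial pass over the data, instead of one counting pass per radix as in this sketch.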

 

 

 

Sorting floating point values

 

Ok, that was the standard Radix Sort, as most of us knew it on old 16-bit computers. What happened then was the following: one day, we all switched to the 80486 (or Pentium) and suddenly we had floats. At first sight, a Radix Sort can't deal with floats: the inner structure of a float isn't obvious, and you don't really see how to shift them or AND them with a mask (the algorithm involves such operations, as seen just above).

 

The first step is to learn how a floating-point value is actually built. This is well known nowadays: the first bit is the sign bit, the next 8 bits are the biased exponent, and the last 23 are the mantissa. The exponent is always positive thanks to the bias. Chris Hecker once wrote a good introductory paper on the numerous floating-point tricks you can pull off on Intel processors [1]. Let's quote him:

 

« Because the exponents are always positive (and are in more significant bits than the mantissa), large numbers compare greater than small numbers even when the floating-point values are compared as normal integer bits. The sign bit throws a monkey wrench in this, but it works great for single-signed values. »

 

I couldn't have said that better. And from this point, it is easy to see why it works with positive floats. Let's define a function IR(x) as the integer representation of any floating-point value x. For example the floating-point value 42.0 has a binary representation of 0x42280000. In other words:

 

IR(42.0) = 0x42280000

 

The Radix Sort works with positive floats because for any floating-point values x and y,

 

x > y > 0  =>  IR(x) > IR(y)

 

Hence x and y will be treated by the sorting code as IR(x) and IR(y), and the final order will be correct. This is easy to check: just cast your float pointer to an integer pointer and ask your code to sort your array of positive floats! It works. Well, it doesn't work on DEC Alpha because the float format is different, but that's another story: it works provided you use IEEE floats, which means it works on PC, on Dreamcast, and so on.
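
 

As a quick illustration, reusing the hypothetical RadixSortUnsigned sketch from the previous section on a small array of positive floats could look like this:

   // Sorting positive IEEE floats by reading their bits as unsigned integers.
   // RadixSortUnsigned is the sketch from the previous section.
   float Values[4] = { 42.0f, 0.5f, 1000.0f, 3.14f };
   float Temp[4];   // scratch buffer used by the sort
   RadixSortUnsigned((unsigned int*)Values, (unsigned int*)Temp, 4);
   // Since x > y > 0 implies IR(x) > IR(y), Values now holds 0.5, 3.14, 42.0, 1000.0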

 

 

 

Entering the negative zone

 

Things begin to be painful with negative floats. Recall the float inner structure: the sign bit is just the most significant one, and the only difference between x and -x is that sign bit. On one hand, this provides us with an ultra-fast fabs() routine (just one bit to clear):

 

   inline float fabs(float x){
        unsigned int y = (unsigned int&)x & 0x7fffffff;   // clear the sign bit
        return (float&)y;
   }

 

On the other hand, we’re in trouble because for any floating-point values x and y,

 

x < y < 0  =>  IR(x) > IR(y)

 

What does it mean? Simple: the sorting code goes berserk, and sees our two negative floats as huge positive integers. Here's a possible output of a standard Radix Sort on 10 arbitrary floats:

 

2083.000000

2785.000000

8080.000000

10116.000000

10578.000000

12974.000000

-660.000000

-4906.000000

-10050.000000

-16343.000000

 

Not really what we want, and that’s why people usually say a Radix Sort doesn’t work with negative floats.

 

Oh, well. Let’s make it work then !

 

A nice and obvious thing to see is that the negative values are sorted anyway; they're just in the wrong place (after the positive ones) and in the wrong order (large negative values are listed after the small ones).

 

The first possible hack is something like the « brute-force way of life »: read the sorted buffer in search of the first negative value (sublinear time), deduce from its position the number of negative floating-point values, copy all the positive floats to their right place in another buffer, then copy the remaining negative values in reverse order at the beginning of the new buffer. Ouch! It works, but it's quite ugly. Moreover, if you're sorting a lot of values (I'm doing all my tests with 10000), this is far from optimal.

 

The elegant way is to tackle the problem before actually sorting the values.

 

First of all, we want to know how many negative values we're dealing with. No need to read the input or output buffers once again: we actually already have this information, hidden in the counters. For integers as well as for floats, the sign is determined by the most significant bit. So what we want is just the number of values whose most significant bit is 1. In our radix process, those are to be found in the last 128 entries of the last counter table. All we have to do is sum them up:

 

   NbNegativeValues = 0;

   for(i=128;i<256;i++){

         NbNegativeValues += LastCounter[i];

   }

 

That is elegant. Of course if you're sorting fewer than 128 values, you'd better stick to the brute-force solution. But a Radix Sort, due to the initial overhead, is not something to use blindly on little data sets. If you use it as I do, for example to sort your alpha-blended polygons (which are numerous nowadays), I assume the second method is a definite winner.

 

Now comes the delicate part: we want to patch our offsets so that the final order is correct. We must fix two things, remember? The wrong place and the wrong order. The easiest way is to take full control of the offsets by introducing a specific piece of code in the fourth pass of our sorting process. Once that pass is isolated, we're free to tweak our offsets the way we want.

 

Fixing offsets for positive values is a piece of cake. Let M be the number of negative values previously computed; we must ensure that positive floats begin after the M negative ones. Fixing the wrong place is then done by forcing the first positive offset to be M:

 

Offset[0] = M;

 

Since the order was right for positive floats, we’re done.

 

Now for negative values. Recall from the example of sorted floats above that the very last sorted value (-16343.0) is the most negative float, or in other words the one that should come first in the sorted output. We can force that quite simply by clearing the offset associated with the very last radix value:

 

Offset[255] = 0;

 

From this point, we want the remaining negative values to go in increasing order, i.e. we must reverse the order of the initial Radix Sort. This is the most delicate part, which implies computing the offsets in a totally hacked way:

 

   for(i=0;i<127;i++){

        Offset[254-i] = Offset[255-i] + LastCounter[255-i];

   }

 

This code computes the offsets for entries 254-0 down to 254-126, i.e. from 254 down to 128, or in other words the offsets relative to negative values. Computing them in reverse order, starting from Offset[255] (which is null), allows our final list of sorted negative floats to be in the right order. The only little thing we must take care of is the way we update our offsets for negative values… For positive ones the offsets were increased; for negative ones they must be decreased, so that the order within each group of equal radices gets reversed. That is painful because our sorting loop must detect whether we're handling a positive or a negative value before updating the offset. Well, usually I live with that. The corresponding code path is only used for the last sorting pass, and in practice the detection is nearly free in terms of CPU time. Despite the cruel lack of elegance, it sorts your negative floating-point values in linear time, and that's what matters.
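
 

To make this concrete, here is a rough sketch of what the whole last pass can look like once the offsets are patched. It follows the snippets above (Offset, LastCounter, M, NbItems) and reads the floats as unsigned integers from a Source buffer into a Dest buffer; the exact bookkeeping in the actual source code may differ. Note that each negative offset is moved one slot past the end of its group before the pass, so that pre-decrementing it reverses the order inside the group:

   // Sketch of the last (4th) pass for floats, with patched offsets.
   // M = number of negative values, LastCounter = counters of the 4th radix.

   // Positive floats: they start after the M negative ones, order unchanged
   Offset[0] = M;
   for(i=1;i<128;i++)   Offset[i] = Offset[i-1] + LastCounter[i-1];

   // Negative floats: groups laid out in reverse radix order (255 first)
   Offset[255] = 0;
   for(i=0;i<127;i++)   Offset[254-i] = Offset[255-i] + LastCounter[255-i];
   // Make each negative offset point one past the end of its group
   for(i=128;i<256;i++) Offset[i] += LastCounter[i];

   // Last sorting pass
   for(i=0;i<NbItems;i++)
   {
        unsigned int Value  = Source[i];     // the float, read as an integer
        unsigned char Radix = Value>>24;     // most significant byte (sign + exponent bits)
        if(Radix<128) Dest[Offset[Radix]++] = Value;  // positive float: increase the offset
        else          Dest[--Offset[Radix]] = Value;  // negative float: decrease the offset
   }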

 

 

 

Reducing the complexity

 

The Radix Sort has a linear complexity of O(k*N). The crude algorithm uses one pass to create the counters, and 4 passes to sort 32-bit values according to 4 different radices. Hence the complexity is O(5*N); this is the best as well as the worst running time for this algorithm, which always runs in the same time regardless of the initial order of the input.

 

Nevertheless, the information gathered by the very first pass can be used to improve things quite a bit. For example, when sorting words instead of dwords, the last two passes are useless since all the involved bytes are null – in other words, neither of those two passes will change the order the list is already in. The idea, then, is to detect those pathological cases and take advantage of them. When a sorting pass is useless, we can just skip it.

 

How can we detect that the pass is useless?

 

Of course we could trust the user, and our sorting code could have an input parameter telling us to perform only one or two passes. But we can do better than that. This parameter can be computed in a very cheap way thanks to the counters we already have: we just have to check whether one of them is equal to the number of input values or not. If it is, then all bytes are the same for this pass and we can safely skip it. If it isn't, well, we just do the normal job. Detection is very cheap since we only need to check 256 values in the worst case. In most cases we can immediately keep the pass (we found a counter which is neither null nor equal to the number of values) or discard it (we found a counter equal to the number of values).
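
 

A minimal sketch of this detection for one radix could look like this (the loop layout is mine, not necessarily the one used in the actual source code):

   // Sketch: decide whether the sorting pass for one radix can be skipped.
   // Counters is the 256-entry table built for this radix, NbItems the number of values.
   bool PerformPass = true;
   for(i=0;i<256;i++)
   {
        if(Counters[i]==NbItems){ PerformPass = false; break; }  // all bytes identical: skip the pass
        if(Counters[i]!=0) break;                                // several different bytes exist: keep the pass
   }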

 

This is a very nice feature for our Radix Sort: now it automatically adapts itself to the input values. Hence, complexity is reduced to O(3*N) to sort words, and O(2*N) to sort bytes. Running time is also reduced for floating-point values sharing the same mantissa, or the same sign and exponent fields. It actually happens more often than one may think!

 

 

 

Temporal coherence

 

When input values are already sorted, even a simple Bubble Sort runs faster than a Radix Sort, because the Bubble Sort reads data once and immediately exits. In this case, the Radix Sort reads data once to create the counters, then again and again to actually sort the values.

 

To take advantage of temporal coherence in our Radix Sort, we can make a very small modification to the code which creates the counters. Since we already have to read the input buffer there, there's very little extra price to pay to detect already-sorted input values. We just have to keep comparing the current input value to the previous one, and update a flag telling us whether we need to sort or not. Once again, this is performed in the already existing first loop, so the overhead is negligible: only a few assembly instructions. And now we can tell when the input buffer is already sorted, and just exit in those cases.
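
 

For example, the counter-creation loop from the beginning of the article could be extended like this (a sketch, reusing the InputValues, Counters and NbItems names from the previous snippets):

   // Sketch: detect an already-sorted input while creating the counters.
   bool AlreadySorted = true;
   for(i=0;i<NbItems;i++)
   {
        unsigned char c = InputValues[i];
        if(i && InputValues[i]<InputValues[i-1]) AlreadySorted = false;  // found a value out of order
        Counters[c]++;                                                   // usual counter update
   }
   if(AlreadySorted) return;   // the input buffer is already sorted: nothing to do

For floats, the comparison should of course be done on the floating-point values themselves, since the integer representations do not preserve the order of negative values.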

 

Temporal coherence has a very interesting side-effect: the possibility of automatically supporting multiple sort keys. Say you're working on a MAX exporter, and you want to transform MAX native data into something more hardware-friendly. You'll end up dealing with a bunch of faces you'll need to sort according to their rendering properties. For example each face may have a material ID and a smoothing-group dword. The ideal sort key would then be a 64-bit qword: the 32-bit material ID followed by the 32 smoothing-group bits. With a standard routine you can sort according to the first key (e.g. the material ID), which leads to multiple groups of faces sharing the same material. Then, most of the time, you call the sort routine again for each group to finish the job. This is not very elegant, and it quickly becomes a real mess when you have 3, 4 or even more rendering properties for each face. Our enhanced Radix Sort automatically takes advantage of temporal coherence, i.e. it uses information from the previous sort to initiate the new one. As a result, all you have to do is call the sort routine multiple times without doing anything between the calls. It just gets sorted in one line, as in the following code example.

 

   // say you have N faces to sort

   udword* MaterialID;  // a list of N material Ids

   udword* SmGrps;      // a list of N smoothing groups

   udword* Whatever;    // a list of N rendering properties

 

   // Create a Radix Sorter

   RadixSorter Core;

 

   // Multiple sorts

   Core.Sort(MaterialID, N).Sort(SmGrps, N).Sort(Whatever, N);

 

   // Get a list of N sorted indices, according to all rendering properties

   udword* SortedList = Core.GetIndices();

 

 

 

Applications

 

Once your Radix Sort correctly handles floating-point values, there are tons of ways you can use it in Computer Graphics. Here are two obvious examples:

 

-         Sorting transparent polygons: the correct display of alpha-blended polygons requires you to sort them before sending them to the hardware, unless you use a ONE-ONE blending mode. Some people rely on a Bubble Sort for that, because temporal coherence usually makes it very efficient for this kind of task (polygons are almost already sorted from one frame to the next). But a Bubble Sort is evil because it may lead to O(N^2) complexity. A Radix Sort is way safer, and moreover it can also use temporal coherence.

 

-         Collision detection: in Ming C. Lin's GDC'2000 presentation about collision detection [4], you can find a chapter entitled « Use of sorting methods », which states that the sweep-and-prune algorithm should use an initial quick-sort, then rely on temporal coherence to achieve linear running time. Actually, my own sweep-and-prune code uses an enhanced Radix Sort and always runs in O(N), with or without temporal coherence. The code is simpler than the one included in standard collision detection packages such as V-Collide, it's easier to understand, and theoretically it runs just as fast. The now-linear sweep-and-prune technique may also be extended to handle generic collisions of dynamic objects. For example, it may be very efficient for handling inter-collisions between particles.

 

 

 

Source code

 

This article comes with a fully working enhanced Radix Sort (here!), written in C++ for ease of use. Some years ago it was worth recoding it in assembly language, unrolling some loops and so on. Nowadays I just use this piece of C++ code and let it go.

 

 

 

Have fun!

 

 

 

Useful links

 

 

[1] Chris Hecker’s paper about floating-point tricks:

“Let’s get to the (floating) point”, Game Developer February/March 1996 issue

 

[2] Ming C. Lin’s fascinating homepage:

http://www.cs.unc.edu/~lin/

 

[3] COMP122 - Algorithms and analysis (jump to the Linear-Time Sort section):

http://www.cs.unc.edu/~lin/COMP122-F99/

 

[4] The excellent GDC’2000 frames:

http://www.cs.unc.edu/~lin/gdc2000_files/frame.htm

 

 

 

 

Addendum March 09, 2007:

 

Seven years later (!!), I finally found the time and motivation to try Michael Herf’s “3 passes” radix sort. The package also contains a newer version of the usual 4-passes radix sort, with extra optimizations contributed by Kyle Hubert.

 

Click here to download the new source code.