Radix Sort Revisited
Pierre Terdiman
Last revision: 04.01.2000
In every decent
programmer’s toolbox lies a strange weapon called a Radix Sort. Where does it
come from ? Who invented it ? I don’t know. As far as I can remember
it was there, fast, easy, effective. Really effective. So unbelievably
useful I’ve never really understood why people would want to use something
else. The reasons ? Most of the time, they tell me about floats, negative
values, and why their new quicksort code rocks.
Enough, I’m tired.
Although the standard Radix Sort doesn’t work very well with floating point
values, this is something actually very easy to fix. In this little article I
will review the standard Radix Sort algorithm, and enhance it so that :

it sorts
negative floats as well

it has
reduced complexity for bytes and words

it uses
temporal coherence

it
supports sorting on multiple keys
Is it worth writing
anything in 2000 about a sort routine ? Doesn’t everyone already have
one ? Aren’t those things already well known ? Everyone knows how to
sort negative floats with a Radix, don’t you think ?
Well, that’s what I
would’ve said some weeks ago. But I recently wandered on Ming C. Lin’s homepage
[2], and began to read her courses. Here’s what I found :
« Counting sort
and radix sort are good for integers. For floating point numbers, try bucket
sort or other comparisonbased methods » (21 sept. 1999) [3]
As you see this is
not very old…I read more of the courses, read some older papers about collision
detection, and it appeared my own way of dealing with a specific part of this
problem was (at least from a theoretical point of view) faster than the
official one. Hence, I think this article will be useful for beginners as well
as for experienced programmers.
A Radix Sort is an
apparently bizarre sort routine which manages to sort values without actually
performing any comparisons on input data. That’s why this sort routine breaks
the theoretical lower bound of the O(N*logN) complexity, which only applies for
comparisonbased sorts. Radix is O(k*N), with k = 4 most of the time, and
although this is not an inplace sort (i.e. it uses extra storage) it is so
much faster than any other sorting methods it has become a very popular way of sorting
data.
The algorithm
What is a radix, anyway ?
Roughly we can say a radix is a position in a number. In the decimal
system, a radix is just a digit in a decimal number. For example the number
« 42 » has two digits, or two radices, which are 4 and 2. In
hexadecimal the radix is 8 bits wide. For example the hexadecimal number 0xAB
has two radices, A and B. The Radix Sort gets its name from those radices,
because the method first sorts the input values according to their first radix,
then according to the second one, and so on. The Radix Sort is then a multipass
sort, and the number of passes equals the number of radices in the input
values. For example you’ll need 4 passes to sort standard 32 bits integers,
since in hexadecimal the radix is a byte. By the way that’s why the Radix Sort
is often called Byte Sort.
How does it work ? Say you want to sort some bytes, for
example those ones:
54,
18, 2, 128, 3
The idea behind the Radix Sort is to read input values and immediately
store them at the right place.
Have a look at that sample code :
unsigned char InputValues[] = { 54, 18, 2, 128, 3 };
int SortedBuffer[256];
memset(SortedBuffer, 1, 256*sizeof(int)); // Fill with –1
for(int i=0 ;i<5 ;i++){
unsigned char c = InputValues[i];
SortedBuffer[c] = c;
}
// Now you can read SortedBuffer and get
values back in sorted order
You may think this example is stupid – it is ! – but it
nicely introduces the ideas we’ll have to deal with. What do we need to change
in that code, in order for it to become useful ? First we need to get rid
of the empty destination locations, so that the destination buffer (called
SortedBuffer in our example) has the same size as the input buffer (i.e. enough
for 5 values, no more). We also need a way to handle collisions – collisions in
the hashtable sense of the word, i.e. we must be able to deal with two equal
input values and know how to store both of them in the final buffer. Luckily
enough, both problems are solved by the same solution : an offset table.
The offset table is a 256entries table telling us, for each possible
input byte, where we should store the result. It is usually built in two
passes, one to compute the distribution of bytes in the input flow (i.e.
histograms, or counters), another one to create the offset table
according to this distribution.
This sample code creates the counters:
int
Counters[256];
memset(Counters, 0, 256*sizeof(int)); //
Set all counters to 0
for(
i =0 ; i < NbItems ; i++){ //
Loop over the input array
unsigned char c = InputValues[i]; //
Get current byte…
Counters[c]++; //
…and update counter
}
We can now create the offset table :
int OffsetTable[256];
OffsetTable[0] = 0;
for(i=1;i<256;i++){
OffsetTable[i] = OffsetTable[i1] +
Counters[i1];
}
Now, for each input byte, we can
get the right offset back thanks to this table, put the input byte at the right
place in the destination buffer, increase the offset, and repeat the sequence
for the next byte.
For example :
unsigned char c =
InputValues[i];
DestinationBuffer[OffsetTable[c]++] = c;
The destination buffer won’t have
any empty locations : we have enough room in it for the same number of
values as in the input buffer, no more. Thanks to the offsets we know exactly
where we must store them, and whenever an empty location would have existed in
our first example, here it doesn’t appear because no offset actually maps that
empty location.
Collisions are not a problem
either because offsets are increased each time we use one of them. If we have
two identical bytes in input, the first one will be put in
DestinationBuffer[Offset], Offset will be increased, and the next one will then
fall in DestinationBuffer[Offset+1]. Radix sort is stable by the way,
which means two same values in input will be in the same order in output.
This last point is very important
to understand the next step : how can we extend the process in order to
sort not only bytes but also words, dwords, or event character strings ?
This is actually very simple : we just have to sort again according to the
next radix. Recall a radix actually is a byte in the values to sort. The first
radix is the LSB (Least Significant Byte). The last one must be the MSB (Most
Significant Byte). Between those two, we perform all necessary passes. The
second pass doesn’t destroy what has been done in the first one because the
sorting method is stable.
Example : say we want to sort
those hexadecimal values :
0xBC, 0xAB, 0xBA, 0xAC, 0xBB, 0xAA
The first pass leads to :
0xBA
0xAA
0xAB
0xBB
0xBC
0xAC
Note that the list is sorted
according to the last column (ie the LSB which is the first radix). The second
pass will sort this list according to the first column (ie the MSB, our second
radix) and since the sorting method is stable, the 3 numbers 0xAA, 0xAB and
0xAC which share the same MSB will be found in output in the same order as in
input. And since the input order has been determined by the first sorting pass,
eh, it’s already sorted according to the LSB, and here we are with the final
sorted list :
0xAA
0xAB
0xAC
0xBA
0xBB
0xBC
Radix selection is
performed by shifting and ANDing the input value according to the pass
number :
unsigned
char Radix = (InputValues[i]>>(Pass<<3)) & 0xFF;
…where pass begins
to 0 and gets increased for each new pass.
Sorting floating point values
Ok, that was the
standard Radix Sort, as most of us knew it on old 16 bits computers. What
happened then, was the following : one day, we all switched to 80486 (or
Pentium) and suddenly we had floats. At first sight, a Radix Sort can’t deal
with floats : the inner structure of a float isn’t obvious, you don’t
really see how to shift them or AND a value with them (the algorithm involves
such an operation, as seen just above), you don’t even know how to do it.
The first step is to
learn how a floatingpoint value is actually built. This is well known
nowadays : the first bit is the sign bit, the next 8 bits are the biased
exponent, and the 23 last ones are the mantissa. The exponent is always
positive thanks to the bias. Chris Hecker once wrote a good introduction paper
to the numerous floatingpoints tricks you can afford on Intel processors [1].
Let’s quote him :
« Because the
exponent are always positive (and are in more significant bits than the
mantissa), large numbers compare greater than small numbers even when the
floatingpoint values are compared as normal integer bits. The sign bit throws
a monkey wrench in this, but it works great for singlesigned values. »
I couldn’t have said
that better. And from this point, it is easy to see why it works with positive
floats. Let’s define a function IR(x) as the integer representation of any
floatingpoint value x. For example the floatingpoint value 42.0 has a binary
representation of 0x42280000. In other words :
IR(42.0)
= 0x42280000
The Radix Sort works
with positive floats because for any floatingpoint values x and y,
x
> y >0 => IR(x) > IR(y)
Hence x and y will
be treated by the sorting code as IR(x) and IR(y), and the final order will be
correct. This is easy to check : just cast your float pointer into integer
pointer and ask your code to sort your array of positive floats ! It
works. Well, it doesn’t work on DEC Alpha because the float format is
different, but that’s another story : it works provided you use IEEE
floats, which means it works on PC, on Dreamcast, and so on.
Things begin to be painful with
negative floats. Recall the float inner structure : the sign bit is just
the most significant one, and the only difference between x and –x is that sign
bit. On one hand, this provides us with an ultrafast fabs() routine (just
one bit to clear):
inline float
fabs(float x){
return
(float&) ((unsigned int&)x)&0x7fffffff ;
}
On the other hand, we’re in
trouble because for any floatingpoint values x and y,
x <y <0 => IR(x) > IR(y)
What does it mean ?
Simple : the sorting code goes berserk, and sees our two negative floats
as huge positive integers. Here’s what could be the possible output of a
standard Radix Sort on 10 arbitrary floats :
2083.000000
2785.000000
8080.000000
10116.000000
10578.000000
12974.000000
660.000000
4906.000000
10050.000000
16343.000000
Not really what we want, and
that’s why people usually say a Radix Sort doesn’t work with negative floats.
Oh, well. Let’s make it work
then !
A nice and obvious thing to see is
that the negative values are sorted anyway, they’re just at the wrong place
(after the positive ones) and in the wrong order (big negative values are
listed after the little ones).
The first possible hack is
something like the « bruteforce way of life » : read the sorted
buffer in search of the first negative value (sublinear time), deduct from its
position the number of negative floatingpoints values, copy all the positive
floats at their right place in another buffer, then copy the remaining negative
values in reverse order at the beginning of the new buffer. Ouch ! It
works, but it’s quite ugly. Moreover, if you’re sorting a lot of values (I’m
doing all my tests with 10000), this is far from optimal.
The elegant way is to tackle the
problem before actually sorting the values.
First of all, we want to know how
many negative values we’re dealing with. No need to read the input or output
buffers once again : we actually already have this information, hidden in
the counters. For integers as for floats, sign is determined by the most
significant bit. So, what we want is just the number of entities whose most
significant bit is 1. In our radixprocess, those are to be found in the 128
last entries of the last counter. All we have to do is to sum them up :
NbNegativeValues
= 0;
for(i=128;i<256;i++){
NbNegativeValues += LastCounter[i];
}
That is elegant. Of
course if you’re sorting less than 128 values, you’d better stick to the
bruteforce solution. But a Radix Sort, due to the initial overhead, is not
something to use blindly on little data sets. If you use it as I do, for
example to sort your alphablended polygons (which are numerous nowadays), I
assume the second method is a definitive winner.
Now comes the delicate part :
we want to patch our offsets so that the final order is correct. We must fix
two things, remember ? The wrong place and the wrong order. The easiest
way is to take full control of the
offsets by introducing a specific piece of code in the fourth pass of our
sorting process. Once we’re isolated, we’re free to tweak our offsets the way
we want.
Fixing offsets for positive
values is a piece of cake. Let M be the number of negative values
previously computed, we must assure positive floats begin after M negative
ones. Fixing the wrong place is then done by forcing the first positive
offset to be M:
Offset[0] = M;
Since the order was right for
positive floats, we’re done.
Now for negative values. Recall
from the result example of sorted floats that the very last sorted value
(16343.0) is the biggest negative float, or in other words it should be the
very first sorted one. We can force that quite simply by clearing the
offset relative to the very last value :
Offset[255] = 0;
From this point, we want the
remaining negative values to go in increasing order, i.e. we must reverse the
order of the initial Radix Sort. This is the most delicate part, which implies computing
the offsets in a totally hacked way :
for(i=0;i<127;i++){
Offset[254i] = Offset[255i] + LastCounter[255i];
}
This code computes the offsets
ranging from 254  0 to 254  126, i.e. from 254 to 128, or in other words the
offsets relative to negative values. Computing them in reverse order, starting
from Offset[255] (which is null), allows our final list of sorted negative
floats to be in the right order. The only little thing we must take care of is
the way we update our offsets for negative values…. For positive ones offsets
were increased. For negative ones they must be decreased. That is
painful because our sorting loop must detect whether we’re handling a positive
or negative value before updating the offset. Well. Usually I live with that.
The corresponding code path is only used for the last sorting pass, and in
practice the detection is nearly CPUtime free. Despite the crual lack of
elegance, it sorts your negative floatingpoint values in linear time, and
that’s what matters.
The Radix Sort has a
linear complexity of O(k*N). The crude algorithm uses one pass to create
counters, and 4 passes to sort 32 bits values according to 4 different radices.
Hence the complexity is O(5*N), this is the best as well as the worst running
time for this algorithm, which always runs in constant time.
Nevertheless the
information gathered by the very first pass can be used to improve things quite
a bit. For example when sorting words instead of dwords, the two last passes
are useless since all involved bytes are null – in other words, none of those
two passes will change the order in which the list already is. The idea then,
is to detect those pathological cases and take advantage of them. When a
sorting pass is useless, we can just skip it.
How can we detect
that the pass is useless ?
Of course we could
trust the user, and our sorting code could have an input parameter, telling us
to only perform one or two passes. But we can do better than that. This
parameter can be computed in a very cheap way thanks to the counters we already
have : we just have to check whether one of them is equal to the number of
input values or not. If it is, then all bytes are the same for this pass and we
can safely skip it. If it isn’t, well, we just do the normal job. Detection is
very cheap since we only need to check 256 values in the worst case. In most
cases we can immediately keep the pass (a counter is not null and not equal to
the number of values) or discard it (a counter is equal to the number of
values).
This is a very nice
feature for our Radix Sort : now it automatically adapts itself to
the input values. Hence, complexity is reduced to O(3*N) to sort words, and
O(2*N) to sort bytes. Running time is also reduced for floatingpoint values
sharing the same mantissa, or the same sign and exponent fields. It actually
happens more often than one may think !
When input values are already
sorted, even a simple Bubble Sort runs faster than a Radix Sort, because the
Bubble Sort reads data once and immediately exits. In this case, the Radix Sort
reads data once to create the counters, then again and again to actually sort
the values.
To take advantage of temporal
coherence in our Radix Sort, we can do a very little modification in the code
which creates the counters. Since we already have to read the input buffer,
there’s a very little price to pay to be able to detect already sorted input
values. We just have to keep comparing the current input value to the previous
one, and update some flag telling us we need sorting or not. Once again, this
is performed in the already existing first loop, so the overhead is
negligible : only a few assembler instructions. And now we can say when
the input buffer is already sorted, and just exit in those cases.
Temporal coherence has a very
interesting sideeffect : the possibility to automatically support
multiple sort keys. Say you’re working on a MAX exporter, and you want to
transform MAX native data into something more hardwarefriendly. You’ll end up
dealing with a bunch of faces you’ll need to sort according to their rendering
properties. For example each face may have a material ID and a smoothing group
dword. The ideal sortkey would then be a 64 bits qword : the 32 bits material
ID followed by 32 smoothing groups bits. With a standard routine you can sort
according to the first key (e.g. the material ID), which leads to multiple
groups of faces sharing the same material. Then, most of the time you call the
sort routine again for each group to finish the job. This is not very elegant,
and it quickly becomes a real mess when you have 3, 4 or even more rendering
properties for each face. Our enhanced Radix Sort automatically takes advantage
of temporal coherence, i.e. uses information from the previous sort to initiate
the new one. As a result, all you have to do is to call the sort routine
multiple times without doing anything between the calls. It just gets
sorted in one line, as in the following code example.
// say you have N faces to sort
udword*
MaterialID; // a list of N material Ids
udword* SmGrps; // a list of N smoothing groups
udword* Whatever; // a list of N rendering properties
// Create a Radix
Sorter
RadixSorter Core;
// Multiple sorts
Core.Sort(MaterialID,
N).Sort(SmGrps, N).Sort(Whatever, N);
// Get a list of
N sorted indices, according to all rendering properties
udword*
SortedList = Core.GetIndices();
Applications
Once your Radix Sort correctly
handles floatingpoint values, there are tons of ways you can use it in
Computer Graphics. Here are two obvious examples :

Sorting transparent polygons : the
correct display of alphablended polygons requires you sort them before sending
them to the hardware, unless you use a ONEONE blending mode. Some people rely
on a Bubble Sort to do that, because the temporal coherence usually makes it
very efficient for this kind of stuff (polygons are almost already sorted from
one frame to another). But a Bubble Sort is evil because it may lead to
O(N^2) complexity. A Radix Sort is way safer, and can moreover also use
temporal coherence.

Collision detection: in Ming C. Lin’s GDC’2000 presentation
about collision detection [4], you can find a chapter entitled « Use of
sorting methods » which states the sweepandprune algorithm should
use an initial quicksort, then rely on temporal coherence to achieve linear
running time. Actually my own sweepandprune code uses an enhanced Radix Sort
and always runs in O(N), with or without
temporal coherence. The code is simpler than the one included in standard
collision detection packages such as VCollide, it’s easier to understand, and
theoretically runs just as fast. The now linear sweepandprune technique may
also be extended to handle generic collisions of dynamic objects. For example
it may be very efficient to handle intercollisions between particles.
Source code
This article comes with a fully working enhanced Radix Sort (here!), written in C++ for
ease of use. Some years ago it was worth recoding it in assembly language,
unrolling some loops and so on. Nowadays I just use this piece of C++ code and
let it go.
Have fun!
Useful links
[1] Chris Hecker’s paper about floatingpoint
tricks:
“Let’s get to the (floating) point”, Game Developer February/March 1996
issue
[2] Ming C. Lin’s fascinating homepage:
[3] COMP122  Algorithms and analysis (jump to the LinearTime Sort
section):
http://www.cs.unc.edu/~lin/COMP122F99/
[4] The excellent GDC’2000 frames:
http://www.cs.unc.edu/~lin/gdc2000_files/frame.htm
Addendum March 09, 2007:
Seven years later (!!), I finally found the time and motivation to try Michael Herf’s
“3 passes” radix sort. The package also contains a newer version of the usual
4passes radix sort, with extra optimizations contributed by Kyle Hubert.