RAPID Hack
Pierre Terdiman
Previous revision : May, 2000
Last revision : August, 2002
This little document presents several modifications of the RAPID library I
implemented in my own collision detection system (ZCollide).
All those modifications were made in order to reduce the memory
requirements of RAPID 2.01 [1]. Although very rapid indeed, the RAPID
library unfortunately consumes a very large amount of memory (about 412 bytes
per triangle, as reported in the original OBB Tree paper). Further research
papers have actually used that major flaw to emphasize how well their new
system behaved compared to RAPID. (see QuickCD [2] for example) As a
matter of facts, the original RAPID library is very easy to optimize – provided
we’re speaking about the number of bytes required.
Here’s the box class found in obb.H :
class box
{
public:
// placement in parent's space
// box to parent space: x_m = pR*x_b + pT
// parent to box space: x_b = pR.T()*(x_m 
pT)
double pR[3][3];
double pT[3];
// dimensions
double d[3]; // this is "radius", that is,
// half the measure of a
side length
box *P;
// points to but does not "own".
box *N;
tri *trp;
int leaf() { return (!P && !N); }
double size() { return d[0]; }
int split_recurse(int *t, int n);
int split_recurse(int *t); // specialized for leaf nodes
};
We have :
9 doubles for the rotation part, 72 bytes
3 doubles for the translation part, 24 bytes
3 doubles for the box extents, 24 bytes
1 box pointer (positive child), 4 bytes
1 box pointer (negative child), 4 bytes
1 triangle pointer, 4 bytes
Total : 132 bytes for each box. Total number of box is 2*number of
triangles – 1, so we need approximately 132 * 2 = 264 bytes per triangle.
This doesn’t take into account the actual triangles. In the original
paper those triangles are made of 3 vertices, or 9 doubles, ie 72 bytes. I
assume some standard indexed vertices may be used instead, and a triangle can
just be made of 3 word indices (6 bytes) or dword indices (12 bytes). This has
no importance : since the geometry is indexed, it always exists somewhere
as actual rendering data, and that’s why I don’t take it into account for the
OBBTree memory requirements.
1) Using singleprecision floatingpoint
values
This is of course the most trivial modification. Just replace all the
doubles with floats, it works just as well. In theory, this may conduct to a
loss of accuracy while recursively performing the collision queries. In
practice, I’ve never noticed any differences between the double and the single
precision version.
Anyway let’s quote the original paper:
« Single precision arithmetic can also be used to save
memory. »
Indeed. And our box now needs 132 – 36 – 12 – 12 = 72 bytes.
2) Wiping the negative box pointer out
Ok, I like RAPID very much. Honestly, I love that library, and the
algorithms behind. But when it comes to the implementation part, well, I’m
puzzled. What were they thinking ?
The negative box pointer actually is useless.
Here’s an excerpt from RAPID in build.cpp :
P = RAPID_boxes + RAPID_boxes_inited++;
N = RAPID_boxes + RAPID_boxes_inited++;
So what ?
The negative box will always be found just after the positive
one, and the negative box pointer can be made implicit. In other words, don’t
store it ! And when you need it in collide.cpp, just do :
N = P + 1 ;
Trust me, it works.
And the leaf test becomes :
int
leaf() { return (!P); }
…but don’t use that one, keep reading…
68 bytes.
3) Wiping the triangle pointer out
The triangle pointer is useless as well. It is always null, except for
leaf nodes. And leaf nodes, of course, don’t have any positive or negative
children since they’re leaf nodes. Hence, P and N are null for leaf nodes.
Obvious conclusion : we can just pack the triangle pointer in the P
pointer (N doesn’t exist anymore since 2)).
In order to do so, replace in build.cpp :
trp = ptr;
with
P = (box*)(unsigned int(ptr)  1);
Why setting the least significant bit ? Because we need a way to
detect leaf nodes anyway. The new leaf test is :
int leaf() { return unsigned int(P) & 1; }
Of course, I assume all box and triangle pointers have their original
least significant bit set to zero. This is a very cheap assumption, since it
just suppose data are 2bytes aligned. This should always happen, and in case
it doesn’t, forcing the alignment should be done anyway, for pure efficiency
reasons.
64 bytes.
1) Using quaternions
The rotation is stored with a 3*3 matrix. Let’s quote the original
paper :
« Using quaternions (…) results in substancial space savings, but
need 13 more operations per OBB overlap test. »
This is the usual tradeoff between memory and CPU time. Since we now
use singleprecision arithmetic, we actually already have won some CPU time,
and we can afford some more operations per overlap test.
Net saving : 4 floats instead of 9, we’re down to 44 bytes.
2) Wiping W out
Orientations are encoded by unit quaternions, so we can safely discard
one of the quaternion component (say W), because it’s implicit :
x^2 + y^2 + z^2 + w^2 = 1
ó
w^2 = 1 – x^2 
y^2  z^2
=> w = + /  sqrt(1 –
x^2  y^2  z^2)
We need to store one extra bit to make the difference between W and –W.
Fortunately there are a lot of places where we can store that extra bit :

in the second
least significant bit of the P / triangle pointer, which is just assuming data
are aligned on 4bytes boundaries

in the sign bit
of any of the 3 box extents – which are of course always positive, so 3 bits
are wasted there.
That modification also introduces an extra squareroot, but nowadays
it’s becoming really cheap. If needed we may use square root approximations, or
even some assembly code.
Another approach would have been to use Euler angles only.
Anyway, 40 bytes, which is 3.3 times better than the original code.
Updated august, 1, 2002 :
Stefano Bacelle sent a cool trick to get rid of the extra bit above :
unit quaternions Q and Q actually map the same rotation matrix. So, we don't
need to save the extra bit, we can just force W to be positive by negating the
whole quat if it is not. Works like the proverbial charm. This definitely makes
sense since Euler angles only need 3 floats to encode an orientation. So that
extra bit was redundant, one way or another. I just never realized it was so
simple to discard it. Thanks, Stefano !
3) Realtime compression
The OBB tree takes a lot of memory, but only a little subset is actually
needed to resolve collision queries at runtime. It should be possible to keep
the OBB tree compressed, and only decompress needed branches on the fly – which
of course saves a lot of memory, but needs some more CPU time.
Used in a standard rigidbody simulator, this method should be
efficient, because we often need to make multiple collision queries in search
of the exact collision time. If we use temporal coherence, we can decompress a
given branch the first time it’s needed, then keep it in a local cache when
searching for the exact collision time.
Anyway, this is something I haven’t actually tried – yet.
Updated august, 1, 2002 :
As promised I tried the approach in my own library, Opcode.
Related readings
[1] Gottschalk, S.,
M.C. Lin, and D. Manocha, « OBBTree : A Hierarchical Structure for
Rapid Interference Detection », Computer Graphics (SIGGRAPH’96
Proceedings), pp. 171180, August, 1996.
[2] Klosowski, James
Thomas, « Efficient Collision Detection for Interactive 3D Graphics and
Virtual Environments », PhD dissertation, May 1998