This is a recent mail I got from Stefan Gottschalk

This is a recent mail I got from Stefan Gottschalk. It contains quite interesting information about RAPID and how we could further improve that library.

Hi Pierre,

I enjoyed very much reading your comments on RAPID memory optimization on
http://www.codercorner.com/RAPID_Hack.htm.

You have some very good observations. I like your writing style and
attitude.

I was not surprised that you could reduce the memory footprint of RAPID by a
factor of 3.3, since I knew there was a lot of fat to be trimmed. However,
I was quite impressed with how you were able to achieve it with such small
changes to my code. Well done!

I thought you might be interested in the history and rationale of the design
of RAPID.

>Although very rapid indeed, the RAPID library unfortunately consumes
>a very large amount of memory (about 412 bytes per triangle, as reported in
>the original OBB Tree paper). Further research papers have actually used
>that major flaw to emphasize how well their new system behaved compared to
>RAPID. (see QuickCD [2] for example) As a matter of facts, the original
>RAPID library is very easy to optimize - provided we're speaking about the
>number of bytes required.

Yup, it's a real memory hog. The graphics group at UNC comp sci department
has its roots in virtual reality and realtime 3D graphics -- so the emphasis
was on speed, not memory footprint. Even though memory consumption was
horrendous, we had huge machines with huge amounts of memory, so I found
that the data structures were adequate on all but the largest models
(although I recall we had a 13 million polygon model, which I obviously
couldn't load). So, the emphasis was on speed and on robustness, with
memory footprint taking a backseat.

Later, when it turned out that game developers and folks working on PC
platforms were getting interested in RAPID, I got interested in reducing the
memory footprint. I thought about this and wrote some of it down in my
dissertation, but didn't pursue it very far. You implemented most of the
savings ideas mentioned in the paper (and very nicely, too! Nice job!). I
had some other observations, which I'll share with you now.

In addition to the things you address in the website, here is something I
didn't realize until after the paper was published: classes with virtual
functions carry a hidden cost. On one platform/compiler combo (g++ on an
SGI, I think) I found that virtual functions added at least 32 bytes to
every struct/class containing them. Ouch! RAPID 2.01 doesn't have virtual
functions, I think, but later implementations I experimented with did.
Also, virtual functions make it harder or impossible to optimize across
function invocations, because the compiler can't know at compile-time what
side effects the call might have (because it is a virtual function, it could
jump to just about anywhere!). So, when optimizing for memory size, avoid
virtual functions. I ran into this problem with a RAPID implementation
(never published) which supported OBB Trees, AABB Trees, and sphere trees
within the same framework. The virtual functions were killing me.

>This doesn't take into account the actual triangles. In the original
>paper those triangles are made of 3 vertices, or 9 doubles, ie 72 bytes. I
>assume some standard indexed vertices may be used instead, and a triangle
>can just be made of 3 word indices (6 bytes) or dword indices (12 bytes).
>This has no importance : since the geometry is indexed, it always exists
>somewhere as actual rendering data, and that's why I don't take it into
>account for the OBBTree memory requirements.

Exactly my thinking on the subject. Vertex pointers are nice. As long as
the app uses double[3] or float[3], the OBBT search structure can exist
alongside the vertex store, and just point at the vertices. However, some
rules need to be followed in this case: the app is NOT allowed to touch the
vertices. No morphing or editing the vertex locations, or all bets are off.
The library cannot mess with them, either.

I had originally considered pointers, but decided on making copies of the
geometry so that RAPID could process the vertex data in whatever way was
convenient, and likewise the app could do anything it wanted with its copy.
I don't recall if 2.01 did this, but one of the versions had the vertices
stored in the CS of the leaf box that contained them. This meant that I was
transforming my copy. It also precluded sharing vertices among triangles,
because different triangles were getting transformed into different box
CS's.

Robustness was paramount, and that includes making the library impervious to
careless app writers. By adhering to strict value semantics (making copies
of everything), I was insulating RAPID from potential side-effects of
application activity. That way, if someone issued a bug report, I didn't
have to consider how the app might be affecting my data structures.

If you go to pointers, you save memory, maybe even gain speed, but you give
up a little on the app/library division, and maintainability suffers just a
tad. Of course, if you are maintaining both the library and the app, this
really isn't an issue.

>Ok, I like RAPID very much. Honestly, I love that library, and the
>algorithms behind. But when it comes to the implementation part, well, I'm
>puzzled. What were they thinking ?

I feel compelled to respond to this. I can see why you'd think this was
boneheaded coding (I would too). I'll explain below.

>The negative box pointer actually is useless.
>N = P + 1 ;
>and
>Obvious conclusion : we can just pack the triangle pointer in the P
>pointer (N doesn't exist anymore since 2)).

Yup! Very astute. Originally, the left and right subtrees were
independently allocated from free store, so there was no particular
relationship between the N and P pointers. It worked great, but tree
building performance stank up the place. All those allocations were costing
time. So, I reorganized and allocated a single block of memory for all the
boxes at once. Actually, I think I did a lazy-size-doubling thing since I
didn't know how many vertices/faces there would be in advance, which means
O(log_2(N)) allocations and extra copying of the model data, but it was a
win anyway. Once I had all V vertices in hand, I knew that there would by
B=2V-1 boxes in my tree. So then I had this array in which sibling nodes
would occupy adjacent elements because of the way things were allocated, but
I didn't want to hardcode the relationship, just in case I wanted to go back
to a more general allocation scheme later.

But if we want to save space at any cost, we can take it even further. We
don't need to use P and N as pointers. Our boxes aren't going to be just
anywhere in the memory, but at specific address multiples beyond the base of
the box array. In other words, just think of P as an index rather than a
pointer, and use it as an index into the box array. And, as you observed,
sibling boxes occupy adjacent elements.

But not only that, *their children are usually nearby*. Usually, but not
always. So, consider this: don't store the child's position in the array,
but store it's location *relative to the parent*. If you do this, you will
find that you are storing lots of little numbers, and just a few big
numbers. The big numbers are kept in a separate, special array, and the
little numbers are encoded directly into the "child reference". So here's
the trick, the "child reference" for a box actually contains a bit for
whether the children are near or far, and the other bits are for an index
value. If the near bit indicates NEAR, then the indexes are actually
offsets from parent position in the box array. If the near bit indicates
FAR, then the index is an index into a special "far" array, whose elements
contain the larger offsets.

So what, you may ask? A number is a number. Pointer, index, what's the
difference? The difference is that pointers have to be 32 bits, and they
are more general than we require. Our special "child reference" can be
whatever number of bits we want, whether 8 or 16 or 32, or even stranger
numbers if you make the effort. Of course, using a limited number of bits
for the "child reference" limits how large your near offset is allowed to
be, and also limits how many elements you can index in the far array. The
smaller your NEAR offset limit, the more of the offsets have to be kept in
the FAR array, and when you run out of elements there, you're stuck, you
can't encode the tree structure. So there is a limit to how many boxes you
can store. But I worked it out, and we can use a 16-bit value to encode a
binary tree containing millions of boxes, which should be enough for most
applications. Push this out to 24-bits, and we're talking absolutely huge
models. Someday machines that use 64-bit pointers will be commonplace, and
we'll find this approach much more space efficient. I thnk the relationship
is N=O(log log V) where V is the number of vertices and N is the number of
bits in the child reference. I haven't studied how 8-bit "child reference"
limit the size of the models we can encode. It may be large enough for some
games, but I'm not sure. Probably need 10-bit references to get models of
several thousand polygons.

As you mentioned above, you also want a bit for indicating whether this is a
leaf node, in which case the NEAR/FAR bit and index are used to locate
triangle objects. That steals a bit from our "child reference". The
relative addressing is trickier in the case of triangles, because there are
twice as many boxes as there are triangles in the binary OBB Tree, but it
still works pretty well. You just take the current box index, multiply by
two, and use that as the base triangle index. Then you add the encoded
offset (whether near or far).

>The OBB tree takes a lot of memory, but only a little subset is
>actually needed to resolve collision queries at runtime. It should be
>possible to keep the OBB tree compressed, and only decompress needed
>branches on the fly - which of course saves a lot of memory, but needs some
>more CPU time.
>Used in a standard rigid-body simulator, this method should be
>efficient, because we often need to make multiple collision queries in
>search of the exact collision time. If we use temporal coherence, we can
>decompress a given branch the first time it's needed, then keep it in a
>local cache when searching for the exact collision time

Excellent observations. I never thought of the temporal coherence aspect.
Depending on the degree of coherence, a seemingly "inefficient"
decompression method may still be a win, because the work may have to be
done so infrequently that the extra overhead is negligible.

One very important effect of reducing the memory footprint of these trees is
that the number of cache misses is also reduced, which may in turn lead to
faster execution! Pretty cool: faster code AND less memory. Compression
techniques in general chew up CPU time in exchange for lesser memory
bandwidth demand. Given that CPU speed is increasing faster than memory
speed, this can result in faster running times overall, with the gap
widening as technology progresses. So these ideas have a future!

-stefan