Batching incoherent queries

For a binary tree, a typical recursive collision detection function usually looks like this (in this example for a raycast):

void MyTree::Raycast(const Ray& ray, const Node* node)
{
// Perform ray-AABB overlap test
if(!RayAABBOverlap(ray, node->GetAABB()))
return;

// Ray touches the AABB. If the node is a leaf, test against triangle
if(node->IsLeaf())
{
if(RayTriangleOverlap(ray, node->GetTriangle())
RegisterHit();
}
else
{
// Internal node => recurse to left & right children
Raycast(ray, node->GetLeftChild());
Raycast(ray, node->GetRightChild());
}

}

This traditional tree traversal code has a few problems:

  • A vanilla version suffers from many cache misses since we keep jumping from one node to its children, and there is no guarantee that those child nodes will be close to us.
  • Prefetching might help. However a typical prefetch instruction has a relatively high latency, i.e. you need to wait a certain amount of cycles after the prefetch instruction has been issued, before the requested address is effectively “free” to read. The problem with a binary tree such as the above is that there is just not enough work to do in one node (basically just one ray-AABB test).
  • The problem is the same on platforms where you need to DMA the next nodes. Even if you start your DMA right when entering the function, it may not be finished by the time you reach the end of it.

There are different ways to combat this issue. You can try a more cache-friendly tree layout, say van Emde Boas. You can try a more cache-friendly tree, say N-ary instead of binary. But those alternatives often come with their own set of problems, and most lack the simplicity of a binary tree.

So instead, I would like to do my Mike Acton and ask: when is the last time you had to raycast one ray against that tree? This is not the typical case. The typical case, the generic case even, is when you end up with many raycasts against the same tree.

This kind of batched queries has been done before for raycasts. I think it is called “packet query”. But as far as I know, it has always been described for “coherent rays” (and only 4 at a time IIRC), i.e. the kind of rays you get in raytracing when firing N rays from the same pixel (say for supersampling), or from close pixels. In other words the rays are very similar to each other, going in the same direction, and thus they are likely to touch the same nodes during the recursive tree traversal.

But where is this restriction coming from? Who said the rays had to be coherent? They do not have to. It is equally simple to collide N incoherent rays against a mesh.

Here is how.

We need a mesh, i.e. a binary tree, and then a linear array of rays. I will give the code first and then comment it. The code looks roughly like this:

void MyTree::MultiRaycast(int nbRays, const Ray* rays, const Node* node)
{
// Collide all rays against current AABB
int Offset;
{
const AABB& Box = node->GetAABB();

Offset = 0;
int Last = nbRays;

while(Offset!=Last)
{
if(RayAABBOverlap(rays[Offset], Box))
{
Offset++;
}
else
{
// Do a ‘move-to-front’ transform on rays that touch the box
Last–;
Swap(rays[Offset], rays[Last]);
}
}
if(!Offset)
return;
}

// Here, Offset = number of rays that touched the box. The array has been reorganized so
// that those rays that passed the test are now at the beginning of the array. The rays
// that did not touch the box are all at the end.

// If the node is a leaf, test surviving rays against triangle
if(node->IsLeaf())
{
const Triangle T = node->GetTriangle();
for(int i=0;i<Offset;i++)
{
if(RayTriangleOverlap(rays[i], T)
RegisterHit();
}
}
else
{
// Internal node => recurse to left & right children
MultiRaycast(Offset, rays, node->GetLeftChild());
MultiRaycast(Offset, rays, node->GetRightChild());
}

}

So first we test all incoming rays against the node’s AABB. When doing so, we reorganize the rays so that the ones that passed the test are now at the beginning of the array. The rays that did not touch the box are all at the end. This is easily done with a simple ‘move-to-front’ operation. Astute readers will probably notice that this is similar to how the tree itself was built in Opcode, when the ‘positive’ and ‘negative’ primitives got reorganized at build time, in a similar fashion.

After the AABB test, if the node is a leaf, we simply test all surviving rays against the node’s triangle. Otherwise we just recurse as usual, with a simple twist: we pass down the number of surviving rays, not the original number.

And that’s it, done. Simple enough, right? The only vaguely subtle conceptual difficulty is how the recursive descent does not invalidate the re-ordering we did in the parent node. This is simply because regardless of how the child nodes reorganize the array, they are only going to reorganize the number we give them, i.e. the surviving rays. And no matter how those guys get shuffled down the tree, when the code comes back from the recursion (say after visiting the left child), we still pass a valid array of surviving rays to the right child. They are not in the same order anymore, but the permutation has only been done on the first half of the array, so we are still good to go.

So what do we get out of this? Well, a lot:

  • We are now working on N rays at a time, not just one. So that is a lot of work to do in each node, and there is plenty of time for your prefetch or your DMAs to finish.
  • The N rays we operate on are stored in a contiguous array, so accessing them is still very cache friendly. It would not be the case if we would have put all our rays in a hierarchy for example, and collided one ray-tree against the mesh.
  • We have N rays at a time, to collide against the same box or the same triangle. This is a perfect SIMD-friendly situation.
  • We traverse the tree only once now, so even if your tree layout is not cache-friendly at all, the number of traversal-related cache misses is minimized overall. In fact you can imagine it has just been divided by the number of batched rays…
  • We traverse each node only once now, so any extra work that you sometimes must do to access the AABB, like dequantization, is now only performed once instead of N times. Think about it: if you raycast N rays against the same mesh, you are dequantizing the top level AABB N times for example. With this new approach however, you do that only once.
  • In a similar way, we fetch a candidate triangle only once and raycast N rays against it. So the penalties associated with getting the triangle data (and there are quite a few, usually) are also reduced when multiple rays end up touching the same triangle.

I illustrated this idea with raycasts, but of course it works equally well with overlap tests or sweeps.

I do not know if it is a new idea or not, but I haven’t seen it described anywhere before.

4 Responses to “Batching incoherent queries”



  1. Brian Marshall Says:

    Neat. The move to front list reminds me of triage tables used by Jim Blinn when describing Warnock’s algorithm.

    Warnock has polygons in 3 categories. Surrounding, intersection and disjoint. Rather than intersecting/non-intersecting.

    Jim Blinn’s description of it is in his “Triage Tables” article in his Dirty Pixels book. Well worth reading if you’ve not seen it.

    -Brian.

  2. David Skyfall Says:

    I’ve seen it a few times, I can’t point you in any directions though.
    There a few problems, for example it’s to serial in nature for example gpus’. But I think the most common problem is you traverse in the same order so you can’t optimize for the closest hit point. For a very dense mesh that could be a severe performance penalty. There are a few cures though, split the rays in a few groups based on direction and traverse the tree based on the first alive ray (always first ray).

  3. admin Says:

    I have the Jim Blinn books and you’re completely right, the triage tables used a very similar trick!

    The approach is more serial than before only if you put all rays in one batch. But you don’t have to. If you have 4 work threads and 4000 rays to collide, just put 1000 rays per thread. That’s still 4 traversals instead of 4000.

    However it’s absolutely correct that it is now more difficult to do an informed traversal when looking for closest hits.

  4. Anders Nilsson Says:

    Shares goal with the \Faster Incoherent Rays: Multi-BVH Ray Stream Tracing\ paper (http://www.eng.uwaterloo.ca/~jtsakok/mbvhrs.pdf). The latter is mostly intended for CPU raytracing. I really liked that paper. Results seems to depend on the exact circumstances (coherency in queries, tree type etc).

shopfr.org cialis