People are confused

May 15th, 2012

It’s amazing how people are clueless about PhysX. It’s not the first time I read this kind of stuff on forums, but since this one is recent I’ll make a note here:

http://www.gamespot.com/forums/topic/29130252/what-do-you-expect-next-gen-for-dx1211.1-opengl-physx-havok-and-other-apis

“Havok has the advantage on PC for physics since Nvdia GPU accelerated physx is limited to their cards”

This is bullshit, plain and simple. Both Havok and PhysX work perfectly fine in software. PhysX works on anything from PC to PS3/Xbox/Mobile/etc. Basically they are similar libraries doing similar things.

But on top of that, PhysX accelerates some features on Nvidia GPU, while Havok does not. And PhysX is fully free, while you need to pay for Havok in commercial games. PhysX has the advantage here.

There are also differences in terms of performance, memory usage, robustness, features, API user-friendliness, etc. No clear winner in any of those.

Batching incoherent queries

April 7th, 2012

For a binary tree, a typical recursive collision detection function usually looks like this (in this example for a raycast):

void MyTree::Raycast(const Ray& ray, const Node* node)
{
// Perform ray-AABB overlap test
if(!RayAABBOverlap(ray, node->GetAABB()))
return;

// Ray touches the AABB. If the node is a leaf, test against triangle
if(node->IsLeaf())
{
if(RayTriangleOverlap(ray, node->GetTriangle())
RegisterHit();
}
else
{
// Internal node => recurse to left & right children
Raycast(ray, node->GetLeftChild());
Raycast(ray, node->GetRightChild());
}

}

This traditional tree traversal code has a few problems:

  • A vanilla version suffers from many cache misses since we keep jumping from one node to its children, and there is no guarantee that those child nodes will be close to us.
  • Prefetching might help. However a typical prefetch instruction has a relatively high latency, i.e. you need to wait a certain amount of cycles after the prefetch instruction has been issued, before the requested address is effectively “free” to read. The problem with a binary tree such as the above is that there is just not enough work to do in one node (basically just one ray-AABB test).
  • The problem is the same on platforms where you need to DMA the next nodes. Even if you start your DMA right when entering the function, it may not be finished by the time you reach the end of it.

There are different ways to combat this issue. You can try a more cache-friendly tree layout, say van Emde Boas. You can try a more cache-friendly tree, say N-ary instead of binary. But those alternatives often come with their own set of problems, and most lack the simplicity of a binary tree.

So instead, I would like to do my Mike Acton and ask: when is the last time you had to raycast one ray against that tree? This is not the typical case. The typical case, the generic case even, is when you end up with many raycasts against the same tree.

This kind of batched queries has been done before for raycasts. I think it is called “packet query”. But as far as I know, it has always been described for “coherent rays” (and only 4 at a time IIRC), i.e. the kind of rays you get in raytracing when firing N rays from the same pixel (say for supersampling), or from close pixels. In other words the rays are very similar to each other, going in the same direction, and thus they are likely to touch the same nodes during the recursive tree traversal.

But where is this restriction coming from? Who said the rays had to be coherent? They do not have to. It is equally simple to collide N incoherent rays against a mesh.

Here is how.

We need a mesh, i.e. a binary tree, and then a linear array of rays. I will give the code first and then comment it. The code looks roughly like this:

void MyTree::MultiRaycast(int nbRays, const Ray* rays, const Node* node)
{
// Collide all rays against current AABB
int Offset;
{
const AABB& Box = node->GetAABB();

Offset = 0;
int Last = nbRays;

while(Offset!=Last)
{
if(RayAABBOverlap(rays[Offset], Box))
{
Offset++;
}
else
{
// Do a ‘move-to-front’ transform on rays that touch the box
Last–;
Swap(rays[Offset], rays[Last]);
}
}
if(!Offset)
return;
}

// Here, Offset = number of rays that touched the box. The array has been reorganized so
// that those rays that passed the test are now at the beginning of the array. The rays
// that did not touch the box are all at the end.

// If the node is a leaf, test surviving rays against triangle
if(node->IsLeaf())
{
const Triangle T = node->GetTriangle();
for(int i=0;i<Offset;i++)
{
if(RayTriangleOverlap(rays[i], T)
RegisterHit();
}
}
else
{
// Internal node => recurse to left & right children
MultiRaycast(Offset, rays, node->GetLeftChild());
MultiRaycast(Offset, rays, node->GetRightChild());
}

}

So first we test all incoming rays against the node’s AABB. When doing so, we reorganize the rays so that the ones that passed the test are now at the beginning of the array. The rays that did not touch the box are all at the end. This is easily done with a simple ‘move-to-front’ operation. Astute readers will probably notice that this is similar to how the tree itself was built in Opcode, when the ‘positive’ and ‘negative’ primitives got reorganized at build time, in a similar fashion.

After the AABB test, if the node is a leaf, we simply test all surviving rays against the node’s triangle. Otherwise we just recurse as usual, with a simple twist: we pass down the number of surviving rays, not the original number.

And that’s it, done. Simple enough, right? The only vaguely subtle conceptual difficulty is how the recursive descent does not invalidate the re-ordering we did in the parent node. This is simply because regardless of how the child nodes reorganize the array, they are only going to reorganize the number we give them, i.e. the surviving rays. And no matter how those guys get shuffled down the tree, when the code comes back from the recursion (say after visiting the left child), we still pass a valid array of surviving rays to the right child. They are not in the same order anymore, but the permutation has only been done on the first half of the array, so we are still good to go.

So what do we get out of this? Well, a lot:

  • We are now working on N rays at a time, not just one. So that is a lot of work to do in each node, and there is plenty of time for your prefetch or your DMAs to finish.
  • The N rays we operate on are stored in a contiguous array, so accessing them is still very cache friendly. It would not be the case if we would have put all our rays in a hierarchy for example, and collided one ray-tree against the mesh.
  • We have N rays at a time, to collide against the same box or the same triangle. This is a perfect SIMD-friendly situation.
  • We traverse the tree only once now, so even if your tree layout is not cache-friendly at all, the number of traversal-related cache misses is minimized overall. In fact you can imagine it has just been divided by the number of batched rays…
  • We traverse each node only once now, so any extra work that you sometimes must do to access the AABB, like dequantization, is now only performed once instead of N times. Think about it: if you raycast N rays against the same mesh, you are dequantizing the top level AABB N times for example. With this new approach however, you do that only once.
  • In a similar way, we fetch a candidate triangle only once and raycast N rays against it. So the penalties associated with getting the triangle data (and there are quite a few, usually) are also reduced when multiple rays end up touching the same triangle.

I illustrated this idea with raycasts, but of course it works equally well with overlap tests or sweeps.

I do not know if it is a new idea or not, but I haven’t seen it described anywhere before.

Fracture demo from GDC

March 16th, 2012

http://www.youtube.com/watch?v=QxJacxI4_oE&feature=share

New broad-phase demo

December 22nd, 2011

We recently had a friendly-yet-furious internal competition at work, where the goal was to create the fastest broad-phase algorithm ever.

We slowly reached the limits of what we can do with a 3-axes sweep-and-prune (SAP), and so we were trying to finally move away from it. This is not very easy, as SAP is very efficient in typical game scenes. Unfortunately as the total number of objects in games increase, and game worlds become more and more dynamic, the SAP performance increasingly becomes an issue.

Several people attacked the problem with several (and very different) approaches. I submitted two entries: one was the multi-SAP (MSAP) broad-phase I previously mentioned; the other one is something I probably cannot talk about yet, so let’s just call it “broad-phase X” for now. Long story short: this last algorithm won.

Out of curiosity, and since I had done it before with MSAP already, I tried it out in Bullet’s “CDTestFramework” to compare it against Bullet’s algorithms. On my machine the new broad-phase is a little bit more than 4X faster than any of Bullet’s versions, and more than 2X faster than MSAP. Of course the relative performances of all those algorithms vary with the total number of objects in the scene, how many of them are static, how many of them are moving, how fast they are moving, etc. So there isn’t usually a clear winner for all possible scenarios. Still, at work we tested those new algorithms in dozens and dozens of scenes with different characteristics, and this “broad-phase X” always performed remarkably well.

The demo is straight out of Bullet 2.78, compiled in Release mode with full optimizations and the SSE2 flag enabled.

Click here to download

The new broad-phase should be available in PhysX 3.3. Stay tuned! :)

Rayman Origins

December 14th, 2011

I’m listed in the Rayman Origins credits, in the “Special Thanks” section :)

I just contributed some code. Yay!

http://www.youtube.com/watch?v=jG2OU5fJyPM (around 9′20)

Iceland rules :)

December 1st, 2011

http://richarddawkins.net/articles/644070-let-s-talk-about-evolution

Knew it.

Hello world!

October 26th, 2011

Raphaël was born on october 23, 2011. He says hello!

Pensées du jour

September 30th, 2011

To iterate is human, to recurse is divine, and the Big Bang was a stack overflow.

—-

I should have a patent on patents. People would pay me recursively and I would get a cash overflow!

—-

The true meaning of LHS is “Little Hidden Stall”.

—-

A stupid magician is a Garcimoron!  (*)

(*) I realize only 1% of the audience will get that one :)

Simulationism

September 18th, 2011

Simulationism is basically the belief that we actually live in a computer simulation. While not seriously believing in it, I found it interesting from a programmer’s point of view. Bizarrely a lot of things “make sense” from this perspective:

- We live in a giant, cosmic version of a computer simulation. It started with a Big Bang, equivalent of a Big Boot. It might end up with a reboot. The universe is simply the sandbox in which the simulation runs.

- The entity that created this sandbox can be seen as the First Scientist, or the First Programmer, or whatever you want to call it. It is a being outside of the universe, i.e. outside of the simulation. We might have been created in its image, or not. It depends on what the simulation is trying to prove or achieve, which is something we may never know.

- The laws of physics are simply the rules that have been hardcoded to make the simulation work. There might be no particular reason why the rules are what they are. Maybe they just make for an interesting simulation, in the same way totally artificial and arbitrary game rules create interesting gameplay.

- Fundamental physical constants appear meticulously tweaked because they actually are. Every programmer knows about those “magic values” that most of us used at one point or another to make everything work well. The First Programmer may have tweaked the gravity force, etc, with a cosmic slider just like we would tweak the lacunarity or the fractal increment when generating a procedural landscape.

- We are probably not the first simulation run. We are just one particular run, and if those constants all work out ok this is simply because a large number of failures might have preceded us. Things might not be perfect yet because we are a work in progress, in a way similar to what Teilhard De Chardin was writing (e.g. about the Omega Point). In this iteration humans may not be very wise yet, but we may do better in the next run.

- As a consequence, there might also be no reason for the existence of some things in our universe, in the same way there is often “dead code” remaining in a codebase. Introns in our DNA might be just that. As programmers know, there is no reason to optimize or even clean up your code before it even works. In other words we’re still prototype code, not production code.

-  We have no way to know what the outside world, where the First Programmer lives, looks like. We are quite simply “evolved virtual creatures” similar to the ones Karl Sims created a while back. Only much, much more involved, to the point that our conscience emerged. But then again, conscience might just be a convenient way to label what appears, to us, like unimaginable complexity (which reminds me of what Jean Guitton was saying about randomness: there’s no randomness, it just appears random to us because the forces that acted to produce a given event are beyond our analytical capabilities). So while conscience might appear like a miraculous trait to us, it might just be because we lack the proper caps to grasp it. In the same way a blind person can not really understand what the color blue or red is, we might lack the sense to properly understand conscience. But it might end up being a simple thing to program for a higher being like our First Programmer, in the same way programming “eyes” for a game AI is relatively easy in our own computer simulations. The bottom line anyway, is that we can not imagine the world outside of our universe, no more than a game AI can imagine our “real” world beyond the walls of the computer memory.

- We only see our world through imperfect sensors (our eyes do not see UVs or infrareds, our ears can not hear infra or ultra-sounds, our sense of touch is good for our fingertips but lousy in our back, etc, etc, all our sensors are pretty limited). In the same say a game AI sees the computer world through imperfect sensors like raycasts, sound volumes, collision detection checks, etc. Our limited sensors can not even sample the world inside our universe accurately (our brain does its best to construct something from the limited-accuracy inputs), so they are totally inadequate to figure out the real world of the First Programmer, outside the universe. Similarly, a game AI only “sees” and “hears” and “feels” what its limited sensors have been programmed to feed it with. Those sensors only let the game AI capture a small part of the program they live in. If we would succeed in creating a real AI whose consciousness would emerge, it would first discover the concept of a computer, i.e. the universe beyond their game world. But it still would not be able to imagine our world behind, in the same way our limited sensors can not tell us much about the world of the First Programmer.

- “Rien ne se perd, rien ne se crée, tout se transforme”. This is because memory is limited, really. Atoms or sub-particles are counterparts of bits in the computer memory. When an object is deleted and a different object gets created at the same memory location, it is the same as when our bodies die, decompose, and go back to cosmic dust. Our giant cosmic simulation has a finite amount of resources, in the same way a computer has a finite amount of memory.

- The First Programmer does not intervene in human affairs, does not answer prayers, does not perform miracles, does not spy on each simulated entity, simply because this is not how simulations work. You usually do not mess with a simulation while it is running. You tweak the settings, run the simulation to the end, check the output, adjust the parameters and run another one, until you get the desired results.

- A special note must be written about time, which is a very relative concept. We already know from Einstein, Langevin and others, that time slows down when you go faster (see e.g. the twins paradox, etc). To the limit, when you reach the speed of light, time does not flow anymore, it stops. The photon knows its complete history from birth to death in an instant, etc. For our computer simulations, time passes a lot quicker than for us. A lot of things happen inside a computer simulation in a few nanoseconds, a lot of history. The same might be true for the First Programmer and his simulation, i.e. us. Many centuries for us might pass in the blink of the First Programmer’s eye. This does not favorize interventions or reactions to events happening in our simulated world, simply because it is too quick for the First Programmer to react. For example the whole modern civilization might be simulated in one frame of the First Programmer’s game, so there is not much he can do about punishing sins or rewarding good deeds (if he even cares, after all we are only game AIs in this). The only thing he can do is record the simulation results and analyze them later when the simulation has ended, i.e. after the end of our world and before it is reborn/rebooted for a new iteration.

Static/Bipartite SAP

September 14th, 2011

Another idea for optimizing the SAP when we have a small amount of dynamic objects moving amongst a large amount of static objects:

Put the dynamics in their own vanilla SAP. Get the dynamic-vs-dynamic overlaps from there, nothing new.

Put the dynamics & statics in a special SAP (“Static SAP”, or “Bipartite SAP”) modified to only report dynamic-vs-static overlaps:

  • The statics are stored in the SAP as usual (same box/endpoint structures as before)
  • The dynamics however are stored in a separate array, using a new box structure.
  • When the dynamic objects are inserted in the SAP, their endpoints are not inserted into the same array as for statics. They are stored directly in the new box structure, i.e. in the (dynamic) box class itself. They link to the static arrays of endpoints (their virtual position within those arrays, if they would have been inserted there as usual)
  • When running “updateObject” on those dynamic objects, we move their endpoints virtually instead of for real. That is, we still look for the new sorted positions within the arrays of endpoints, but we don’t actually shift the buffers anymore. The update is a const function for the main SAP data-structure, the only thing that gets modified is the data within the dynamic box we’re updating. In other words we don’t have real “swaps” anymore. No more memmoves, thus reduced number of LHS, etc.

The drawback is that we update dynamic objects twice, once in the “static SAP”, once in the normal SAP.

Didn’t try it, don’t know about the perf, but might be interesting.