SIMD box-triangle overlap function

March 20th, 2014

A long time ago (around 2001), I helped Tomas Möller optimize a triangle-vs-box, SAT-based overlap test. At that time I was working on the Opcode library, and such a function was needed to achieve my goals.

When working on Opcode 2.0, I wrote a SIMD version of that same function. Since then I have found several other SIMD implementations of that code, and while benchmarking them I found that most of them were surprisingly slow - sometimes slower than the original code! I cannot really explain why, but it seems to indicate that writing the SIMD version is not as straightforward as one could have imagined. And thus, I am now releasing my implementation, in the hope somebody will find it useful.

There are two versions of the original code, labelled “old code” and “new, faster code” online. Despite the claims, on my Windows-based machine the “faster code” is significantly slower. It seems correct and expected that doing the “class III” tests last is best, but I suppose this depends on both the architecture and the test scenario. In any case, both versions (more or less verbatim) are included in the test benchmark.
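For reference, the “class III” tests are the nine separating-axis tests built from cross products of the box axes and the triangle edges - the most expensive part of the SAT test. A hedged scalar sketch of one of them, in the spirit of Möller's original AXISTEST macros (the names and parameter layout are mine, not the actual code):

```cpp
// One "class III" separating-axis test (cross product of the box X axis and a
// triangle edge). a/b are the edge components, fa/fb their absolute values,
// v0/v2 the two triangle vertices that project to the extreme values on this
// axis, and extents the box half-extents (box assumed centered at the origin).
static inline bool AxisTestX01(float a, float b, float fa, float fb,
                               const float v0[3], const float v2[3],
                               const float extents[3])
{
    const float p0 = a * v0[1] - b * v0[2];    // projection of v0 on the axis
    const float p2 = a * v2[1] - b * v2[2];    // projection of v2 on the axis
    const float minP = p0 < p2 ? p0 : p2;
    const float maxP = p0 < p2 ? p2 : p0;
    const float rad = fa * extents[1] + fb * extents[2];  // projected box radius
    // A separating axis was found => no overlap, the caller can early-exit.
    return minP > rad || maxP < -rad;
}
```

The original macros early-exit out of the whole overlap function as soon as one such axis separates the shapes.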

The original code contains a lot of early exits, and thus the time it takes depends a lot on which codepath is taken. The best case is when you reach the first early exit. The worst case is when you reach the end of the function, i.e. the triangle and the box overlap. I believe one of the biggest mistakes made in the slow SIMD implementations was to get rid of branches and early exits at all costs (“Hey look! Zero branches! It must be super fast!”). Of course that is a very naive view; things are way more subtle and complex than that. The fastest code is the one that is never executed, and thus you should embrace early exits, not avoid them. In fact, my implementation contains one more early exit than the original code, for when the box fully contains the triangle (this was an important case for me in some applications).

I have 4 different test scenarios:
- best case
- worst case
- no hit
- mixed

“Best case” is the aforementioned new early exit, for when the box fully contains the triangle. This is basically a point-in-AABB test, it is the first early exit in the code, and thus the “best case” is when we reach that codepath.
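In scalar form, that extra early exit boils down to three point-in-AABB tests on the triangle vertices, expressed in the box's local frame. A minimal sketch (the `Point` type and the names are illustrative, not the Opcode API):

```cpp
#include <cmath>

struct Point { float x, y, z; };

// A point is inside a centered AABB if each |coordinate| is within the extents.
static inline bool PointInAABB(const Point& p, const Point& extents)
{
    return std::fabs(p.x) <= extents.x
        && std::fabs(p.y) <= extents.y
        && std::fabs(p.z) <= extents.z;
}

// First early exit in the overlap function: if the box contains all three
// vertices, it fully contains the triangle and we can return "overlap" at once.
static inline bool BoxContainsTriangle(const Point& v0, const Point& v1,
                                       const Point& v2, const Point& extents)
{
    return PointInAABB(v0, extents)
        && PointInAABB(v1, extents)
        && PointInAABB(v2, extents);
}
```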

“Worst case” is when the box touches the triangle, but the triangle is not fully inside. In that case we reach the end of the overlap function without taking any early exits. This should be the slowest codepath.

“No hit” is when the triangle does not touch the box, and we reach one of the “regular” early exits (i.e. the ones that also existed in the non-SIMD code). Time depends on what codepath is reached.

“Mixed” is just a mix of the above cases.

Here are typical results on my test PC:

Version1 (Möller’s “old code”):

Time (default): 5072
Time (default): 3773
Time (default): 953
Time (default): 5088
Time (optimized): 246
Time (optimized): 1898
Time (optimized): 329
Time (optimized): 1544
Hits0/Hits1: 16384/16384
Hits0/Hits1: 16384/16384
Hits0/Hits1: 0/0
Hits0/Hits1: 12420/12420
Best case:  Speedup: 20.617886
Worst case: Speedup: 1.987882
No hit:     Speedup: 2.896657
Mixed:      Speedup: 3.295337

Version2 (Möller’s “new, faster code”):

Time (default): 6209
Time (default): 4714
Time (default): 1304
Time (default): 5199
Time (optimized): 237
Time (optimized): 1822
Time (optimized): 316
Time (optimized): 1540
Hits0/Hits1: 16384/16384
Hits0/Hits1: 16384/16384
Hits0/Hits1: 0/0
Hits0/Hits1: 12420/12420
Best case:  Speedup: 26.198313
Worst case: Speedup: 2.587267
No hit:     Speedup: 4.126582
Mixed:      Speedup: 3.375974

So what we see here:

- the “new, faster code” is in fact slower. Thus, ignore “Version2” from now on.

- the non-SIMD and SIMD versions all find the same number of hits (as they should). I could have better tests for this, but I’ve been using the SIMD version for a while now and it should be quite solid.

- the new, additional early exit gives me a 20X speedup. Not bad for a SIMD version whose theoretical maximum speedup is 4X. In the SIMD code, this is taken care of by this snippet:

const unsigned int MaskI = 0x7fFFffFF;
__m128 cV = _mm_and_ps(v0V, _mm_load1_ps((const float*)&MaskI));	// clear the sign bits to get |v0|
const unsigned int test = _mm_movemask_ps(_mm_sub_ps(cV, extentsV));	// sign bits of |v0| - extents
if((test&7)==7)	// x, y, z all negative => the vertex is inside the box
	return 1;

That is well worth it if you ask me. Helped enormously in some past projects of mine.

- Speedups otherwise vary between 2X and 3X, which is what you usually expect from a good SIMD implementation.

And that’s about it. This is straightforward stuff, but apparently it is easy to write a bad SIMD version that ends up slower than the original. I suppose some people just write the code, assume it’s good, and never run benchmarks? I don’t know.

Bitcoin tip jar is here if you end up using that code. Project files are here.

Bitcoin tip jar

December 13th, 2013

Inspired by my friend John Ratcliff, I just created a bitcoin tip jar for myself. Let’s see what the fuss is all about!

If you used my radix sort code in one of your projects…

Or my triangle stripper…

Or Flexporter…

Or Opcode…

If you enjoyed my posts about optimizing convex-convex SAT tests…

Or the one about precomputed node sorting…

Or the one about restricting “this”…

If you enjoyed playing Konoko Payne…

Or if you enjoyed my posts on the GD-Algorithms list…

Or if you enjoyed some of my demos on ST even…

Or any of the things I did until now…

Then feel free to donate some bitcoins! :)

The address is:


Closed comments

June 20th, 2013


Yes I’ve closed the comments after a recent spam attack that left me with 750+ comments to moderate in just one day. Sorry, I have no time for this.

Mega stacks

May 23rd, 2013

In my previous posts I mentioned some R&D code I wrote for large stable stacks. I can’t talk much about the algorithms involved, for obvious reasons, but I can share a small proof of concept demo.

Click here to download.

(Note: don’t let the “CUDA Test” title fool you, it’s all CPU)


May 23rd, 2013

Looks like this blog doesn’t render properly in Internet Explorer. Oh well, no idea why. Use Firefox.

The evolution of PhysX - Addendum

May 13th, 2013

I got a bunch of questions about my last series of blog posts so I thought I’d add a quick note here - at the risk of confusing people even more.

The figures I posted are for the CPU part of PhysX only. This does not concern or affect the GPU parts of PhysX in any way. Those things are orthogonal. If we optimize the CPU parts and get a 10X speedup, it does not mean your GPU will suddenly provide 10X less value, because the GPU is running other parts of PhysX anyway - not the rigid bodies, and not the raycasts/sweeps.

Only a few features are GPU-accelerated, e.g. cloth or particles, mainly because they are the ones that map well to GPUs, and they are the ones for which the GPUs provide real X factors.

Now as shown in the recent “destruction video” I posted, people here are also working on GPU-optimized rigid bodies. This new module is called “GRB”, and it is currently not part of PhysX. But it does provide a speedup compared to our current CPU solution. In other words, it is still faster than PhysX 3.3 on CPU. You might have a hard time believing it, but people are trying to be fair and honest here. One of our tasks is to optimize the CPU rigid bodies as much as we possibly can, just to make sure that the GPU rigid bodies do provide some actual benefit and speedups. If you don’t do that, you release your GPU solution, it’s slower than a CPU solution, and you look like a fool. Like AGEIA. We are not making that mistake again. The CPU solution is here as a reality check for ourselves. I suppose we could just use Bullet or Havok for this, but… well… we think we can do better :)

Meanwhile, it is correct that the features that do work on GPU are currently only working on NVIDIA cards, simply because they are implemented using CUDA. There are both obvious political and technical reasons for this. It should be pretty clear that at the end of the day, NVIDIA would like you to choose one of their GPUs. If you are actually complaining about that, then there is little discussion left to have. Of course they want to sell their products, like every other company in the world. And of course they are going to use their own technology, CUDA, to do so. To me this is pretty much the same as what we had in the past with D3D caps. Some cards supported cubemaps, or PN-triangles, or whatever, and some didn’t. GPU PhysX is the same. It’s just an extra cap supported by some cards, and not by others. Complaining about this is silly to me. It would be like complaining that ATI didn’t make any effort to make PN-triangles work on NVIDIA cards. Seriously, what?

The deal is simple. NVIDIA gives you a free, efficient, robust physics engine. In exchange, if possible, add some extra GPU effects to give people an incentive to buy NVIDIA cards. Fair enough, right? I don’t see what the fuss is all about.


Anyway, the usual disclaimer applies here: I’m not a spokesperson for NVIDIA, what I write are my own thoughts about it, and for all I know I may be completely wrong about their intentions. What I know for a fact though, is that most of the stuff I read online about PhysX is just completely, insanely wrong.

I’ve been optimizing rigid body simulation in NovodeX/PhysX for a long time now, and there’s no big conspiracy behind it. Again, all those engines are free and publicly available so I invite you to run your own experiments, do your own benchmarks, and see for yourselves. We really have nothing to hide.

The evolution of PhysX - Index

May 11th, 2013

Part1: PEEL

Part2: Rigid bodies (box stacks)

Part3: Rigid bodies (convex stacks)

Part4: Rigid bodies (piles)

Part5: Rigid bodies (compounds)

Part6: Rigid bodies (meshes)

Part7: Joints

Part8: Raycasts vs simple shapes

Part9: Raycasts vs meshes

Part10: Sweep tests

Part11: More sweep tests

Part12: Final test and conclusion


The evolution of PhysX (12/12) - Final test and conclusion

May 11th, 2013

This last test (“SeaOfStaticBoxes3”) didn’t fit in any category, so I kept it for the end. It does not actually do anything, it only creates a massive amount of static boxes (255*255). There are otherwise no dynamic objects in the scene, nothing to simulate at all.

There are 2 things to notice here. The first one is memory usage. We went from 94Mb to 54Mb to 39Mb, from one PhysX version to another. That’s the right direction here!

The other thing is the time it takes to simulate an empty scene. As you can see, it doesn’t take much time with PhysX – as it should. Bullet however, for some reason, takes a massive amount of time here. On my home PC, Bullet takes about 34ms (!?) to simulate this empty scene. I double-checked everything, looking for something I did wrong, but it turns out the problem has already been reported on the Bullet forums. I think this should be fixed and work out-of-the-box.

In any case, I am not here to fix Bullet issues. The point is simply, again, that contrary to what people may still believe, PhysX is actually very optimized and a perfectly fine CPU physics engine. In fact, if some competitors would not prevent me from publishing the results, I would happily show you that it often beats everybody else. I invite curious readers to create their own benchmarks and see for themselves.


This report is currently incomplete: it did not talk about CCD, or multithreaded simulations, or overlap tests. It also barely scratched the surface of what these physics engines have to offer: I did not talk about character controllers, vehicles, the performance of binary deserialization, or any of those more advanced topics.

As you can easily imagine, I’d need to write a book to cover all this, not just some blog posts.

However I think these posts may reach their limited goal, which is simply to show with very basic tests that:

  • PhysX is getting faster all the time
  • PhysX is very well optimized, thank you very much

The evolution of PhysX (11/12) - More sweep tests

May 11th, 2013

Now that we’re done with the simple stuff, let’s jump to the next level: meshes.

The first mesh scene (“SceneBoxSweepVsStaticMeshes_Archipelago”) performs 1K vertical box sweeps against the mesh level we previously used.

There are no terribly big surprises here. All engines report the same number of hits. Each PhysX version is faster than the one before. They all perform adequately.

PhysX 3.3 is only about 2X faster than PhysX 2.8.4, so things were generally ok there.

Bullet suffers a bit, being a bit more than 6X slower than PhysX 3.3 on average.


For the next tests we switch back to the Konoko Payne level, which is much more complex and thus, maybe, closer to what a modern game would have to deal with. We start with sphere-sweeps against it (“SceneSphereSweepVsStaticMeshes_KP”).

Results are quite similar to what we got for the Archipelago scene: PhysX gets faster with each new version, 3.3 is a bit more than 2X faster compared to 2.8.4.

Bullet suffers again, being now a bit more than 8X slower than PhysX 3.3 on average.


The same test using box-sweeps (“SceneBoxSweepVsStaticMeshes_KP”) reveals the same kind of results again.

This time PhysX 3.3 is about 3.3X faster than 2.8.4.

Remarkably, PhysX 3.3 is more than an order of magnitude faster than Bullet here (about 11.6X)!


Finally, the same test using capsule-sweeps (“SceneCapsuleSweepVsStaticMeshes_KP”) confirms this trend.

PhysX 3.3 is a bit more than 2X faster than 2.8.4, once again.

As for Bullet, it is now about 15X slower than PhysX 3.3. I feel a bit sorry for Bullet but I think this proves my point: people who claim PhysX is not optimized for CPU should pay attention here.


Now the KP scene, as you may remember from the raycasts section, has a lot of meshes in the scene but each mesh is kind of small. We will now test the opposite, the Bunny scene, which was a scene with just one highly-tessellated mesh.

And since we are here to stress test the engines, we are going to run several long radial sweeps against it.

And using large shapes to boot.

This is not supposed to be friendly. This is actually designed to break your engines.

So, what do we get?


Let’s start with the spheres (“SceneSphereSweepVsStaticMeshes_TessBunny”).

Note that we only use 64 sweeps here, but it takes about the same time as (or more than) performing 16K radial raycasts against the same mesh… Yup, such is the evil nature of this stress test.

It is so evil that, for the first time, there really isn’t much of a difference between the PhysX versions. We didn’t progress much here.

Nonetheless, PhysX is an order of magnitude faster than Bullet here. Again, contrary to what naysayers may say, the CPU version of PhysX is actually very fast.

In fact, stay tuned.


Switch to capsule sweeps (“SceneCapsuleSweepVsStaticMeshes_TessBunny”). Again, only 64 sweeps here.

There is virtually no difference between PhysX 3.2 and 3.3, but they are both measurably faster than 2.8.4. Not by much, but it’s better than no gain at all, as in the previous test.

Now the interesting figure, of course, is the performance ratio compared to Bullet. Yep, PhysX is about 25X faster here. I can’t explain it, and it’s not my job. I’m only providing a reality check for a few people here.


Ok, this is starting to be embarrassing for Bullet so let’s be fair and show that we also blew it in the past. Kind of.

Switch to box sweeps (“SceneBoxSweepVsStaticMeshes_TessBunny_Test1”). And use fairly large boxes because it’s more fun. Again, we only need 64 innocent sweep tests to produce this massacre.

Look at that! Woah.

“Each PhysX version is faster than the one before”, hell yeah! PhysX 3.3 is about 4X faster than PhysX 3.2, and about 31X faster (!) than PhysX 2.8.4. Again, if you are still using 2.8.4, do yourself a favor and upgrade.

As for Bullet… what can I say? PhysX 3.3 is 123X faster than Bullet in this test. There. That’s 2 orders of magnitude. This is dedicated to all people still thinking that PhysX is “not optimized for the CPU”, or “crippled”, or any other nonsense.

The final test (“SceneLongBoxSweepVsSeaOfStatics”) has a pretty explicit name. It’s just one box sweep. The sweep is very large, going from one side of the world to the other, in diagonal. The world is massive, made of thousands of statics. So that’s pretty much the worst case you can get.

The results are quite revealing. And yes, the numbers are correct.

For this single sweep test, PhysX 3.3 is:

  • 60X faster than PhysX 3.2
  • 270X faster than PhysX 2.8.4
  • 317X faster than Bullet

Spectacular, isn’t it?

The evolution of PhysX (10/12) - Sweep tests

May 11th, 2013

Another important feature is sweep tests, a.k.a. “linear casts” or “shape casts”. This can be used, for example, to implement character controllers. What I previously wrote for raycasts is even more valid for sweeps: a lot of things depend on whether you go for a generic GJK-based sweep, or for dedicated shape-vs-shape sweep functions. And since sweep functions are usually much more difficult to write than raycast functions, this is the part of physics engines where things sometimes go spectacularly wrong.

There are even more potential things to test here, than for raycasts. Even when you go for customized raycast functions, you get away with N of them (e.g. for N supported shapes you need to write N raycast functions). For sweeps however, you can decide to sweep any shape against any other, so you suddenly need to write N*(N-1)/2 functions, and each of them is a lot harder than just a raycast. This is why most engines just go for a generic GJK-based sweep function, which can be used as-is for all cases. The downside of this is that the generic code, as often, may not be as fast as a customized function.
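To make the trade-off concrete, here is a hypothetical dispatch sketch - none of these names come from PhysX or Bullet - where dedicated pair functions are plugged into a table on top of a generic GJK-based fallback:

```cpp
// Hypothetical sweep dispatch: every shape pair defaults to a generic
// GJK-based sweep, and dedicated (faster) functions are registered for the
// pairs where someone took the time to write them. Purely illustrative.
enum ShapeType { SHAPE_SPHERE, SHAPE_BOX, SHAPE_CAPSULE, SHAPE_CONVEX, SHAPE_COUNT };

struct Shape { ShapeType type; /* geometry data elided */ };

typedef bool (*SweepFunc)(const Shape&, const Shape&);

// Universal but typically slower path (stubbed out here).
static bool GenericGJKSweep(const Shape&, const Shape&) { return false; }
// Example of a dedicated, hand-optimized pair function (stubbed out here).
static bool SweepSphereBox(const Shape&, const Shape&)  { return false; }

static SweepFunc gSweepTable[SHAPE_COUNT][SHAPE_COUNT];

static void InitSweepTable()
{
    // Default every pair to the generic fallback...
    for(int i = 0; i < SHAPE_COUNT; i++)
        for(int j = 0; j < SHAPE_COUNT; j++)
            gSweepTable[i][j] = GenericGJKSweep;
    // ...then plug in dedicated functions where they exist (both orderings).
    gSweepTable[SHAPE_SPHERE][SHAPE_BOX] = SweepSphereBox;
    gSweepTable[SHAPE_BOX][SHAPE_SPHERE] = SweepSphereBox;
}

static bool Sweep(const Shape& a, const Shape& b)
{
    return gSweepTable[a.type][b.type](a, b);
}
```

The engines that go all-generic only ever fill the table with one function; the price is that every pair pays the generic cost.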

But enough theory: let’s see what we get. I will only focus on a small set of selected tests, rather than tediously listing the results for all N*(N-1)/2 cases.


Our first scene (“SceneBoxSweepVsStaticSpheres”) uses the same array of spheres as our initial raycast test. But instead of doing a raycast against each of them, we do a box sweep. We are again using the scene-level API functions to do so.

The boxes are slowly rotating over time, to make sure that we hit all codepaths from the sweep functions, and just to make sure that things don’t suddenly break for some angle, for some reason.

The results are a bit different than usual. Each PhysX version is faster than the one before… kind of. This time there isn’t much of a difference between 2.8.4 and 3.2. It looks like 3.2 is a wee bit faster, but frankly that might be just noise at this point.

Bullet looks fine in this test, very similar to PhysX 2.8.4 and 3.2, although slightly slower. The profile curve also isn’t as smooth as for PhysX but overall it’s quite comparable.

The clear winner in this test though, is PhysX 3.3. It is about 3X faster than both Bullet and the previous PhysX versions. Nice.


Let’s stay with box-sweeps, but let’s try against static boxes now (“SceneBoxSweepVsStaticBoxes”).

Things are a bit different this time: Bullet is faster than 2.8.4 in that test. On the other hand it is slower than PhysX 3.x.

Each PhysX version is faster than the one before – checked. In fact, on average, each PhysX version is 2X faster than the one before. That’s quite something!


Ok we did box-sweeps vs spheres and boxes, let’s see what happens against capsules in the next test (“SceneBoxSweepVsStaticCapsules”).

PhysX 3.x is significantly faster than PhysX 2.8.4 – with again each PhysX version being faster than the one before. In terms of relative performance there’s a remarkable 6.7X speedup on average between 2.8.4 and 3.3. Nice.

Bullet has about the same performance as 2.8.4, with again a much more chaotic curve. But PhysX 3.3 is about 8X faster than Bullet on average here.


Now, you guessed it: after spheres, boxes and capsules, and for the sake of completeness, let’s have a look at box-sweeps against convex objects (“SceneBoxSweepVsStaticConvexes”).

Wo-ah. Now that is a surprise! See, that’s what I was talking about: sometimes things can go spectacularly wrong with sweeps, and you really need a strong test suite to make sure you catch all those problems.

Clearly the 2.8.4 code had a problem here. I can’t remember what it was, but clearly, if you are still using 2.8.4 and you do a lot of box-sweeps against convexes, you seriously need to upgrade. On average, PhysX 3.3 is 34X faster than 2.8.4 in this case. Now that is some serious speedup.

Other than that, each PhysX version is faster than the one before (good), and contrary to what we saw just in the previous test, Bullet remains quite competitive here, even against PhysX 3.3.

So, that’s sweep tests. You never know until you actually profile.


Now, we could repeat this exercise for sphere-sweeps and capsule-sweeps against all the other shapes as we just did for boxes, but the results are very similar. The only surprise comes again from the capsule-vs-convex case (“SceneCapsuleSweepVsStaticConvexes”).

In short, this is the same as for box-vs-convex: 2.8.4 was pretty bad here. On average, PhysX 3.3 is 17X faster than 2.8.4 in this case.

Bullet’s profile curve is weird. I can’t explain why it suddenly becomes faster towards the end. It is probably related to the fact that the capsules are rotating, and the algorithm may switch from one case to another, but I don’t know Bullet’s internals well enough to tell.