October 15th, 2014
“Interestingly enough, on desktop-like platforms, PhysX3 is faster than Box2D”
October 15th, 2014
“Interestingly enough, on desktop-like platforms, PhysX3 is faster than Box2D”
October 14th, 2014
So, yes, PhysX 3 is well optimized (and PhysX 3.4 even more). But I’d like to go back to the old “2.8.4 is crippled” myth, since this is mentioned here again (”PhyX 2.x was garbage because a ton of sites outed nVidia for using x87 and lacking multi-threading on the old CPU code”).
This is not what happened, at all. I’m afraid you’ve been spoon-fed utter bullshit by websites that love an easy dramatic headline. I suppose it makes a good story to reveal nasty things about “big” companies like MS, Google, Apple, Nvidia, whatever. But the reality behind it here is terribly mundane and ordinary. There is no story. There is no conspiracy. There is no evil plan to cripple this or that.
NovodeX (on which PhysX was initially based) was written by Adam and me. The code was super optimized, to the best of my knowledge at the time. But it did not support SIMD or multi-threading. At that time, none of us knew how to write SSE code, and I also had no experience with multi-threading. Also, IIRC SSE2 was not widely supported (only SSE). From our perspective the gains from SSE seemed limited, using SSE2 would have made the SDK incompatible with most of our users’ machines, and we simply didn’t have the time or resources to learn SSE, support SIMD and non-SIMD versions, etc.
Then came Ageia. In the beginning, we had to make the SDK feature-complete before thinking about making it run faster. NovodeX did not even support convex objects! And that’s the first thing I had to implement in the post-NovodeX days. NovodeX was the fusion of two hobby projects from two independent developers. In a number of ways the result was still just that: a hobby project. We loved it, we worked very hard on it, but we basically had no customers and no actual games making actual requests for random features that are actually useful in games. This all changed when the hobby project became PhysX under Ageia. That’s when it became an actual product. An actual middleware. With actual customers actually using the thing in actual games. Especially when it got picked up by Epic and integrated in the Unreal engine. Suddenly we got TONS of bug reports, feedback, feature requests, you name it. Tons of stuff to do. But as far as I can remember nobody asked for “the SSE version”, and so none of us worked on it. There was no time for that, no incentive to worry about it, and we still didn’t have a SIMD expert at the time anyway. We briefly looked at the new compiler options in MSVC (/SSE2, etc) but the gains were minimal, maybe 15 to 20% at best. If you believe that recompiling with such a flag will magically make your code run 4X faster, I am sorry but you are just a clueless fool misguided soul. At the time, with the available compiler, we never saw more than 20% in the very best of case. And most of the time, for actual scenes running in actual games, we saw virtually no gains at all. Enabling the flag would have given us marginal gains, but would have increased our support burden significantly (forcing us to provide both SIMD and non-SIMD versions). It would have been stupid and pointless. Hence, no SIMD in PhysX2. Simple as that.
For proper SIMD gains you need to design the data structures accordingly and think about that stuff from the ground up, not as an afterthought. And this is exactly what we did for PhysX3. After making PhysX2 stable and complete enough, after making it a real, useable, feature-complete middleware, it was finally time to think about optimizations again. It took time to make the software mature enough for this to be even on the roadmap. It took time for us (and for me) to actually learn SIMD and multi-threading. It took time for compilers to catch up (/SSE2 on latest versions of MSVC is way way way better and produces way more efficient code than what it did when we first tried it). It took time for SSE2 support to spread, and be available in all machines (these days we only have a SIMD version - there is no non-SIMD version. It would have been unthinkable before). And still, even after all this happened, a better algorithm, or better data structures or less cache misses, still give you more gains than SIMD. SIMD itself does not guarantee that your code is any good. Any non-SIMD code can kick SIMD code’s ass any day of the week if SIMD code is clueless about everything else. Anybody claiming that PhysX2 is “garbage” because it doesn’t use SIMD is just a ridiculous moron (pardon my French but hey, I’m French) clearly not a programmer worth his salt (or not a programmer at all for that matter).
So there was no crippling. The old version is just that: old. The code I wrote 10 years ago, as fast as it appeared to be at the time, is not a match for the code I write today. Opcode 2 (which will be included in PhysX 3.4) is several times faster than Opcode 1.3, even though that collision library is famous for its performance. It’s the same for PhysX. PhysX 2 was faster than NovodeX/PhysX 1. PhysX 3 is faster than PhysX 2. We learn new tricks. We find new ideas. We simply get more time to try more options and select the best one.
As the guy in the article says, PhysX3 is so fast that it changed his mind about the whole GPU Physics thing. Does that sound like we’re trying to promote GPU Physics by crippling PhysX3? Of course not. And in the same way we did not try to promote Ageia Physics by crippling PhysX2. We were and we are a proud bunch of engineers who love to make things go fast - software or hardware.
EDIT: I forgot something. Contrary to what people also claim, PhysX works just fine on consoles and it is a multi-platforms library. That is, writing SIMD is not as easy as hardcoding a bunch of SSE2 intrinsics in the middle of the code. It has to be properly supported on all platforms, including some that do not like basic things like shuffles, or do not support very useful instructions like movemask. Converting something to SIMD means writing the converted code several times, possibly in different ways, making sure that the SIMD versions are faster than their non-SIMD counterparts on each platform - which is not a given at all. It takes a lot of time and a lot of effort, and gains vary a lot from one platform to the next.
June 5th, 2014
The website has been down for a month. I was busy and didn’t notice. Reason is explained below.
I have no time for this crap so I just removed the file.
I am sorry to inform you that your account has been suspended for creating excessive server load that has affected other customers. The file causing the issue is:
You will either need to remove these files, limit access to them or block some of the IP addresses that are repeatedly accessing these files. Your access logs are located in your stats directory and you can use these to analyse the traffic to your site.
If you find that the traffic is genuine, we recommend moving your website to a Virtual Private Server (VPS).
If the traffic appears to be suspicious, we suggest blocking those suspicious IP’s through an .htaccess file; the ‘.htaccess editor’ tools will allow you do to so. This will allow you to control how much, and what kind of traffic gets to the files in question.
Please let us know the exact steps you have taken to address this issue so that we can then reevaluate the suspension of your account.
March 20th, 2014
A long time ago (around 2001), I helped Tomas Möller optimize a triangle-vs-box, SAT-based overlap test. At that time I was working on the Opcode library, and such a function was needed to achieve my goals.
When working on Opcode 2.0, I wrote a SIMD version of that same function. Since then I have found several other SIMD implementations of that code, and while benchmarking them I found that most of them were surprisingly slow - sometimes slower than the original code! I cannot really explain why, but it seems to indicate that writing the SIMD version is not as straightforward as one could have imagined. And thus, I am now releasing my implementation, in the hope somebody will find it useful.
There are two versions of the original code, labelled “old code” and “new, faster code” online. Despite the claims, on my Windows-based machine the “faster code” is significantly slower. It seems correct and expected that doing the “class III” tests last is the best, but I suppose this depends on both the architecture and the test scenario. In any case both versions (more or less verbatim, not quite) are included in the test benchmark.
The original code contains a lot of early exits, and thus the time it takes depends a lot on what codepath is taken. The best case is when you reach the first early exit. The worst case is when you reach the end of the function, i.e. the triangle and the box overlap. I believe one of the biggest mistakes made in the slow SIMD implementations was to get rid of branches and early exits at all costs (”Hey look! Zero branches! It must be super fast!”). Well of course that is a very naive view, things are way more subtle and complex than that. The fastest code is the one that is never executed, and thus, you should embrace early exits, not avoid them. In fact, my implementation contains one more early exit than the original code, for when the box fully contains the triangle (this was an important case for me in some application).
I have 4 different test scenarios:
- best case
- worst case
- no hit
“Best case” is the aforementioned new early exit, for when the box fully contains the triangle. This is basically a point-in-AABB test, it is the first early exit in the code, and thus the “best case” is when we reach that codepath.
“Worst case” is when the box touches the triangle, but the triangle is not fully inside. In that case we reach the end of the overlap function without taking any early exits. This should be the slowest codepath.
“No hit” is when the triangle does not touch the box, and we reach one of the “regular” early exits (i.e. the ones that also existed in the non-SIMD code). Time depends on what codepath is reached.
“Mixed” is just a mix of the above cases.
Here are typical results on my test PC:
Version1 (Möller’s “old code”):
Time (default): 5072
Time (default): 3773
Time (default): 953
Time (default): 5088
Time (optimized): 246
Time (optimized): 1898
Time (optimized): 329
Time (optimized): 1544
Best case: Speedup: 20.617886
Worst case: Speedup: 1.987882
No hit: Speedup: 2.896657
Mixed: Speedup: 3.295337
Version2 (Möller’s “new, faster code”):
Time (default): 6209
Time (default): 4714
Time (default): 1304
Time (default): 5199
Time (optimized): 237
Time (optimized): 1822
Time (optimized): 316
Time (optimized): 1540
Best case: Speedup: 26.198313
Worst case: Speedup: 2.587267
No hit: Speedup: 4.126582
Mixed: Speedup: 3.375974
So what we see here:
- the “new, faster code” is in fact slower. Thus, ignore “Version2″ from now on.
- the non-SIMD and SIMD version all find the same numbers of hits (as they should). I could have better tests for this but I’ve been using the SIMD version for a while now and it should be quite solid.
- the new, additional early exit gives me a 20X speedup. Not bad for a SIMD version whose theoretical maximum speedup is 4X. In the SIMD code, this is taken care of by this snippet:
const unsigned int MaskI = 0×7fFFffFF;
__m128 cV = _mm_and_ps(v0V, _mm_load1_ps((const float*)&MaskI));
const unsigned int test = _mm_movemask_ps(_mm_sub_ps(cV, extentsV));
That is well worth it if you ask me. Helped enormously in some past projects of mine.
- Speedups otherwise vary between 2X and 3X, which is what you usually expect from a good SIMD implementation.
And that’s about it. This is straightforward but apparently it is easy to write a bad SIMD version, that ends up slower than the original. I suppose some people just write the code, assume it’s good, and do not run benchmarks? I don’t know.
December 13th, 2013
Inspired by my friend John Ratcliff, I just created a bitcoin tip jar for myself. Let’s see what the fuss is all about!
If you used my radix sort code in one of your project…
Or my triangle stripper…
If you enjoyed my posts about optimizing convex-convex SAT tests…
Or the one about precomputed node sorting…
Or the one about restricting “this”…
If you enjoyed playing Konoko Payne…
Or if you enjoyed my posts on the GD-Algorithms list…
Or if you enjoyed some of my demos on ST even…
Or any of the things I did until now…
Then feel free to donate some bitcoins!
The address is:
June 20th, 2013
Yes I’ve closed the comments after a recent spam attack that left me with 750+ comments to moderate in just one day. Sorry, I have no time for this.
May 23rd, 2013
In my previous posts I mentioned some R&D code I wrote for large stable stacks. I can’t talk much about the algorithms involved, for obvious reasons, but I can share a small proof of concept demo.
(Note: don’t let the “CUDA Test” title fool you, it’s all CPU)
May 23rd, 2013
Looks like this blog doesn’t render properly in Internet Explorer. Oh well, no idea why. Use Firefox.
May 13th, 2013
I got a bunch of questions about my last series of blog posts so I thought I’d add a quick note here - at the risk of confusing people even more.
The figures I posted are for the CPU part of PhysX only. This does not concern or affect the GPU parts of PhysX in any way. Those things are orthogonal. If we optimize the CPU parts and get a 10X speedup, it does not mean your GPU will suddenly provide 10X less value, because it is running others parts of PhysX anyway - neither the rigid bodies, nor the raycasts/sweeps.
Only a few features are GPU-accelerated, e.g. cloth or particles, mainly because they are the ones that map well to GPUs, and they are the ones for which the GPUs provide real X factors.
Now as shown in the recent “destruction video” I posted, people here are also working on GPU-optimized rigid bodies. This new module is called “GRB”, and it is currently not part of PhysX. But it does provide a speedup compared to our current CPU solution. In other words, it is still faster than PhysX 3.3 on CPU. You might have a hard time believing it, but people are trying to be fair and honest here. One of our tasks is to optimize the CPU rigid bodies as much as we possibly can, just to make sure that the GPU rigid bodies do provide some actual benefit and speedups. If you don’t do that, you release your GPU solution, it’s slower than a CPU solution, and you look like a fool. Like AGEIA. We are not making that mistake again. The CPU solution is here as a reality check for ourselves. I suppose we could just use Bullet or Havok for this, but… well… we think we can do better
Meanwhile, it is correct that the features that do work on GPU are currently only working on NVIDIA cards, simply because they are implemented using CUDA. There are both obvious political and technical reasons for this. It should be pretty clear that at the end of the day, NVIDIA would like you to choose one of their GPUs. If you are actually complaining about that, then there is little discussion left to have. Of course they want to sell their products, like every other company in the world. And of course they are going to use their own technology, CUDA, to do so. To me this is pretty much the same as what we had in the past with D3D caps. Some cards supported cubemaps, or PN-triangles, or whatever, and some didn’t. GPU PhysX is the same. It’s just an extra cap supported by some cards, and not by other. Complaining about this is silly to me. It would be like complaining that ATI didn’t make any effort to make PN-triangles work on NVIDIA cards. Seriously, what?
The deal is simple. NVIDIA gives you a free, efficient, robust physics engine. In exchange, if possible, add some extra GPU effects to give people an incentive to buy NVIDIA cards. Fair enough, right? I don’t see what the fuss is all about.
Anyway, the usual disclaimer applies here: I’m not a spokesperson for NVIDIA, what I write are my own thoughts about it, and for all I know I may be completely wrong about their intentions. What I know for a fact though, is that most of the stuff I read online about PhysX is just completely insane wrong.
I’ve been optimizing rigid body simulation in NovodeX/PhysX for a long time now, and there’s no big conspiracy behind it. Again, all those engines are free and publicly available so I invite you to run your own experiments, do your own benchmarks, and see for yourselves. We really have nothing to hide.
May 11th, 2013
Part2: Rigid bodies (box stacks)
Part3: Rigid bodies (convex stacks)
Part4: Rigid bodies (piles)
Part5: Rigid bodies (compounds)
Part6: Rigid bodies (meshes)
Part8: Raycasts vs simple shapes
Part9: Raycasts vs meshes
Part10: Sweep tests
Part11: More sweep tests
Part12: Final test and conclusion