Random PhysX stuff

http://atforums.mobi/msg.php?threadid=2403672&catid=8&rnum=25

So, yes, PhysX 3 is well optimized (and PhysX 3.4 even more so). But I’d like to go back to the old “2.8.4 is crippled” myth, since it is mentioned here again (“PhysX 2.x was garbage because a ton of sites outed nVidia for using x87 and lacking multi-threading on the old CPU code”).

This is not what happened, at all. I’m afraid you’ve been spoon-fed utter bullshit by websites that love an easy dramatic headline. I suppose it makes a good story to reveal nasty things about “big” companies like MS, Google, Apple, Nvidia, whatever. But the reality behind it here is terribly mundane and ordinary. There is no story. There is no conspiracy. There is no evil plan to cripple this or that.

NovodeX (on which PhysX was initially based) was written by Adam and me. The code was super optimized, to the best of my knowledge at the time. But it did not support SIMD or multi-threading. At that time, none of us knew how to write SSE code, and I also had no experience with multi-threading. Also, IIRC SSE2 was not widely supported (only SSE). From our perspective the gains from SSE seemed limited, using SSE2 would have made the SDK incompatible with most of our users’ machines, and we simply didn’t have the time or resources to learn SSE, support SIMD and non-SIMD versions, etc.

Then came Ageia. In the beginning, we had to make the SDK feature-complete before thinking about making it run faster. NovodeX did not even support convex objects! And that’s the first thing I had to implement in the post-NovodeX days. NovodeX was the fusion of two hobby projects from two independent developers. In a number of ways the result was still just that: a hobby project. We loved it, we worked very hard on it, but we basically had no customers and no actual games making actual requests for features that are actually useful in games. This all changed when the hobby project became PhysX under Ageia. That’s when it became an actual product. An actual middleware. With actual customers actually using the thing in actual games. Especially when it got picked up by Epic and integrated in the Unreal engine. Suddenly we got TONS of bug reports, feedback, feature requests, you name it. Tons of stuff to do. But as far as I can remember nobody asked for “the SSE version”, and so none of us worked on it. There was no time for that, no incentive to worry about it, and we still didn’t have a SIMD expert at the time anyway. We briefly looked at the new compiler options in MSVC (/arch:SSE2, etc) but the gains were minimal, maybe 15 to 20% at best. If you believe that recompiling with such a flag will magically make your code run 4X faster, I am sorry but you are just a clueless fool, or at the very least a misguided soul. At the time, with the available compiler, we never saw more than 20% in the very best of cases. And most of the time, for actual scenes running in actual games, we saw virtually no gains at all. Enabling the flag would have given us marginal gains, but would have increased our support burden significantly (forcing us to provide both SIMD and non-SIMD versions). It would have been stupid and pointless. Hence, no SIMD in PhysX2. Simple as that.

For proper SIMD gains you need to design the data structures accordingly and think about that stuff from the ground up, not as an afterthought. And this is exactly what we did for PhysX3. After making PhysX2 stable and complete enough, after making it a real, usable, feature-complete middleware, it was finally time to think about optimizations again. It took time to make the software mature enough for this to be even on the roadmap. It took time for us (and for me) to actually learn SIMD and multi-threading. It took time for compilers to catch up (/arch:SSE2 on latest versions of MSVC is way way way better and produces way more efficient code than it did when we first tried it). It took time for SSE2 support to spread and be available in all machines (these days we only have a SIMD version - there is no non-SIMD version. It would have been unthinkable before). And still, even after all this happened, a better algorithm, better data structures or fewer cache misses still give you more gains than SIMD. SIMD itself does not guarantee that your code is any good. Any non-SIMD code can kick SIMD code’s ass any day of the week if the SIMD code is clueless about everything else. Anybody claiming that PhysX2 is “garbage” because it doesn’t use SIMD is just a ridiculous moron (pardon my French but hey, I’m French), or at the very least clearly not a programmer worth his salt (or not a programmer at all for that matter).
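To make the “design the data structures from the ground up” point a bit more concrete, here is a minimal, made-up sketch (not actual PhysX code) of what that means in practice. With the usual array-of-structures layout the compiler mostly generates scalar SSE instructions even with /arch:SSE2 enabled; to get real SIMD gains you reorganize the data as structure-of-arrays and process four objects per instruction:

    #include <xmmintrin.h>  // SSE intrinsics

    // Array-of-structures: x/y/z of one particle sit together in memory,
    // so this loop stays essentially scalar, flag or no flag.
    struct ParticleAoS { float x, y, z, pad; };

    void integrateAoS(ParticleAoS* pos, const ParticleAoS* vel, float dt, int count)
    {
        for(int i=0; i<count; i++)
        {
            pos[i].x += vel[i].x * dt;
            pos[i].y += vel[i].y * dt;
            pos[i].z += vel[i].z * dt;
        }
    }

    // Structure-of-arrays: all x together, all y together, all z together.
    // One SSE instruction now updates four particles at a time.
    struct ParticlesSoA { float* x; float* y; float* z; };

    void integrateSoA(ParticlesSoA& pos, const ParticlesSoA& vel, float dt, int count)
    {
        const __m128 dt4 = _mm_set1_ps(dt);
        for(int i=0; i<count; i+=4)
        {
            _mm_store_ps(pos.x+i, _mm_add_ps(_mm_load_ps(pos.x+i), _mm_mul_ps(_mm_load_ps(vel.x+i), dt4)));
            _mm_store_ps(pos.y+i, _mm_add_ps(_mm_load_ps(pos.y+i), _mm_mul_ps(_mm_load_ps(vel.y+i), dt4)));
            _mm_store_ps(pos.z+i, _mm_add_ps(_mm_load_ps(pos.z+i), _mm_mul_ps(_mm_load_ps(vel.z+i), dt4)));
        }
    }

The SoA version assumes 16-byte aligned arrays and a count that is a multiple of 4 - exactly the kind of constraint you have to bake into your data structures from day one, which is why this is a redesign rather than a compile flag.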

So there was no crippling. The old version is just that: old. The code I wrote 10 years ago, as fast as it appeared to be at the time, is no match for the code I write today. Opcode 2 (which will be included in PhysX 3.4) is several times faster than Opcode 1.3, even though that collision library is famous for its performance. It’s the same for PhysX. PhysX 2 was faster than NovodeX/PhysX 1. PhysX 3 is faster than PhysX 2. We learn new tricks. We find new ideas. We simply get more time to try more options and select the best one.

As the guy in the article says, PhysX3 is so fast that it changed his mind about the whole GPU Physics thing. Does that sound like we’re trying to promote GPU Physics by crippling PhysX3? Of course not. And in the same way we did not try to promote Ageia Physics by crippling PhysX2. We were and we are a proud bunch of engineers who love to make things go fast - software or hardware.

EDIT: I forgot something. Contrary to what people also claim, PhysX works just fine on consoles and it is a multi-platform library. That means writing SIMD is not as easy as hardcoding a bunch of SSE2 intrinsics in the middle of the code. It has to be properly supported on all platforms, including some that do not like basic things like shuffles, or do not support very useful instructions like movemask. Converting something to SIMD means writing the converted code several times, possibly in different ways, making sure that the SIMD versions are faster than their non-SIMD counterparts on each platform - which is not a given at all. It takes a lot of time and a lot of effort, and gains vary a lot from one platform to the next.
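For a rough idea of what that looks like in code, here is a simplified, hypothetical sketch (not the actual SDK abstraction layer): a cross-platform SIMD layer hides the platform differences behind small wrappers, and on a platform without a movemask-style instruction the same operation has to be emulated lane by lane - which may well end up slower than the scalar code it was meant to replace.

    #ifdef USE_SSE  // hypothetical platform switch
        #include <xmmintrin.h>
        typedef __m128 Vec4;
        inline Vec4 V4Add(Vec4 a, Vec4 b)   { return _mm_add_ps(a, b); }
        // Native movemask: one bit per lane, taken from the sign bits.
        inline int  V4SignMask(Vec4 a)      { return _mm_movemask_ps(a); }
    #else
        // Platform without movemask: emulate it lane by lane
        // (approximation: tests for negative values rather than raw sign bits).
        struct Vec4 { float x, y, z, w; };
        inline Vec4 V4Add(Vec4 a, Vec4 b)
        {
            Vec4 r = { a.x+b.x, a.y+b.y, a.z+b.z, a.w+b.w };
            return r;
        }
        inline int V4SignMask(Vec4 a)
        {
            return (a.x<0.0f ? 1:0) | (a.y<0.0f ? 2:0) | (a.z<0.0f ? 4:0) | (a.w<0.0f ? 8:0);
        }
    #endif

Multiply that by every vector operation the SDK uses, and by every supported platform, and you get an idea of why “just use SIMD” is not a one-afternoon job.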

10 Responses to “Random PhysX stuff”



  1. Zogrim Says:

    Nice insights, thanks)
    And don’t take it too personally - most users don’t understand all that PhysX engine vs PhysX effects stuff.

  2. js- Says:

    You should worry less about what people say in comments on news sites. Back at that time (2006-2007), PhysX represented new hardware in the NVidia vs ATI war. It was new, so perception was negative. The confusion between the middleware and the hardware was big.

    On top of that, gaming communities are young and there are some very immature individuals. See, some comments were just emotional, some were just intentional bullshit to stoke fear and doubt, some were just trolling. That’s the internet for you.

    So, the guy arguing about PhysX CPU optimization in the comments is usually someone with very little technical knowledge; he might know some programming, but I highly doubt that person has ever had access to the middleware source.

    From industry professionals, I have heard a lot of pros and cons, but never anyone arguing about SIMD optimizations.

  3. oscarbg Says:

    Hi,
    nice post.. it comes just as I was asking myself lots of questions..
    First, related to your post: I hope you have been planning for wider SIMD widths, as AVX1 has been shipping for almost four years now.
    I am assuming you only need FP vector instructions, but if not, AVX2 has been here for more than a year.. and AVX-512, if not delayed, could be in consumer CPUs in a year or two.. so the question is whether we should wait for another big redesign of the engine, say PhysX 4, for 256-bit AVX and AVX-512 support, or whether PhysX 3 as designed is already scalable to these widths.
    Also, sorry to ask other questions here, but you seem to be one of the original coders, so here they go (I asked Zogrim from PhysXInfo some of these questions a few days ago):
    1) it seems PhysX for Linux (v3.3.2) now supports GPU acceleration in addition to Windows, which is great news..
    Is it fair to assume the next step is to expose GPU PhysX on Android targets, for Tegra K1 devices like the Shield tablet and Nexus 9?
    To me the Tegra K1 seems to have GPU power similar to the cards where GPU PhysX acceleration first debuted (8800, GTX 280), so the question makes sense..
    Is there interest from Nvidia in exposing PhysX GPU acceleration on Android?
    2) can we expect the next APEX release (1.3.2?), based on PhysX 3.3.2, to also support GPU acceleration on Linux? Or is GPU acceleration on Linux only for PhysX, not APEX?
    3) can we expect PhysX ARM64 libraries for Android? Is the PhysX source code ready to be compiled as an Android ARM64 library for the new Nexus 9 devices with the 64-bit Tegra K1, so the new Tegra Android Development Pack can ship with it? It would also be good for possible UE4 ARM64 support..
    Or, combining questions 1) and 3): can we expect GPU-accelerated PhysX ARM64 Android libraries?
    4) since GPU-accelerated PhysX uses CUDA and CUDA is also supported on Mac, can we expect PhysX GPU acceleration to come to Mac OS, so that all three major desktop OSes (Win, Linux, Mac) have GPU support?
    5) I hope you can ship a beta of the much-awaited PhysX 3.4.0 with FLEX ASAP.. GDC 2015 would be good..

    ah, and one more thing: it seems a WinRT device with Tegra K1 is also coming this year (Surface 3), so since you also ship PhysX for WinRT, I hope Nvidia exposes CUDA on WinRT too and PhysX for WinRT becomes GPU accelerated..

    as you can see from all these questions, my wish is that PhysX becomes GPU accelerated on all the OSes (Win, Linux, Mac, Android, WinRT) that ship with Nvidia CUDA-enabled devices (currently Tegra K1)..

  4. admin Says:

    Only the cloth code uses AVX at the moment. There are separate AVX and non-AVX codepaths.
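    (Roughly how such a dual codepath is handled, as a simplified, made-up sketch rather than the actual SDK mechanism: detect AVX support once at runtime, then route to the right implementation. The function names below are invented.)

        #ifdef _MSC_VER
            #include <intrin.h>      // __cpuid, _xgetbv
        #endif

        // Hypothetical cloth entry points, each built in its own translation unit;
        // only the AVX one is compiled with /arch:AVX.
        void solveCloth_SSE(float* data, int count);
        void solveCloth_AVX(float* data, int count);

        static bool cpuSupportsAVX()
        {
        #ifdef _MSC_VER
            int info[4];
            __cpuid(info, 1);
            const bool osxsave = (info[2] & (1 << 27)) != 0;   // OS uses XSAVE
            const bool avx     = (info[2] & (1 << 28)) != 0;   // CPU reports AVX
            // Also check that the OS actually saves the YMM registers (XCR0 bits 1 and 2).
            return osxsave && avx && ((_xgetbv(0) & 6) == 6);
        #else
            return __builtin_cpu_supports("avx") != 0;
        #endif
        }

        void solveCloth(float* data, int count)
        {
            static const bool useAVX = cpuSupportsAVX();
            if(useAVX)  solveCloth_AVX(data, count);
            else        solveCloth_SSE(data, count);
        }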

    There is currently no plan to use AVX in other parts of the SDK. We have more urgent things to do and few people to do them. But I don’t think we’d get a large speedup from AVX in other modules. It’s useful for cloth because there’s a lot of stuff to do on a small amount of data, but the rest of the physics stuff isn’t like that (with the exception of the solver perhaps).

    Something like contact generation has a lot of logic / branches / little bits of computation here and there, but you don’t get a large number of vertices/triangles to process in these parts; it’s more about data access patterns, minimizing cache misses, etc.

    Same for raycasts/midphase kind of work: there is a bit of code in the internal nodes of a tree that could benefit from AVX, like maybe the ray-vs-box overlap, but this is a small amount of code compared to traversing the tree, collecting touched triangles, etc. So even if AVX makes the SIMD code 2X faster (and in practice it will be less than that), the midphase as a whole may only become maybe 10 or 20% faster overall. Which is nice, but not really a big deal.
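    (With made-up numbers, just to show the arithmetic: if, say, 20% of the midphase time is spent in those ray-vs-box tests and AVX makes that part 2X faster, the midphase as a whole only gets about 1 / (0.8 + 0.2/2) = ~1.11X faster, i.e. roughly 10%.)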

    Anyway, we might find good ways to use AVX further in the SDK, but so far it hasn’t looked terribly promising.

    No, we don’t just need FP vector instructions. We also use integer SIMD for a few things.

    Sorry, I can’t answer your GPU/APEX/Linux/Android/Mac questions. I don’t know what we’re going to release/expose next, it is not my decision, and the plans may change anyway. You should probably direct these questions to Nvidia or the PhysX support forums. They’ll know more than me about that stuff… (I’m a CPU guy mainly working on PC / consoles (like PS3/Xbox360-style consoles)).

  5. admin Says:

    “It’s useful for cloth because there’s a lot of stuff to do on a small amount of data” => gaaah. I meant a large amount of data :) (like: process all vertices from the cloth mesh. We never process all vertices from collision meshes in other algorithms/other parts of the SDK).

  6. admin Says:

    While I’m at it, another SIMD-related fact: SIMD is not always a win, period. There are a bunch of functions in the SDK, like sphere-vs-triangle sweep or capsule-vs-triangle overlap, that still use non-SIMD code simply because I never managed to make the SIMD version faster. That’s just the way it is. SIMD is great to optimize, say, the brute-force versions. But sometimes you have a non-SIMD version with a lot of logic & branches to avoid actually computing most things, and SIMD doesn’t really help in these cases. The brute-force SIMD version ends up slower than the smarter non-SIMD version, and SIMD can’t easily be applied on the smarter version (which too many branches and stuff that SIMD doesn’t like).
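    (A deliberately tiny, invented example of the structural difference - not SDK code. The scalar version below rejects most candidates after one cheap test, while the 4-wide version always pays for every comparison on every lane. In a toy like this the SIMD version can still win, but in something like a capsule-vs-triangle test the work that the early outs skip is much larger, and that is where the brute-force SIMD version loses.)

        #include <xmmintrin.h>

        // Scalar: in typical scenes most candidates take the first early out,
        // so the function usually does almost no work.
        bool overlapsX(float cx, float r, float minX, float maxX)
        {
            if(cx + r < minX)   return false;   // early out, taken most of the time
            if(cx - r > maxX)   return false;
            return true;
        }

        // Brute-force SIMD: four candidates per call, no early outs, every lane
        // pays for every comparison.
        int overlapsX4(__m128 cx, __m128 r, __m128 minX, __m128 maxX)
        {
            const __m128 below = _mm_cmplt_ps(_mm_add_ps(cx, r), minX);
            const __m128 above = _mm_cmpgt_ps(_mm_sub_ps(cx, r), maxX);
            return _mm_movemask_ps(_mm_or_ps(below, above)) ^ 0xF;  // set bit = not rejected
        }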

  7. admin Says:

    “which too many branches” => “which contains too many branches”

  8. oscarbg Says:

    Thanks, the detailed response is very much appreciated..
    Will try to post these GPU/APEX/Linux/Android/Mac questions on the PhysX forums..

  9. oscarbg Says:

    Thanks, the detailed response is very much appreciated..
    I don’t know enough to say much, but perhaps once CPUs get wider vector support (like 512 bit), some GPU algorithms, like the GPU rigid body pipeline by Takahiro Harada in Bullet3, will start to make sense vs the current best algorithms for CPUs.. frameworks like ISPC, or even good OpenCL CPU backends, could be good for easy implementation of vector-width-agnostic algorithms (I mean using algorithms expressed in the SPMD model, like CUDA or OpenCL, on CPUs)..
    As I said, perhaps those algorithms can exploit the vector units roughly as efficiently as custom-made code..

    Will try to post these GPU/APEX/Linux/Android/Mac questions on the PhysX forums..
    Thanks..

  10. Alocaly Says:

    As a developer who worked quite closely with PhysX in 2004/2005 (on GhostRecon Advanced Warfighter X360), I know that the team was always focused on performance!
    There was still a lot to do and to build at that time, but it was clear that performance was always at the very center of development.

    And it’s nice to see it’s still improving !
    Cheers, Pierre !
