Archive for the 'Programming' Category

PhysX tip: torsional friction

Thursday, January 17th, 2019

PhysX tip: in 4.0 there is support for torsional friction, to help robots grasp objects. See PxShape::setTorsionalPatchRadius() and PxShape::setMinTorsionalPatchRadius().
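
A minimal sketch (PhysX 4.0 API; the values are illustrative):

// A non-zero torsional patch radius lets friction resist twisting
// around the contact normal, which helps grasping:
shape->setTorsionalPatchRadius(0.02f);
shape->setMinTorsionalPatchRadius(0.005f);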

PhysX tip: pos vs vel iterations

Thursday, January 17th, 2019

PhysX tip: “position” iterations and “velocity” iterations are not equally important. My rule of thumb as a user: don’t bother increasing the velocity iters, leave it at 1. Focus on the position iters only. 4 for regular bodies and 16 or 32 for jointed objects give good results.
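
For example, assuming two PxRigidDynamic pointers (names and values are illustrative):

// Per-actor counts: (position iterations, velocity iterations)
regularBody->setSolverIterationCounts(4, 1);
jointedBody->setSolverIterationCounts(32, 1);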

PhysX tip: PxSceneQueryUpdateMode

Thursday, January 17th, 2019

PhysX tip: are you calling simulate() multiple times per frame? Maybe for sub-stepping? Then you can use the newly introduced PxSceneQueryUpdateMode enum to only update the SQ system in the last substep. That can make “fetchResults” faster for the first N-1 substeps.
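
A sketch of such a sub-stepping loop (assuming a PxScene* scene and the 4.0 enum names):

for(PxU32 i=0; i<nbSubsteps; i++)
{
    // Only build & commit the scene-query structures on the last substep:
    const bool last = (i == nbSubsteps-1);
    scene->setSceneQueryUpdateMode(last ? PxSceneQueryUpdateMode::eBUILD_ENABLED_COMMIT_ENABLED
                                        : PxSceneQueryUpdateMode::eBUILD_DISABLED_COMMIT_DISABLED);
    scene->simulate(dt/float(nbSubsteps));
    scene->fetchResults(true);
}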

PhysX tip: use a scratch memory block

Thursday, January 17th, 2019

PhysX tip: pass a scratch memory block to the simulate() function! Even if it’s just 16 or 32K, it will help reduce the number of temporary allocations. Just do it. Now.
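
A sketch (the SDK requires the block to be 16-byte aligned and a multiple of 16K in size):

const PxU32 scratchSize = 32*1024;
void* scratch = _aligned_malloc(scratchSize, 16);    // or any 16-byte-aligned allocation
scene->simulate(dt, NULL, scratch, scratchSize);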

PhysX tip: contact distance for joint limits

Thursday, January 17th, 2019

PhysX tip: beware of the default “contact distance” param for joint limits. By default it’s set up for performance rather than sim quality, and it ignores limit-related constraints until you’re about to reach them. Instead, set them up so that limits are always “active”.
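
One way to do it for e.g. a revolute joint (a sketch; the limit values are illustrative):

// A large contact distance keeps the limit constraint always active,
// trading a bit of performance for simulation quality:
PxJointAngularLimitPair limit(-PxPi/4.0f, PxPi/4.0f, PX_MAX_F32);
revolute->setLimit(limit);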

PhysX tip: D6 or dedicated joints?

Thursday, January 17th, 2019

PhysX tip: D6 or dedicated joints? The dedicated joints like PxRevoluteJoint can be a wee bit faster on the CPU because their “solver prep” functions are simpler. But only D6 joints are fully GPU-accelerated (their joint shaders run on the GPU). Minor bit of trivia that you can ignore.

PhysX tip: per-actor number of iterations

Tuesday, January 15th, 2019

PhysX tip: the number of solver iterations is a per-actor parameter in PhysX. It is NOT a global parameter. So you can e.g. use 32 iterations for jointed objects and 4 for debris, in the same scene.

PhysX 4.0 + new open source license for PhysX 3.4

Monday, December 3rd, 2018

https://news.developer.nvidia.com/announcing-physx-sdk-4-0-an-open-source-physics-engine/

https://www.youtube.com/watch?v=K1rotbzekf0

Box pruning revisited - Part 17 - PhysX lessons

Friday, June 22nd, 2018

Part 17 – PhysX lessons

In this series, we have been looking at fairly low-level optimizations so far: for example reducing the number of instructions, reducing the number of cache misses, aligning the data, switching to SIMD, etc. Roughly speaking, we have improved the implementation of an algorithm.

But that was only the first part. In this post we start investigating the other side of the story: high-level optimizations, i.e. improvements to the algorithm itself.

Multi Box Pruning

In part 1, I briefly explained where “box-pruning” came from. The details are available in the sweep-and-prune (SAP) document that I wrote a decade ago now. That paper also explains the issues with the traditional SAP algorithm, and introduces the “multi-SAP” idea to address them.

The box-pruning algorithm we have been optimizing so far is also a SAP variant, and as such it suffers from similar issues. For example box pruning also suffers from “useless interactions with far away objects”, as described in page 17 of the SAP document. This is easy to see: imagine a vertical stack of equally-sized boxes, with either Y or Z as the up/vertical axis. Our code projects the boxes on the X axis. So all “projected” values end up equal (all min values are the same, all max values are the same). The “pruning power” of the first loop drops to zero. The second loop basically degenerates to our brute-force version. This is of course an extreme example that would not happen exactly as-is in practice, but the important point is that mini-versions of that example do happen all over the place in regular cases, making the broadphase less efficient than it could be.

The “multi-SAP” solution used a grid over the 3D world. Objects were assigned to grid cells, and a regular SAP was used in each cell. By design faraway objects then ended up in faraway cells, reducing the amount of aforementioned useless interactions.

We added something similar to multi-SAP in PhysX starting from version 3.3. It was called “MBP”, for Multi Box Pruning. As the name suggests, it was basically the multi-SAP idea applied to a box-pruning algorithm rather than applied to a regular incremental sweep-and-prune. It turns out that algorithmic improvements to SAP also work for box-pruning.

The PhysX implementation differs from the Multi-SAP implementation detailed in the SAP document. For example it does not use a grid of non-overlapping cells. Instead, it needs user-provided broadphase regions, which can overlap each other. And each object can belong to an arbitrary number of cells. But other than these implementation details, the basics are the same as what I described ten years ago: the overlaps are found separately for each region, they are added to a shared pair manager structure, and some extra management code is needed to deal with objects crossing regions.

In our box pruning project here, we do not have any API for updating or removing objects from the structures: we effectively do the PhysX “add object” operations each time. This has pros and cons.

The bad thing about this is that it is not optimal in terms of performance. In all our timings and tests so far, we have always been recomputing the exact same set of overlaps each time – each “frame” if you want. A good broadphase would take advantage of this, notice that objects have not changed from one call to the next, and only the first call would actually compute the overlaps. Subsequent calls would be either free (as for the incremental SAP in PhysX) or at least significantly cheaper (as for MBP in PhysX).

On the other hand, the good thing about this is that we do not need the previously mentioned “extra management code” from MBP in our little project here. To replicate the “add object” codepath from MBP, all we would need is some grid cells and a pair manager, i.e. for example a hash-map. Recall that so far we added overlapping pairs to a simple dynamic array (like an std::vector<Pair>). In MBP, an object touching two separate grid cells is added to both cells. So when computing the overlaps in each cell individually, it is possible to find the same overlapping pair multiple times. Some mechanism is needed to filter out duplicates, and in PhysX we do this by adding pairs to a shared hash-map / hash-set.

We do not have user-defined broadphase regions in this project, but we could try to compute some of them automatically. Here is how it would work in 2D:

a) Start with a bunch of objects (the blue spheres). Compute the black bounding box around them.

Picture 1: compute the AABB around all objects

b) Subdivide the bounding box computed in the first step. This gives 4 broadphase regions.

Picture 2: subdivide the AABB into 4 regions

c) Classify objects depending on which region they fall in. Green objects go into the 1st region, yellow objects into the 2nd region, blue objects into the 3rd region, magenta objects into the 4th region.

Picture 3: assign objects to regions

d) The remaining red objects touch multiple regions. In PhysX using MBP, each red object is duplicated in the system, added to each region it touches.

Picture 4: objects touching multiple regions are duplicated

To find the overlaps, we then perform 4 “completeBoxPruning” calls (one for each region). Colliding pairs are added to a shared hash-map, which by nature filters out duplicates. So for example a pair of touching red objects, like the one overlapping the green and blue regions in the picture, will be reported twice: once by the green broadphase region and once by the blue one. But since we add the pair to a shared (unique) hash-map, the second addition has no effect.

So we could replicate this in our box-pruning project. We could implement a hash-map (or reuse one that I released on my website years ago) and give this a try. But adding the same objects to different regions also leads to increased memory usage. And the resulting code quickly becomes unpleasant.

So instead, in this project we’re going to try something different.

Bucket pruner

There is another interesting data structure in PhysX called a “bucket pruner”. This is used in the context of scene queries (i.e. raycasts, overlap tests, etc) rather than broadphase. Long story short, we needed something that was faster to build than a regular AABB tree, while providing some speedup compared to testing each object individually.

Roughly speaking, it is built exactly like what we just described before. The 4 previously defined broadphase regions are what we call the “natural buckets”. But the structure handles the red objects in a different way: they are not duplicated and added to existing regions; instead they go to a special 5th bucket named “cross bucket”.

Picture 5: put duplicated objects in their own region

It is called “cross bucket” (or boundary bucket) because objects inside it often form a cross-like shape. It also contains objects that cross a boundary, so the name has a double meaning.

Picture 6: the cross bucket

In PhysX the classification process is then repeated for each bucket (with special subdivision rules for the cross-bucket). This is done recursively a fixed number of times. At the end of the day the bucket pruner is a mix between a BVH and a spatial-partitioning structure: we compute bounds around objects as in a BVH, but we divide the resulting space into equally-sized parts as in a spatial-partitioning structure (e.g. a quadtree). Since we do not do any clever analysis to find the best splitting points, and because we only recurse a small fixed number of times, the whole thing is quick to build – much quicker than a regular AABB tree for example.

Now, how does that help us for box-pruning?

This is where all the pieces of the puzzle come together, and my cunning plan is revealed.

Do you remember part 1, where I mentioned the BipartiteBoxPruning function? This is the companion to the CompleteBoxPruning function we have been optimizing so far. CompleteBoxPruning finds overlaps within a set of objects. BipartiteBoxPruning finds overlaps between two separate sets of objects. As we mentioned in part 1, all the optimizations we have been doing up to this point are equally valid for the bipartite case. And while I did not put any emphasis on it, I kept updating the bipartite case as well along the way.

That was by design. I knew I was going to need it eventually.

It turns out that we can take advantage of the bipartite case now, to optimize the complete case. The trick is simply to run 4 CompleteBoxPruning calls for the 4 natural buckets (as in MBP), and then run 5 additional calls:

  • 1 CompleteBoxPruning call to find overlaps within the 5th bucket
  • 4 BipartiteBoxPruning calls to find overlaps between the 5th bucket and buckets 1 to 4

Because of the extra calls it might be less efficient than the MBP approach overall (I did not actually try both here), but this new alternative implementation has two advantages over MBP:

  • There is no need to duplicate the objects. We can simply reshuffle the same input arrays into 5 sections, without the need to allocate more memory.
  • There is no need for a hash-map. This approach does not generate duplicate pairs so we can keep adding our colliding pairs to our simple dynamic array like before.

Overall this approach is much simpler to test in the context of our box-pruning project, so that is why I selected it.

Implementation in our project

First, we allocate a little bit more space than before for our split boxes. Recall that in version 16 we did something like this:
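
Something along these lines (a reconstruction, using the SIMD_AABB_X / SIMD_AABB_YZ types from previous parts):

SIMD_AABB_X* BoxListX = new SIMD_AABB_X[nb+1+5];    // +1 sentinel, +5 for the unrolled loops
SIMD_AABB_YZ* BoxListYZ = (SIMD_AABB_YZ*)_aligned_malloc(sizeof(SIMD_AABB_YZ)*(nb+1), 16);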

The +1 in the BoxListX allocation was initially added in version 4, to make space for a sentinel value. The extra +5 was added later when we unrolled the loops that many times. And then the +1 in BoxListYZ is actually unnecessary: it appeared when we split the boxes in version 9c (we allocated the same amount of “X” and “YZ” parts), but we only need sentinels in the X buffer so we could allocate one less element in the YZ buffer.

Now, for version 17 we are going to create 5 distinct sections within these buffers (one for each bucket). Each bucket will be processed like our arrays from version 16, i.e. each of them needs to have sentinel values. Thus for 5 buckets we need to allocate 5 times more sentinels than before, i.e. a total of 30 extra entries in the X buffer. The YZ buffer does not need extra space however, thanks to our previous observation that allocating one extra entry there was in fact not needed. The resulting code looks like this:
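
In sketch form:

// 5 buckets, each with 1 sentinel + 5 extra entries for the unrolled loops:
SIMD_AABB_X* BoxListX = new SIMD_AABB_X[nb+5*(1+5)];
// No sentinels needed in the YZ buffer, so no extra entries this time:
SIMD_AABB_YZ* BoxListYZ = (SIMD_AABB_YZ*)_aligned_malloc(sizeof(SIMD_AABB_YZ)*nb, 16);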

Next, we compute the initial bounding box (AABB) around the objects. This is the black bounding box in picture 1). We already had a loop over the source data to compute “PosList” in version 16, so computing an extra bounding box at the same time is cheap (it does not introduce new cache misses or new branches):
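
A sketch of the augmented loop (exact names differ in the real code):

__m128 globalMin = _mm_set1_ps(FLT_MAX);
__m128 globalMax = _mm_set1_ps(-FLT_MAX);
for(udword i=0; i<nb; i++)
{
    const AABB& box = list[i];
    PosList[i] = box.mMin.x;    // same as version 16
    // Merge the box into the global bounds. The second load reads 4 bytes
    // past mMax.z for the last box - see the note below.
    globalMin = _mm_min_ps(globalMin, _mm_loadu_ps(&box.mMin.x));
    globalMax = _mm_max_ps(globalMax, _mm_loadu_ps(&box.mMax.x));
}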

The only subtle issue here is that the last SIMD load on the last box reads 4 bytes past the end of the source array. This can crash if the last allocated entry was exactly at the end of a memory page. The simplest way to address this is to allocate one more dummy box in the source array.

Once the AABB is computed, we can subdivide it as shown in picture 2). We are not doing any clever analysis to find the split points: we just take the middle of the box along Y and Z. The main box-pruning code projects the boxes on the X axis already (hardcoded in Part 5), so we ignore the X axis for our broadphase regions, and we split along the two other axes:
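
With globalBounds denoting the AABB we just computed, that is simply:

// No clever analysis: split in the middle along Y and Z.
const float limitY = (globalBounds.mMin.y + globalBounds.mMax.y)*0.5f;
const float limitZ = (globalBounds.mMin.z + globalBounds.mMax.z)*0.5f;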

These two limit values and the initial bounding box implicitly define the 4 bounding boxes around the natural buckets. With this knowledge we can then classify boxes, and assign each of them to a bucket. Because we ignore the X axis, we effectively deal with 2D boxes here. A box fully contained within one of the 4 natural buckets is given an ID between 0 and 3. A box crossing a bucket’s boundary ends up in the cross bucket, with ID 4.

The classification is a fairly straightforward matter:
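
In sketch form, for each incoming box (whether the comparisons are strict may differ in the real code; gBucketCodes is a hypothetical name for the look-up table derived below):

const bool leftPart  = box.mMax.y <= limitY;    // fully on the left side of the Y split
const bool rightPart = box.mMin.y >= limitY;    // fully on the right side of the Y split
const bool lowerPart = box.mMax.z <= limitZ;    // fully on the lower side of the Z split
const bool upperPart = box.mMin.z >= limitZ;    // fully on the upper side of the Z split

// 4-bit mask indexing a small look-up table (bit layout derived below):
const udword mask = (udword(lowerPart)<<3)|(udword(upperPart)<<2)|(udword(leftPart)<<1)|udword(rightPart);
const udword bucketIndex = gBucketCodes[mask];    // 0-3 = natural buckets, 4 = cross bucket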

Each box is tested against the previously computed limits. For the sake of completeness, let’s reverse-engineer the code. Let’s call B the main black box from Picture 1).

We see from the way it is computed that limitY is B’s center value on axis Y. We then compare it to the incoming box’s min and max Y values. We compare Y values to Y values, which is consistent and correct. If leftPart is true, it means the incoming box is fully on the left side of the split, i.e. the box does not cross the Y boundary. But if rightPart is true, the box is fully on the right side of the split. It should not be possible for leftPart and rightPart to be both true at the same time.

Similarly, limitZ is B’s center value on axis Z. We compare it to the incoming box’s min and max Z values, which is also consistent and correct. If lowerPart is true, it means the incoming box is fully on the lower side of the split, i.e. the box does not cross the Z boundary. But if upperPart is true, the box is fully on the upper side of the split. It should not be possible for lowerPart and upperPart to be both true at the same time.

The classification is then:

  • leftPart && lowerPart => the box is fully in region A
  • leftPart && upperPart => the box is fully in region B
  • rightPart && lowerPart => the box is fully in region C
  • rightPart && upperPart => the box is fully in region D

In any other case, the box is crossing at least one boundary, and thus ends up in region E – the cross bucket. Doing these comparisons is a bit costly so instead we compute a 4-bit mask from the test results and use it as an index into a small look-up table. The disassembly shows that using a table avoids branches (this is similar to what we covered in version 8):

00AE3B90 movss xmm0,dword ptr [edx]
00AE3B94 xor ecx,ecx
00AE3B96 comiss xmm2,dword ptr [edx+0Ch]
00AE3B9A lea edx,[edx+18h]
00AE3B9D seta cl
00AE3BA0 xor eax,eax
00AE3BA2 lea esi,[esi+4]
00AE3BA5 add ecx,ecx
00AE3BA7 comiss xmm0,xmm2
00AE3BAA movss xmm0,dword ptr [edx-1Ch]
00AE3BAF seta al
00AE3BB2 or ecx,eax
00AE3BB4 xor eax,eax
00AE3BB6 add ecx,ecx
00AE3BB8 comiss xmm1,dword ptr [edx-10h]
00AE3BBC seta al
00AE3BBF or ecx,eax
00AE3BC1 xor eax,eax
00AE3BC3 add ecx,ecx
00AE3BC5 comiss xmm0,xmm1
00AE3BC8 seta al
00AE3BCB or ecx,eax
00AE3BCD movzx eax,byte ptr [ecx+0B04250h]
00AE3BD4 mov dword ptr [esi-4],eax
00AE3BD7 inc dword ptr [esp+eax*4+0A0h]
00AE3BDE dec edi
00AE3BDF jne CompleteBoxPruning+1B0h (0AE3B90h)

It may not be optimal but it is good enough for now. The table is simple enough to derive. We organize the mask like this:

lowerPart | upperPart | leftPart | rightPart

Thus we get:

  • 0000 - region E
  • 0001 - region E
  • 0010 - region E
  • 0011 - leftPart/rightPart both set, not possible
  • 0100 - region E
  • 0101 - upperPart && rightPart => region D
  • 0110 - upperPart && leftPart => region B
  • 0111 - leftPart/rightPart both set, not possible
  • 1000 - region E
  • 1001 - lowerPart && rightPart => region C
  • 1010 - lowerPart && leftPart => region A
  • 1011 - leftPart/rightPart both set, not possible
  • 1100 - lowerPart/upperPart both set, not possible
  • 1101 - lowerPart/upperPart both set, not possible
  • 1110 - lowerPart/upperPart both set, not possible
  • 1111 - leftPart/rightPart both set, not possible
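
With 4 denoting region E (the cross bucket), and assuming regions A to D map to bucket indices 0 to 3, the table could look like this (the impossible entries also map to 4):

static const udword gBucketCodes[16] = {
    4, 4, 4, 4,    // 0000, 0001, 0010, 0011 (impossible)
    4, 3, 1, 4,    // 0100, 0101 => D, 0110 => B, 0111 (impossible)
    4, 2, 0, 4,    // 1000, 1001 => C, 1010 => A, 1011 (impossible)
    4, 4, 4, 4     // 1100 to 1111 (all impossible)
};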

Once the bucket indices are available for all boxes, BoxXListBuffer and BoxListYZBuffer are filled with sorted boxes, like we did in version 16. The only difference is that boxes are now stored per bucket: all boxes of bucket 0 (sorted along X), then all boxes of bucket 1 (sorted along X), and so on. This part is just simple pointer and counter book-keeping, no major issue. Just remember to write all the necessary sentinels at the end of each section within the array.

At this point the data is ready and we can do the actual pruning. In version 16, we only had one “complete box pruning” function running there. In version 17 we will need to run multiple “complete box pruning” calls on 5 distinct parts of the arrays, and additional “bipartite box pruning” calls. Thus we first copy the corresponding code into separate functions:
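
In sketch form, with hypothetical signatures (the bodies are the version 16 loops):

static void doCompleteBoxPruning(PairCollector& pairs, udword nb,
                                 const SIMD_AABB_X* listX, const SIMD_AABB_YZ* listYZ,
                                 const udword* remap);

static void doBipartiteBoxPruning(PairCollector& pairs,
                                  udword nb0, const SIMD_AABB_X* listX0, const SIMD_AABB_YZ* listYZ0, const udword* remap0,
                                  udword nb1, const SIMD_AABB_X* listX1, const SIMD_AABB_YZ* listYZ1, const udword* remap1);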

This is the same code otherwise as in version 16; we just move it to separate functions.

Finally, we do the sequence of complete and bipartite pruning calls that we mentioned in a previous paragraph:
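
Again as a sketch, using the per-bucket counts and section pointers set up during the fill:

// Find overlaps within each of the 5 buckets:
for(udword i=0; i<5; i++)
    doCompleteBoxPruning(pairs, nbPerBucket[i], boxListX[i], boxListYZ[i], remap[i]);

// Find overlaps between the cross bucket (4) and the 4 natural buckets:
for(udword i=0; i<4; i++)
    doBipartiteBoxPruning(pairs, nbPerBucket[4], boxListX[4], boxListYZ[4], remap[4],
                                 nbPerBucket[i], boxListX[i], boxListYZ[i], remap[i]);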

In these two for loops, i is the bucket index. We first find overlaps within each of the 5 buckets (the first for loop), then we find overlaps between bucket 4 (the cross bucket) and the first 4 natural buckets.

That’s it.

As promised, the modifications we made to try this new approach are somewhat minimal. There is no need to introduce a hash-map or more complicated memory management, it’s all rather simple.

Admittedly, the approach does not guarantee performance gains compared to version 16. We could very well find degenerate cases where all objects end up in the same bucket. But even in these bad cases we would effectively end up with the same performance as version 16, with a modest overhead introduced by the bounding box computation and box classification. Not the end of the world. Most of the time objects are distributed fairly homogeneously, and the approach gives clear performance gains.

It certainly does in our test project, in any case:

New office PC – Intel i7-6850K

Version                                Timings (K-Cycles)   Overall X factor
Version2 – base                        66245                1.0
Version14d – integer cmp 2             5452                 ~12.15
Version15a – SSE2 intrinsics           5676                 ~11.67
Version15b – SSE2 assembly             3924                 ~16.88
Version15c – AVX assembly              2413                 ~27.45
Version16 – revisited pair reporting   4891                 ~13.54
Version17 – multi box pruning          3763                 ~17.60

Home laptop – Intel i5-3210M

Version                                Timings (K-Cycles)   Overall X factor
Version2 – base                        62324                1.0
Version14d – integer cmp 2             5011                 ~12.43
Version15a – SSE2 intrinsics           5641                 ~11.04
Version15b – SSE2 assembly             4074                 ~15.29
Version15c – AVX assembly              2587                 ~24.09
Version16 – revisited pair reporting   4743                 ~13.14
Version17 – multi box pruning          3377                 ~18.45

Home desktop PC

Version                                Timings (K-Cycles)          Overall X factor
Version2 – base                        98822                       1.0
Version14d – integer cmp 2             7386                        ~13.37
Version15a – SSE2 intrinsics           16981                       ~5.81
Version15b – SSE2 assembly             6657                        ~14.84
Version15c – AVX assembly              Crash (AVX not supported)   0
Version16 – revisited pair reporting   7231                        ~13.66
Version17 – multi box pruning          5083                        ~19.44

What do we get?

We see clear gains compared to our previous version 16 on all machines. This is the only “apples-to-apples” comparison we have here, since the pruning code is effectively the same in both versions. So we only measure the effect of our high-level optimization here, and we can conclude that it does work.

Perhaps more interestingly, version 17 is also faster than version 15b on all tested machines. That is, our C++ SSE2 version is now faster than our best assembly SSE2 version. This is pretty good because it was the opposite before: version 15b was faster than version 16 on all machines. Of course the comparison is a bit unfair here, and we could port the high-level optimization to version 15b to get some gains there as well. However we would need to write the whole bipartite function in assembly, so that is a lot more work than what we did for the C++ version. Exercise left to the readers and all that (or to cough Fabian cough :)).

Finally, on machines that support it, the AVX version remains the fastest. Or so it seems.

The trap of the unique benchmark

In this series we have consistently made one cardinal mistake. Something that experienced programmers know is a very bad idea. We ignored the first rule of optimization 101:

Use more than one benchmark.

You need to start somewhere of course, and we’ve gone a long way with our unique test scenario in this project – so a single benchmark does provide value, and it is certainly better than no benchmark at all.

However, using a single benchmark has its perils. By nature it tends to test a single codepath, a single configuration, a single way to navigate through your code. And of course, there is absolutely no guarantee that a different scenario produces the same performance profile. There is no guarantee that a specific optimization helps in all cases. Change the input data, and maybe you reach a different conclusion. Maybe your algorithm does not scale. Maybe something helps for a large number of objects, but hurts for a small number of them. Maybe one optimization helps when a lot of boxes do overlap, but hurts when the same boxes are disjoint. And so on. There is a huge amount of combinations, codepaths, configurations, which are as many traps one can fall into.

And thus, at some point one must start using more than one benchmark. In PhysX for example, I remember creating 4 or 5 different benchmarks just to test a single box-vs-triangle overlap function (similar to my blog post here). The function had multiple codepaths, some with early exits, and I created a separate benchmark for each of them. This is impossible to do for a full physics engine, but that is the spirit.

Now this is a toy project here so I will not go crazy with the number of scenarios we support: I will just add one more, to prove a point.

Go back to the main source file where we define the test. Find this line:

const udword NbBoxes = 10000;

Then just add a zero:

const udword NbBoxes = 100000;

What happens when we test ten times more boxes?

Well the first thing one notices is that our initial versions become insufferably slow. The code now finds and reports 1144045 pairs, and it brings our first versions to their knees. It is so bad that our profiling function cannot even measure the time properly: our naïve version only returned the lower 32-bit part of the TSC counter (which is a 64-bit value), and the initial code is so slow that the counter wraps around, producing meaningless timings.

So I will just ignore the first versions and show the results for the latest ones. This is now for 100000 boxes:

New office PC – Intel i7-6850K

Version                                Timings (K-Cycles)   Overall X factor
Version15a – SSE2 intrinsics           536815               -
Version15b – SSE2 assembly             374524               -
Version15c – AVX assembly              254231               -
Version16 – revisited pair reporting   490841               -
Version17 – multi box pruning          182715               -

Home laptop – Intel i5-3210M

Version                                Timings (K-Cycles)   Overall X factor
Version15a – SSE2 intrinsics           535593               -
Version15b – SSE2 assembly             362464               -
Version15c – AVX assembly              370017               -
Version16 – revisited pair reporting   495961               -
Version17 – multi box pruning          188884               -

Home desktop PC

Version                                Timings (K-Cycles)          Overall X factor
Version15a – SSE2 intrinsics           1737408                     -
Version15b – SSE2 assembly             687806                      -
Version15c – AVX assembly              Crash (AVX not supported)   -
Version16 – revisited pair reporting   919140                      -
Version17 – multi box pruning          312065                      -

While the timings generally follow the same pattern as in our previous test with fewer boxes, we clearly see that version 17 is now the fastest on all tested machines. The high-level optimization was more effective for a large number of boxes than it was for a “small” number of them.

I did not do a thorough analysis to explain why, but generally speaking it is true that simple brute-force versions can be faster than smarter versions when you only have a small number of items to process. In our case we added some overhead to compute the initial bounding box and classify boxes into buckets, which is an O(n) part, while the following pruning loops are not O(n) (as you can see from the timings: we multiplied the number of boxes by 10 but the time it takes to process them grew by more than 10). The relative cost of the “preparation” part compared to the “pruning” part has an impact on which version is eventually the fastest.

And then again, there are different codepaths within the pruning part itself (how many overlap tests do we end up with? How many of them report a hit?) and increasing the number of boxes might add some pressure on a codepath that was previously less traveled.

In short: one benchmark is not enough.

The bipartite case

We mentioned that until now, all our improvements to the complete case were equally valid for the bipartite case. But the optimizations for the bipartite case will be slightly different in this version, since we cannot exactly replicate there what we did for the complete case.

What we can do is the following:

  • Compute buckets for both incoming sets of objects A and B (similar to what we did before)
  • Compute buckets’ bounds (this is new, we did not need them in the complete case)
  • Do a bipartite box pruning call between each bucket of A and each bucket of B, if their respective bounds overlap (see the sketch below).
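
Put together, a minimal sketch of that loop (assuming per-bucket counts, section pointers and bounds computed as in the complete case; Intersect2D is a hypothetical bounds-overlap test on the Y/Z axes):

for(udword i=0; i<5; i++)    // buckets of set A
{
    for(udword j=0; j<5; j++)    // buckets of set B
    {
        // Cheap rejection first: faraway buckets cannot contain overlapping pairs.
        if(Intersect2D(boundsA[i], boundsB[j]))
            doBipartiteBoxPruning(pairs, countsA[i], listXA[i], listYZA[i], remapA[i],
                                         countsB[j], listXB[j], listYZB[j], remapB[j]);
    }
}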

And with this, we can wrap this part up.

What we learnt:

Low-level optimizations are important but not the only thing to care about. High-level algorithmic optimizations are equally important.

One benchmark is not enough. Some optimizations might work in some scenarios, while being detrimental in others. Use multiple benchmarks with various workloads and various configurations, to stress multiple codepaths.

We can make the complete-box-pruning codepath faster using the bipartite-box-pruning codepath. But we cannot make the bipartite case faster using the complete case. For the first time the bipartite case needed specific changes.

We reached our goal of being faster than the AVX version, at least in some scenarios.

So… are we finally done?

Ah.

Not really.

We just opened the door to high-level optimizations, so we still have a long way to go.

GitHub code for version 17 is here.

Box pruning revisited - Part 16 - improved pair reporting

Thursday, May 10th, 2018

Part 16 – improved pair reporting

In Part 15 we saw that there were some cycles to save in the part of the code that reports pairs.

By nature, the AVX version had to report multiple results at the same time. Thus in this context it was natural to end up modifying the part that reports overlaps.

But there was no incentive for me to touch this before, because that part of the code is not something that would survive as-is in a production version. In the real world (e.g. in an actual physics engine), the part reporting overlapping pairs tends to be more complicated (and much slower) anyway. In some cases, you need to output results to a hash-map instead of just a dynamic vector. In some cases, you need to do more complex filtering before accepting a pair. Sometimes that means going through a user-defined callback to let customers apply their own logic and decide whether they want the pair or not. All these things are much slower than our simple pair reporting, so basically this is not the place we should bother optimizing.

That being said, the code reporting overlaps does have obvious issues and in the context of this project it is certainly fair game to address them. So, let’s revisit this now.

So far we have used a simple container class to report our pairs, which was basically equivalent to an std::vector<int>. (If you wonder why I didn’t use an actual std::vector, please go read Part 11 again).

The code for adding an int to the container (equivalent to a simple push_back) was:
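
It was the equivalent of this (a reconstruction; exact names may differ):

Container& Container::Add(udword entry)
{
    // Resize if needed - one check per added integer:
    if(mCurNbEntries==mMaxNbEntries)
        Resize();
    // Store the entry and update the class member counter:
    mEntries[mCurNbEntries++] = entry;
    return *this;
}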

And the code to add a pair to the array was something like:
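
That is, something like:

// Two calls per pair => two resize checks and two updates of mCurNbEntries:
pairs.Add(index0);
pairs.Add(index1);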

So there are two obvious issues here:

  • Because we add the two integers making up a pair separately, we do two resize checks (two comparisons) per pair instead of one. Now the way the benchmark is set up, we resize the array during the first run, but never again afterwards. So in all subsequent runs we always take the same branch, and in theory there shouldn’t be any misprediction. But still, that’s one more comparison & jump than necessary.
  • We update mCurNbEntries twice per pair instead of once. Since this is a class member, this is exactly the kind of thing that would have given a bad load-hit-store (LHS) on Xbox 360. I did not investigate to see what kind of penalty (if any) it produced on PC in this case, but regardless: we can do better.

I am aware that none of these issues would have appeared if we had used a standard std::vector<Pair> for example. However, we would have had other issues - as seen in a previous report. I used this non-templated custom array class in the original code simply because it is the only one I was familiar with back in 2002 (and the class itself was older than that; records show it was written around February 2000. Things were different back then).

In any case, we can address these problems in a number of ways. I just picked an easy one, and I now create a small wrapper passed to the collision code:
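
A sketch of the idea (hypothetical names; the point is a single capacity check and a single counter update per pair):

struct PairReporter
{
    inline void AddPair(udword index0, udword index1)
    {
        // One resize check per pair instead of two:
        if(mCurNbEntries+2 > mMaxNbEntries)
            Resize();
        mEntries[mCurNbEntries]   = index0;
        mEntries[mCurNbEntries+1] = index1;
        // One class member update per pair instead of two:
        mCurNbEntries += 2;
    }

    udword* mEntries;
    udword  mCurNbEntries;
    udword  mMaxNbEntries;
};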

And the pairs are now reported as you’d expect from reading the code:
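
Again in sketch form:

// One call per pair:
pairs.AddPair(index0, index1);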

So it just does one comparison and one class member read-modify-write operation per pair, instead of two. Trivial stuff, there isn’t much to it.

But this simple change is enough to produce measurable gains, reported in the following tables. Version16 there should be compared to Version 14d – it’s the same code, only the pair reporting function has changed.

New office PC – Intel i7-6850K

Version                                Timings (K-Cycles)   Overall X factor
Version2 – base                        66245                1.0
Version14d – integer cmp 2             5452                 ~12.15
Version15a – SSE2 intrinsics           5676                 ~11.67
Version15b – SSE2 assembly             3924                 ~16.88
Version15c – AVX assembly              2413                 ~27.45
Version16 – revisited pair reporting   4891                 ~13.54

Home laptop – Intel i5-3210M

Version                                Timings (K-Cycles)   Overall X factor
Version2 – base                        62324                1.0
Version14d – integer cmp 2             5011                 ~12.43
Version15a – SSE2 intrinsics           5641                 ~11.04
Version15b – SSE2 assembly             4074                 ~15.29
Version15c – AVX assembly              2587                 ~24.09
Version16 – revisited pair reporting   4743                 ~13.14

Home desktop PC

Version                                Timings (K-Cycles)          Overall X factor
Version2 – base                        98822                       1.0
Version14d – integer cmp 2             7386                        ~13.37 (*)
Version15a – SSE2 intrinsics           16981                       ~5.81
Version15b – SSE2 assembly             6657                        ~14.84
Version15c – AVX assembly              Crash (AVX not supported)   0
Version16 – revisited pair reporting   7231                        ~13.66

(*) There was an error for this number in Part 15. It said “13.79” instead of “13.37” like in previous reports.

The gains are all over the place: 561 K-Cycles on one machine, 268 K-Cycles on another and a disappointing 155 K-Cycles on my home desktop PC. That last one rings a bell: I have a vague memory of removing the entire pair report mechanism at some point on this PC, to check how much the whole thing was costing me. The gains were so minimal I didn’t bother investigating further.

For some reason the new machines give better gains. Due to lack of time (and lack of motivation: this part of the code is not very interesting to me), I did not investigate why. It’s faster for “obvious” theoretical reasons (we do less work), we see gains in practice (at least we don’t get slower results), and that’s good enough for now.

Similarly there would be more to do to fully replicate Ryg’s changes from the AVX version. By nature the AVX version reports multiple pairs at the same time, but in a way the same can be said about our unrolled loops, and we could try to use the same strategies there. For example the resize check could be done only once before a loop unrolled N times starts, making sure that there is enough space in the array to write N pairs there (without extra checks). But I did not bother investigating further at this point: I am happy with the gains we got from the trivial change, and if the AVX version still has a small additional advantage from its improved pair reporting code, so be it.

What we learnt:

The “C++” code I wrote 15 years ago was not great. That pair reporting part was a bit lame. But then again 15 years ago I had never heard of load-hit-store penalties.

Show your code. Put it online. People will pick it up and point out the issues you forgot or didn’t know you had in there.

We closed the gap a bit between our best “C++” SSE2 version and the fastest available SSE2 assembly version.

This was probably the last “easy” optimization available before tackling something much bigger.

GitHub code for version 16 is here.
