Archive for the 'Programming' Category

Box pruning revisited - Part 15 - AVX

Thursday, May 3rd, 2018

Part 15 – AVX

In version 14b we looked at Ryg’s initial experiments with this project. If you followed his progress on GitHub, you probably already know that he went ahead and did a lot more than what we covered so far. In particular, his latest version uses AVX instructions. I couldn’t try this myself before, since my PC did not support them. But things are different now (see previous post) so it’s time to look at Ryg’s most recent efforts (which are already one year old at this point).

According to Steam’s hardware survey from April 2018, AVX is widely available (86%) but still not as ubiquitous as SSE2 (100%!):

That’s why I never really looked at it seriously so far. At the end of the day, and in the back of my mind, I would like to use these optimizations on PC but also other platforms like consoles. Hopefully we will go back to this later in the series - I did run all these tests on consoles as well, and the results are not always the same as for the PC. In PhysX we are already suffering from the power difference between a modern PC and current-gen consoles: the consoles are sometimes struggling to handle something that runs just fine on PC. Using AVX would only tip the scales even more in favor of PCs. That being said, we had the same discussion about SSE2 back in the days, and now SSE2 is everywhere… including on consoles. So it is reasonable to expect AVX to be available in all next-gen consoles as well, and the time spent learning it now is hopefully just a good investment for the future. (I would totally take that time and play with AVX just for fun if I wouldn’t have a child. But these days my free time is severely limited, as you noticed with the one year gap between 14d and 14e, so I have to choose my targets wisely).

Alright.

Like last time, Fabian was kind enough to include detailed notes about what he did. So, I will just copy-paste here what you probably already read a year ago anyway. In his own words:

Here’s what I did to the code to arrive at the current version:

I already wrote a note on the earlier changes that were just cleaning up the ASM code and unrolling it a bit. That text file is a gist and available here:

https://gist.github.com/rygorous/fdd41f45b24472649aaeb5b55bbe6e26

…and then someone on Twitter asked “what if you used AVX instead of SSE2?”. My initial response boiled down to “that’s not gonna work with how the loop is currently set up”. The problem is that the original idea of having one box=4 floats, doing all 4 compares with a single SSE compare, doing a movemask, and then checking whether we get the number that means “box intersects” (12 in the original case) fundamentally doesn’t translate well to 8-wide: now instead of one box per compare, we’re testing two (let’s call them box1[0] and box1[1]), and instead of two possible outcomes, we now have four, where ? stands for any hex digit but ‘c’:

  1. Both box1[0] and box1[1] intersect box0. (=movemask gives 0xcc)
  2. box1[0] intersects box0, box1[1] doesn’t. (=movemask gives 0x?c)
  3. box1[1] intersects box0, box1[0] doesn’t. (=movemask gives 0xc?)
  4. neither intersects box0. (=movemask gives 0x??)

Instead of the previous solution where we had exactly one value to compare against, now we need to do a more expensive test like “(mask & 0xf) == 12 || (mask & 0xf0) == 0xc0″. Not the end of the world, but it means something like

mov tmp, eax
and tmp, 15
cmp tmp, 12
je  FoundOne
and eax, 15
cmp eax, 12
je  FoundOne

which is one more temp register required and several extra uops, and as we saw in the gist, this loop was pretty tight before. Whenever something like this happens, it’s a sign that you’re working against the grain of the hardware and you should maybe take a step back and consider something different.

That “something different” in this case was converting the loop to use a SoA layout (structure of arrays, google it, enough has been written about it elsewhere) and doing some back of the envelope math (also in this repository, “notes_avx.txt”) to figure out how much work that would be per loop iteration. End result: 12 fused uops per iter to process *8* boxes (instead of 20 fused uops/iter to process 4 using SSE2!), still with a fairly decent load balance across the ports. This could be a win, and not just for AVX, but with SSE2 as well.

Addressing
———-

The first problem was that this code is 32-bit x86, which is register-starved, and going to SoA means we (in theory) need to keep around five pointers in the main loop, instead of just one.

Possible (barely) but very inconvenient, and there’s no way you want to be incrementing all of them. Luckily, we don’t have to: the first observation is that all the arrays we advance through have the same element stride (four bytes), so we don’t actually need to keep incrementing 5 pointers, because the distances between the pointers always stay the same. We can just compute those distances, increment *one* of the pointers, and use [reg1+reg2] addressing modes to compute the rest on the fly.

The second trick is to use x86’s scaling indexing addressing modes: we can not just use address expressions [reg1+reg2], but also [reg1+reg2*2], [reg1+reg2*4] and [reg1+reg2*8] (left-shifts by 0 through 3). The *8 version is not very useful to use unless we want to add a lot of padding since we’re just dealing with 6 arrays, but that narrows us down to four choices:

  1. [base_ptr]
  2. [base_ptr + dist]
  3. [base_ptr + dist*2]
  4. [base_ptr + dist*4]

and we need to address five arrays in the main loop. So spending only 2 registers isn’t really practical unless we want to allocate a bunch of extra memory. I opted against it, and choose 2 “dist” registers, one being the negation of the other. That means we can use the following:

  1. [base_ptr]
  2. [base_ptr + dist_pos]
  3. [base_ptr + dist_pos*2]
  4. [base_ptr - dist_pos] = [base_ptr + dist_neg]
  5. [base_ptr - dist_pos*2] = [base_ptr + dist_neg*2]

ta-daa, five pointers in three registers with only one increment per loop iteration. The layout I chose arranges the arrays as follows:

  1. BoxMaxX[size]
  2. BoxMinX[size]
  3. BoxMaxY[size]
  4. BoxMinY[size]
  5. BoxMaxZ[size]
  6. BoxMinZ[size]

which has the 5 arrays we keep hitting in the main loop all contiguous, and then makes base_ptr point at the fourth one (BoxMinY).

You can see this all in the C++ code already, although it really doesn’t generate great code there. The pointer-casting to do bytewise additions and subtractions is all wrapped into “PtrAddBytes” to de-noise the C++ code. (Otherwise, you wouldn’t able to see anything for the constant type casts.)

Reporting intersections (first version)
—————————————

This version, unlike Pierre’s original approach, “natively” processes multiple boxes at the same time, and only does one compare for the bunch of them.

In fact, the basic version of this approach is pretty canonical and would be easily written in something like ISPC (ispc.github.io). Since the goal of this particular version was to be as fast as possible, I didn’t do that here though, since I wanted the option to do weird abstraction-breaking stuff when I wanted to. :)

Anyway, the problem is that now, we can find multiple pairs at once, and we need to handle this correctly.

Commit id 9e171cf6 has the basic approach: our new ReportIntersections gets a bit mask of reported intersections, and adds the pairs once by one. This loop uses an x86 bit scan instruction (”bsf”) to find the location of the first set bit in the mask, remaps its ID, then adds it to the result array, and finally uses “mask &= mask - 1;” which is a standard bit trick to clear the lowest set bit in a value.

This was good enough at the start, although I later switched to something else.

The basic loop (SSE version)
—————————-

I first started with the SSE variant (since the estimate said that should be faster as well) before trying to get AVX to run. Commit id 1fd8579c has the initial implementation. The main loop is pretty much like described in notes_avx.txt (edi is our base_ptr, edx=dist_neg, ecx=dist_pos).

In addition to this main loop, we also need code to process the tail of the array, when some of the boxes have a mMinX > MaxLimit. This logic is fairly easy: identify such boxes (it’s just another compare, integer this time since we convertred the MinX array to ints) and exclude them from the test (that’s the extra “andnps” - it excludes the boxes with mMinX > MaxLimit). This one is sligthly annoying because SSE2 has no unsigned integer compares, only signed, and my initial “MungeFloat” function produced unsigned integers. I fixed it up to produce signed integers that are currently ordered in an earlier commit (id 95eaaaac).

This version also needs to convert boxes to SoA layout in the first place, which in this version is just done in straight scalar C++ code.

Note that in this version, we’re doing unaligned loads from the box arrays. That’s because our loop counters point at an arbitrary box ID and we’re going over boxes one by one. This is not ideal but not a showstopper in the SSE version; however, it posed major problems for…

The AVX version
—————

As I wrote in “notes_avx.txt” before I started, “if the cache can keep up (doubt it!)”. Turns out that on the (Sandy Bridge) i7-2600K I’m writing this on, writing AVX code is a great way to run into L1 cache bandwidth limits.

The basic problem is that Sandy Bridge has full 256-bit AVX execution, but “only” 128-bit wide load/store units (two loads and one store per cycle). A aligned 256-bit access keeps the respective unit busy for 2 cycles, unaligned 256b accesses are three (best case).

In short, if you’re using 256-bit vectors on SNB, it’s quite easy to swamp the L1 cache with requests, and that’s what we end up doing.

The initial AVX version worked, but ended up being slightly slower than the (new) SSE2 version. Not very useful.

To make it competitive, it needed to switch to aligned memory operations. Luckily, in this case, it turns out to be fairly easy: we just make sure to allocate the initial array 32-byte aligned (and make sure the distance between arrays is a multiple of 32 bytes as well, to make sure that if one of the pointers is aligned, they all are), and then make sure to get us to a 32-byte aligned address before we enter the main loop.

So that’s what the first real AVX version (commit 19146649) did. I found it a bit simpler to round the current box pointer *down* to a multiple of 32, not up. This makes the first few lanes garbage if we weren’t already 32-byte aligned; we can deal with this by masking them out, the same way we handled the mMinX > MaxLimit lanes in the tail code.

And with the alignment taken care of, the AVX version was now at 3700 Kcycles for the test on my home machine, compared to about 6200 Kcycles for the SSE2 version! Success.

Cleaning up the setup
———————

At this point, the setup code is starting to be a noticeable part of the overall time, and in particular the code to transform the box array from AoS to SoA was kind of ratty. However, transforming from AoS to SoA is a bog-standard problem and boils down to using 4×4 matrix transposition in this case. So I wrote the code to do it using SIMD instructions instead of scalar too, for about another 200 Kcycles savings (in both the AVX and SSE2 vers).

Reporting intersections (second version)
—————————————-

After that, I decided to take a look with VTune and discovered that the AVX version was spending about 10% of its time in ReportIntersections, and accruing pretty significant branch mis-prediction penalties along the way.

So, time to make it less branchy.

As a first step, added some code so that instead of writing to “Container& pairs” directly, I get to amortize the work. In particular, I want to only do the “is there enough space left” check *once* per group of (up to) 8 pairs, and grow the container if necessary to make sure that we can insert those 16 pairs without further checks. That’s what “PairOutputBuffer” is for. It basically grabs the storage from the given Container while it’s in scope, maintains our (somewhat looser) invariants, and is written for client code that just wants to poke around in the pointers directly, so there’s no data hiding here. That was finalized in commit 389bf503, and decreases the cost of ReportIntersections slightly.

Next, we switch to outputting the intersections all at once, using SIMD code. This boils down to only storing the vector lanes that have their corresponding mask bit set. This is a completely standard technique. Nicely enough, Andreas Frediksson has written it up so I don’t have to:

https://deplinenoise.files.wordpress.com/2015/03/gdc2015_afredriksson_simd.pdf

(The filter/”left packing” stuff). AVX only exists on machines that also have PSHUFB and POPCNT so we can use both.

This indeed reduced our CPU time by another ~250 Kcycles, putting the AVX version at about 3250 Kcycles! And that’s where it currently still is on this machine. (That version was finalized in commit id f0ca3dc1).

Porting back improvements to SSE2/Intrinsics code
————————————————-

Finally, I decided to port back some of these changes to the SSE2 code and the C++ intrinsics version. In particular, port the alignment trick from AVX to SSE2 for a very significant win in commit d92dd5f9, and use the SSE2 version of the left-packing trick in commit 69baa1f1 (I could’ve used SSSE3 there, but I didn’t want to gratuitously increase the required SSE version).

And then I later ported the left-packing trick for output to the C++ intrinsics version as well. (The alignment trick does not help in the intrinsics ver, since the compiler is not as good about managing registers and starts tripping all over its feet when you do it.)



Thank you Ryg.

That is quite a lot of stuff to digest. I guess we can first look at the results on my machines. There are 3 different versions in Fabian’s last submits:

  • an SSE2 version using intrinsics
  • an SSE2 version using assembly
  • an AVX version using assembly

And I now have 3 different PCs available: my old home desktop PC (one of the two used in the initial posts for this serie, the other one died), a new office desktop PC, and a new home laptop PC. So that’s 9 results to report. Here they are (new entries in bold):

New office PC – Intel i7-6850K

Timings (K-Cycles)

Overall X factor

Version2 - base

66245

1.0

Version3 – don’t trust the compiler

65644

-

Version4 - sentinels

58706

-

Version5 – hardcoding axes

55560

-

Version6a – data-oriented design

46832

-

Version6b – less cache misses

39681

-

Version7 – integer cmp

36687

-

Version8 – branchless overlap test

23701

-

Version9a - SIMD

18758

-

Version9b – better SIMD

10065

-

Version9c – data alignment

10957

-

Version10 – integer SIMD

12352

-

Version11 – the last branch

11403

-

Version12 - assembly

7197

-

Version13 – asm converted back to C++

8434

-

Version14a – loop unrolling

7511

-

Version14b – Ryg unrolled assembly 1

5094

~13.00

Version14c – better unrolling

5375

-

Version14d – integer cmp 2

5452

~12.15

Version15a – SSE2 intrinsics

5676

~11.67

Version15b – SSE2 assembly

3924

~16.88

Version15c – AVX assembly

2413

~27.45

Home laptop – Intel i5-3210M

Timings (K-Cycles)

Overall X factor

Version2 - base

62324

1.0

Version3 – don’t trust the compiler

59250

-

Version4 - sentinels

54368

-

Version5 – hardcoding axes

52196

-

Version6a – data-oriented design

43848

-

Version6b – less cache misses

37755

-

Version7 – integer cmp

36746

-

Version8 – branchless overlap test

28206

-

Version9a - SIMD

22693

-

Version9b – better SIMD

11351

-

Version9c – data alignment

11221

-

Version10 – integer SIMD

11110

-

Version11 – the last branch

10871

-

Version12 - assembly

9268

-

Version13 – asm converted back to C++

9248

-

Version14a – loop unrolling

9009

-

Version14b – Ryg unrolled assembly 1

5040

~12.36

Version14c – better unrolling

5301

-

Version14d – integer cmp 2

5011

~12.43

Version15a – SSE2 intrinsics

5641

~11.04

Version15b – SSE2 assembly

4074

~15.29

Version15c – AVX assembly

2587

~24.09

Home desktop PC

Timings (K-Cycles)

Overall X factor

Version2 - base

98822

1.0

Version3 – don’t trust the compiler

93138

-

Version4 - sentinels

81834

-

Version5 – hardcoding axes

78140

-

Version6a – data-oriented design

60579

-

Version6b – less cache misses

41605

-

Version7 – integer cmp

40906

-

Version8 – branchless overlap test

31383

-

Version9a - SIMD

34486

-

Version9b – better SIMD

32565

-

Version9c – data alignment

14802

-

Version10 – integer SIMD

16667

-

Version11 – the last branch

14512

-

Version12 - assembly

11731

-

Version13 – asm converted back to C++

12236

-

Version14a – loop unrolling

9012

-

Version14b – Ryg unrolled assembly 1

7600

-

Version14c – better unrolling

7558

-

Version14d – integer cmp 2

7386

~13.79

Version15a – SSE2 intrinsics

16981

~5.81

Version15b – SSE2 assembly

6657

~14.84

Version15c – AVX assembly

Crash (AVX not supported)

0

So first, we see that the performance of the SSE2 intrinsics version is quite different depending on where you run it. It is fine on my more recent machines, but it is quite bad on my older home desktop PC, where it is roughly similar to version 9c in terms of speed. That is a clear regression compared to our latest unrolled versions. I did not investigate what the problem could be, because even on modern PCs the performance is ultimately not as good as our best “version 14″. On the other hand, this version (let’s call it 15a) is not as ugly-looking as 14c or 14d. So provided we could fix its performance issue on the home desktop PC, it could be an interesting alternative which would remain somewhat portable.

Then, we have version 15b. This one is pretty good and offers a clear speedup over our previous SSE2 versions. This is interesting because it shows that without going all the way to AVX (i.e. without losing compatibility with some machines), the AVX “philosophy” if you want (re-organizing the code to be AVX-friendly) still has some potential performance gains. Ideally, we would be able to get these benefits in the C++ code as well. Admittedly this may not be easy since this is essentially what 15a failed to do, but we might be able to try again and come up with a better 15a implementation. Somehow.

Finally, version 15c is the actual AVX version. Unsurprisingly if I bypass the AVX check and run it on my home desktop PC, it crashes - since that PC does not support AVX. On the other hand, on the machines that do support it, performance is awesome: we get pretty much the advertised 2X speedup over regular SIMD that AVX was supposed to deliver. And thus, with this, we are now about 24X to 27X faster than the original code. Think about that next time somebody claims that low-level optimizations are not worth it anymore, and that people should instead focus on multi-threading the code: you would need a 24-core processor and perfect scaling to reach an equivalent speedup with multi-threading…

So, where do we go from here?

I had plans for where to move this project next, but now a whole bunch of new questions arose.

Some of these optimizations like the one done for reporting intersections in a less branchy way seem orthogonal to AVX, and could potentially benefit our previous versions as well. The way these AVX versions have been delivered, combining multiple new optimizations into the same new build, it is slightly unclear how much the performance changed with each step (although Ryg’s notes do give us a clue). I usually prefer doing one optimization at a time (each in separate builds) to make things clearer. In any case, we could try to port the improved code for reporting intersections to version 14 and see what happens.

Fabian initially claimed that the design did not translate well to 8-wide, and thus we had to switch to SoA. I didn’t give it much thoughts but I am not sure about this yet. I think the dismissed non-existing version where we would test 2 boxes at a time with AVX could still give us some gains. The movemask test becomes slightly more expensive, yes, but we still drop half of the loading+compare+movemask instructions. That must count for something? Maybe I am being naive here.

Another thing to try would be to dive into the disassembly of version 15a, figure out why it performs badly on the home desktop PC, and fix it / improve it. That could be a worthwhile goal because I really didn’t want to move further into assembly land. Quite the opposite: one of the planned next posts was about checking the effects of these optimizations on ARM and different architectures. Assembly versions are a show-stopper there - we cannot even have inline assembly on Win64 these days. So at the very least versions 15b and 15c give us new targets and show us what is possible in terms of performance. But I will have difficulties keeping them around for long.

And this brings us to the obvious question about 15c: could we try it using AVX intrinsics instead of assembly? That could be a way to keep some portability (at least between Win32 and Win64) while still giving us some of the AVX performance gains.

Another thing that comes to mind is that we saw in part 3 that the sorting was costing us at best 140 K-Cycles (and in reality, much more). This was negligible at the time, but Ryg’s latest optimizations were about saving ~200 K-Cycles, so this part is becoming relevant again. One strategy here, if we don’t want to deal with this just yet, could be to reset the test and use more boxes. We used 10000 boxes so far, but we could just add a 0 there and continue our journey.

Beyond that, I had further optimizations planned for the whole project, which are completely orthogonal to AVX. So I could just continue with that and ignore the AVX versions for now. A new goal could be for me to reach the same performance as the AVX assembly version, but using another way, with regular SSE intrinsics.

Here is a potential TODO list for further reports:

  1. Try the optimized intersections report in version 14.
  2. Try an AVX version that tests 2 boxes at a time without SoA.
  3. Analyze the 15a disassembly and try to fix it (make it fast on my home desktop PC).
  4. Try a version using AVX intrinsics.
  5. Revisit the sorting code (and general setup code) in version 14/15.
  6. Go ahead with the initial plan and further non-AVX-related optimizations (that’s at least 3 more blog posts there).
  7. Once it’s done, merge AVX optimizations to these new versions for further speedup.
  8. When applicable, check the performance of all these versions on other platforms / architecture. That’s when having a separate build per optimization pays off: do we have some optimizations that hurt instead of help on some platforms?
  9. Investigate how the performance varies and which version is the fastest when the number of objects changes.
  10. Explain what it takes to productize this and make it useful in a real physics engine (in particular: how you deal with sleeping objects and how you report new and deleted pairs instead of all of them).
  11. Field test.


I don’t know yet what I will try next, or when, but it seems that there is indeed, more than ever, still a lot to do.

What we learnt:

AVX works. It does not happen every day but it can make your code 2X faster than regular SSE.

You might need assembly for that to happen though. Like it or not, assembly wins again today.

Do not ignore low level optimizations. In our case there was a 24X performance gain on the table (so far), without changing the algorithm, without multi-threading. Typical multi-threading will give you much less than that.

GitHub code for version 15 is here.

I apologize for the lack of new material, it is really just the same as what Fabian published a year ago. I did a minor modification to be able to run the SSE2 versions on AVX-enabled machines, so you now have to select the desired version with a define in the code.

Box pruning revisited - part 14e - a bugfix and a reboot

Wednesday, May 2nd, 2018

Part 14e – a bugfix and a reboot

More than a year after version 14d, I finally come back to this project for a brief update.

So, basically, a year ago my office PC died, and it rendered half of the published timings obsolete and irrelevant. I got a replacement PC, with a more recent CPU and different features (AVX in particular). I thought I would pause the project for maybe a month (a lot of things ended up on my plate at work), but before I knew it a year had disappeared.

Oh well. That’s life. Especially when you have kids.

Another thing that happened is that I found a bug in version 14d. The code computing the box index in the bipartite case was wrong (off by one error). Since I had no validity test for the bipartite case, and because the reported number of pairs was correct, I did not notice that one for a while.

Right. So here is version 14e, which is pretty much the same as version 14d except:

  • The bug has been fixed.
  • A validity test for the bipartite case has been added.
  • The bipartite case has been refactored to avoid duplicating the loop.

The bugfix has no impact on performance. We only tracked the performance of the “complete box pruning” codepath anyway.

What does have an impact on performance though, is the new PC. I had no choice but re-measure all versions on this new machine. The results are listed in Table 1.

While I was at it, I ran the same tests on my wife’s laptop at home. I should have tried that sooner: turns out her laptop is more advanced than my desktop PC – it has AVX, and it is much faster! The results for that laptop are listed in Table 2.

We only keep what we previously called “safe” versions now. We do not list the timings for the “unsafe” versions anymore. I also dropped the Delta and Speedup columns to make things easier.

New office PC – Intel i7-6850K

Timings (K-Cycles)

Overall X factor

Version2 - base

66245

1.0

Version3 – don’t trust the compiler

65644

-

Version4 - sentinels

58706

-

Version5 – hardcoding axes

55560

-

Version6a – data-oriented design

46832

-

Version6b – less cache misses

39681

-

Version7 – integer cmp

36687

-

Version8 – branchless overlap test

23701

-

Version9a - SIMD

18758

-

Version9b – better SIMD

10065

-

Version9c – data alignment

10957

-

Version10 – integer SIMD

12352

-

Version11 – the last branch

11403

-

Version12 - assembly

7197

-

Version13 – asm converted back to C++

8434

-

Version14a – loop unrolling

7511

-

Version14b – Ryg unrolled assembly 1

5094

~13.00

Version14c – better unrolling

5375

-

Version14d – integer cmp 2

5452

~12.15

Table 1 – results for new office desktop PC

Home laptop – Intel i5-3210M

Timings (K-Cycles)

Overall X factor

Version2 - base

62324

1.0

Version3 – don’t trust the compiler

59250

-

Version4 - sentinels

54368

-

Version5 – hardcoding axes

52196

-

Version6a – data-oriented design

43848

-

Version6b – less cache misses

37755

-

Version7 – integer cmp

36746

-

Version8 – branchless overlap test

28206

-

Version9a - SIMD

22693

-

Version9b – better SIMD

11351

-

Version9c – data alignment

11221

-

Version10 – integer SIMD

11110

-

Version11 – the last branch

10871

-

Version12 - assembly

9268

-

Version13 – asm converted back to C++

9248

-

Version14a – loop unrolling

9009

-

Version14b – Ryg unrolled assembly 1

5040

~12.36

Version14c – better unrolling

5301

-

Version14d – integer cmp 2

5011

~12.43

Table 2 – results for home laptop

We can see that the new machines are faster overall than the ones we used before, but overall the results are pretty similar to what we previously saw.

What we learnt:

A bug can remain invisible for a year, even when the code is public on GitHub. I guess nobody tried to use it.

Time passes way too quickly.

We are back on track.

GitHub code for part 14e

Radix Redux

Tuesday, March 20th, 2018

Quick little experiment on GitHub:

https://github.com/Pierre-Terdiman/RadixRedux

Related small article is here.

I’m afraid I don’t have a lot of time these days.

Box pruning revisited - part 14d - integer comparisons redux

Tuesday, March 7th, 2017

Part 14d – integer comparisons redux

In this part, we complete the port of Fabian Giesen’s code (version 14b) to C++.

In part 7 we replaced float comparisons with integer comparisons (a very old trick), but dismissed the results because the gains were too small to justify the increase in code complexity. However we made the code significantly faster since then, so the relatively small gains we got at the time might be more interesting today.

Moreover, Fabian’s version uses integers for the X’s values only. This may also be a more interesting strategy than using them for everything, like in version 7.

Let’s try!

Replicating this in the C++ version is trivial. Read part 7 again for the details. The only difference is that Fabian uses a different function to encode the floats:

// Munge the float bits to return produce an unsigned order-preserving
// ranking of floating-point numbers.
// (Old trick: http://stereopsis.com/radix.html FloatFlip, with a new
// spin to get rid of -0.0f)
// In /fp:precise, we can just calc “x + 0.0f” and get what we need.
// But fast math optimizes it away. Could use #pragma float_control,
// but that prohibits inlining of MungeFloat. So do this silly thing
// instead.
float g_global_this_always_zero = 0.0f;
static inline udword MungeFloat(float f)
{
union
{
float f;
udword u;
sdword s;
} u;
u.f = f + g_global_this_always_zero; // NOT a nop! Canonicalizes -0.0f to +0.0f
udword toggle = (u.s >> 31) | (1u << 31);
return u.u ^ toggle;
}

While my version from part 7 was simply:

static __forceinline udword encodeFloat(udword ir)
{
if(ir & 0×80000000) //negative?
return ~ir;//reverse sequence of negative numbers
else
return ir | 0×80000000; // flip sign
}

So it’s pretty much the same: given the same input float, the two functions return the same integer value, except for -0.0f. But that’s because Fabian’s code transforms -0.0f to +0.0f before doing the conversion to integer, it’s not a side-effect of the conversion itself.

This is not really needed, since both the sorting code and the pruning code can deal with -0.0f just fine. However it is technically more correct, i.e. more in line with what float comparisons would give, since with floats a positive zero is equal to a negative zero. So it is technically more correct to map both to the same integer value – which is what Fabian’s code does.

In practice, it means that my version could produce “incorrect” results when positive and negative zeros are involved in the box coordinates. But it would only happen in edge cases where boxes exactly touch, so it would be as “incorrect” as our “unsafe” versions from past posts, and we already explained why they weren’t a big issue. Still, Fabian’s version is technically superior, even if the code does look a bit silly indeed - but in a cute kind of way.

Now a perhaps more interesting thing to note is that Fabian’s version (well, Michael Herf’s version I suppose) is branchless. So could it be measurably faster?

Without further ado, here are the results on my machines – new entries in bold letters:

Home PC

Timings (K-Cycles)

Delta (K-Cycles)

Speedup

Overall X factor

Version2 - base

98822

0

0%

1.0

Version13 - safe

12236

~2200

~15%

~8.07

Version14b – Ryg/Unsafe

7600

~4100

~35%

~13.00

Version14c - safe

7558

~4600

~38%

~13.07

Version14d - P

7211

~340

~4%

~13.70

Version14d - F

7386

~170

~2%

~13.37

Office PC

Timings (K-Cycles)

Delta (K-Cycles)

Speedup

Overall X factor

Version2 - base

92885

0

0%

1.0

Version13 - safe

10053

~2500

~20%

~9.23

Version14b – Ryg/Unsafe

7641

~2300

~23%

~12.15

Version14c - safe

7255

~2700

~27%

~12.80

Version14d - P

7036

~210

~3%

~13.20

Version14d - F

6961

~290

~4%

~13.34

Version 14d uses integer comparisons for X’s. The P variant uses Pierre’s encoding function (“encodeFloat”), while the F variant uses Fabian’s (“MungeFloat”). The deltas are computed against Version 14c this time, to measure the speedup due to integer comparisons (rather than the speedup due to loop unrolling + integer comparisons).

The first thing we see is that indeed, using integer comparisons is measurably faster. This is not a big surprise since we saw the same in Version 7. But the gains are still very small (in particular, smaller than the theoretical 6% predicted by Ryg’s analysis) and to be honest I would probably still ignore them at this point. But using integers just for X’s is easy and doesn’t come with the tedious switch to integer SIMD intrinsics, so it’s probably not a big deal to keep them in that case.

On the other hand…

For some reason encodeFloat” is faster on my home PC, while on my office PC it’s slower (and “MungeFloat” wins). This is unfortunate and slightly annoying. This is the kind of complications that I don’t mind dealing with if the gains are important, but for such small gains it starts to be a lot of trouble for minor rewards. I suppose I could simply pick up Ryg’s version because it’s more correct, and call it a day. That gives a nicely consistent overall X factor (13.37 vs 13.34) on the two machines.

And with this, we complete the port of Fabian’s assembly version to C++. Our new goal has been reached: we’re faster than version 14b now… at least on these two machines.

What we learnt:

An optimization that didn’t provide “significant gains” in the past might be worth revisiting after all the other, larger optimizations have been applied.

Similarly, we are slowly reaching a point where small differences in the setup code become measurable and worth investigating. There might be new optimization opportunities there. For example the question we previously dismissed about what kind of sorting algorithm we should use might soon come back to the table.

In any case, for now we reached our initial goal (make the code an order of magnitude faster), and we reached our secondary goal (make the C++ code faster than Ryg’s assembly version).

Surely we’re done now!?

How much longer can we keep this going?

Well… Don’t panic, but there is still a lot to do.

Stay tuned!

GitHub code for part 14d

Box pruning revisited - part 14c - that’s how I roll

Friday, March 3rd, 2017

Part 14c – that’s how I roll

Our goal today is to look at the details of what Fabian “Ryg” Giesen did in version 14b (an assembly version), and replicate them in our C++ unrolled version (14a) if possible.

First, let’s get one thing out of the way: I will not switch back to integer comparisons in this post. I like to do one optimization at a time, as you can probably tell by now, so I will leave this stuff for later. This means we can ignore the MungeFloat function and the integer-related changes in Fabian’s code.

Then, the first thing you can see is that the code has been separated in two distinct loops: a fast one (starting with the FastLoop label), and a safe one (starting with the CarefulLoop label).

One problem when unrolling the initial loop is that we don’t know ahead of time how many iterations we will have to do (it can stop at any time depending on the value of X we read from the buffer). It is much easier to unroll loops that are executed a known number of times when the loop starts.

Sometimes in this situation, one can use what I call the “radix sort strategy”: just use two passes. Count how many iterations or items you will have to deal with in a first pass, then do a second pass taking advantage of the knowledge. That’s what a radix-sort does, creating counters and histograms in a first pass. But that kind of approach does not work well here (or at least I didn’t manage to make it work).

Fabian’s approach is to just “look ahead” and check that the buffer still has at least 4 valid entries. If it does, he uses the “fast loop”. Otherwise he falls back to the “safe loop”, which is actually just our regular non-unrolled loop from version 12. In order to look ahead safely, the sentinel values are replicated as many times as we want to unroll the loop. This is a rather simple change in the non-assembly part of the code. First there:

SIMD_AABB_X* BoxListX = new SIMD_AABB_X[nb+5];

And then there:

BoxListX[nb+1].mMinX = ~0u;
BoxListX[nb+2].mMinX = ~0u;
BoxListX[nb+3].mMinX = ~0u;
BoxListX[nb+4].mMinX = ~0u;

That’s not assembly so no problem porting this bit to the C++ version.

Now, the “fast loop” is fast for three different reasons. First, it is unrolled four times, getting rid of the corresponding branching instructions – same as in our version 14a. Second, because we looked ahead and we know the four next input values are all valid, the tests against the MaxLimit value can also be removed. And finally, the idea we wanted to test at the end of 14a has also been implemented, i.e. we don’t need to increase the Offset value for each box (we can encode that directly into the address calculation).

At the end of the day, the core loop in Fabian’s version is thus:

// Unroll 0
movaps xmm3, xmmword ptr [edx+ecx*2+0] // Box1YZ
cmpnleps xmm3, xmm2
movmskps eax, xmm3
cmp eax, 0Ch
je FoundSlot0

// Unroll 1
movaps xmm3, xmmword ptr [edx+ecx*2+16] // Box1YZ
cmpnleps xmm3, xmm2
movmskps eax, xmm3
cmp eax, 0Ch
je FoundSlot1

// Unroll 2
movaps xmm3, xmmword ptr [edx+ecx*2+32] // Box1YZ
cmpnleps xmm3, xmm2
movmskps eax, xmm3
cmp eax, 0Ch
je FoundSlot2

// Unroll 3
movaps xmm3, xmmword ptr [edx+ecx*2+48] // Box1YZ
add ecx, 32 // Advance
cmpnleps xmm3, xmm2
movmskps eax, xmm3
cmp eax, 0Ch
jne FastLoop

That is only 5 instructions per box, compared to the 8 we got in version 14a. Color-coding it reveals what happened: in the same way that we moved the green blocks out of the loop in version 14a, Fabian’s version moved the blue blocks out of the (fast) loop. There is only one surviving blue instruction, to increase our offset only once for 4 boxes.

Pretty neat.

In our C++ code it would mean that the two lines marked in bold letters would / should vanish from our BLOCK macro:

Now another difference is that since we don’t increase the offset each time, we cannot jump to the same address at each stage of the unrolled code. You can see that in Fabian’s code, which jumps to different labels (FoundSlot0, FoundSlot1, FoundSlot2, or FastFoundOne). This is easy to replicate in C++ using goto. If you don’t want to use goto, well, good luck.

And that’s pretty much it. Let’s try to replicate this in C++.

As we said, replicating the setup code is trivial (it was already done in C++).

For the safe loop, we are actually going to use our previous unrolled VERSION3 from part 14a. In that respect this is an improvement over Fabian’s code: even our safe loop is unrolled. From an implementation perspective it couldn’t be more trivial: we just leave the code from part 14a as-is, and start writing another “fast” unrolled loop just before – the fallback to the safe loop happens naturally.

Now for our fast loop, we transform the BLOCK macro as expected from the previous analysis:

As we mentioned, the lines previously marked in bold vanished. Then we added two extra parameters: one (“x”) to include the offset directly in the address calculation (as we wanted to do at the end of version 14a, and as is done Fabian’s code), and another one (“label”) to make the code jump to a different address like in the assembly version.

Now, one small improvement over Fabian’s code is that we will put the “overlap found” code before the fast loop starts, not after it ends. That’s what we did in version 14a already, and it saves one jump.

Another improvement is that we’re going to unroll 5 times instead of 4, as we did in version 14a. That’s where using BLOCK macros pays off: unrolling one more time is easy and doesn’t expand the code too much.

After all is said and done, the code becomes:

I know what you’re going to say (hell, I know what you did say after I posted a preview of part 14): it looks horrible.

Sure, sure. But once again: see through the C++, and check out the disassembly for our fast loop:

001E30B0 comiss xmm2,dword ptr [edi+esi+28h]
001E30B5 jb StartLoop4+12Fh (01E31D4h)
{
BLOCK4(0, FoundOverlap0)
001E30BB movaps xmm0,xmmword ptr [ecx-20h]
001E30BF cmpnltps xmm0,xmm1
001E30C3 movmskps eax,xmm0
001E30C6 cmp eax,0Fh
001E30C9 je StartLoop4+9Bh (01E3140h)
BLOCK4(8, FoundOverlap1)
001E30CB movaps xmm0,xmmword ptr [ecx-10h]
001E30CF cmpnltps xmm0,xmm1
001E30D3 movmskps eax,xmm0
001E30D6 cmp eax,0Fh
001E30D9 je StartLoop4+8Bh (01E3130h)
BLOCK4(16, FoundOverlap2)
001E30DB movaps xmm0,xmmword ptr [ecx]
001E30DE cmpnltps xmm0,xmm1
001E30E2 movmskps eax,xmm0
001E30E5 cmp eax,0Fh
001E30E8 je StartLoop4+7Bh (01E3120h)
BLOCK4(24, FoundOverlap3)
001E30EA movaps xmm0,xmmword ptr [ecx+10h]
001E30EE cmpnltps xmm0,xmm1
001E30F2 movmskps eax,xmm0
001E30F5 cmp eax,0Fh
001E30F8 je StartLoop4+6Dh (01E3112h)
// BLOCK4(32, FoundOverlap4)
Offset += 40;
BLOCK4(-8, FoundOverlap)
001E30FA movaps xmm0,xmmword ptr [ecx+20h]
001E30FE add ecx,50h
001E3101 cmpnltps xmm0,xmm1
001E3105 add esi,28h
001E3108 movmskps eax,xmm0
001E310B cmp eax,0Fh
001E310E jne StartLoop4+0Bh (01E30B0h)
}
001E3110 jmp StartLoop4+0ABh (01E3150h)

That’s pretty much perfect.

We get an initial comiss instruction instead of cmp because we didn’t bother switching X’s to integers, and we see the loop has been unrolled 5 times instead of 4, but other than that it’s virtually the same as Fabian’s code, which is what we wanted.

We get the following results:

Home PC

Timings (K-Cycles)

Delta (K-Cycles)

Speedup

Overall X factor

Version2 - base

98822

0

0%

1.0

(Version12 – 2nd)

(11731)

(~2600)

(~18%)

(~8.42)

Version13 - safe

12236

~2200

~15%

~8.07

Version14a - VERSION3

9012

~3200

~26%

~10.96

Version14b – Ryg/Unsafe

7600

~4100

~35%

~13.00

Version14c - safe

7558

~4600

~38%

~13.07

Office PC

Timings (K-Cycles)

Delta (K-Cycles)

Speedup

Overall X factor

Version2 - base

92885

0

0%

1.0

(Version12 – 2nd)

(10014)

(~2500)

(~20%)

(~9.27)

Version13 - safe

10053

~2500

~20%

~9.23

Version14a - VERSION3

8532

~1500

~15%

~10.88

Version14b – Ryg/Unsafe

7641

~2300

~23%

~12.15

Version14c - safe

7255

~2700

~27%

~12.80

The deltas in the results are compared to version 13, similar to what we did for version 14a.

Thanks to our small improvements, this new version is actually faster than version 14b (at least on my machines) – without using integers! As a bonus, this is based on the “safe” version 14a rather than the “unsafe” version 12.

What we learnt:

Once again the assembly version showed us the way. I am not sure I would have “seen” how to do this one without an assembly model I could copy.

Ugly C++ can generate pretty good looking assembly – and vice versa.

Unrolling is like SIMD: tricky. It’s easy to get gains from some basic unrolling but writing the optimal unrolled loop is quite another story.

Stay tuned. In the next post we will complete our port of Fabian’s code to C++, and revisit integer comparisons.

GitHub code for part 14c

Box pruning revisited - part 14b - Ryg rolling

Friday, March 3rd, 2017

Part 14b – Ryg rolling

After I wrote about this project on Twitter, Fabian “Ryg” Giesen picked it up and made it his own. For those who don’t know him, Fabian works for RAD Game Tools and used to be / still is a member of Farbrausch. In other words, we share the same demo-scene roots. And thus, it is probably not a surprise that he began hacking the box-pruning project after I posted version 12 (the assembly version).

Now, at the end of part 14a we thought we could still improve the unrolled code by taking advantage of the address calculation to get rid of some more instructions. As it turns out, Fabian’s code does that already.

And much more.

Since he was kind enough to write some notes about the whole thing, I will just copy-paste his own explanations here. This is based on my assembly version (i.e. box pruning version 12), and this is just his initial attempt at optimizing it. He did a lot more than this afterwards. But let’s do one thing at a time here.

In his own words:

—-

Brief explanation what I did to get the speed-up, and the thought process behind it.

The original code went:

My first suggestion was to restructure the loop slightly so the hot “no overlap” path is straight-line and the cold “found overlap” path has the extra jumps. This can help instruction fetch behavior, although in this case it didn’t make a difference. Nevertheless, I’ll do it here because it makes things easier to follow:

Alright, so that’s a nice, sweet, simple loop. Now a lot of people will tell you that out-of-order cores are hard to optimize for since they’re “unpredictable” or “fuzzy” or whatever. I disagree: optimizing for out-of-order cores is *easy* and far less tedious than say manual scheduling for in-order machines is. It’s true that for OoO, you can’t just give a fixed “clock cycles per iteration” number, but the same thing is already true for *anything* with a cache, so who are we kidding? The reality of the situation is that while predicting the exact flow uops are gonna take through the machine is hard (and also fairly pointless unless you’re trying to build an exact pipeline simulator), quantifying the overall statistical behavior of loops on OoO cores is often easier than it is for in-order machines. Because for nice simple loops like this, it boils down to operation counting - total number of instructions, and total amount of work going to different types of functional units. We don’t need to worry about scheduling; the cores can take care of that themselves, and the loop above has no tricky data dependencies between iterations (the only inter-iteration change is the “add ecx, 8″, which doesn’t depend on anything else in the loop) so everything is gonna work fine on that front. So, on to the counting. I’m counting two things here: 1. “fused domain” uops (to a first-order approximation, this means “instructions as broken down by the CPU front-end”) and 2. un-fused uops going to specific groups of functional units (”ports”), which is what the CPU back-end deals with. When I write “unfused p0″, I mean an unfused uop that has to go to port 0. “unfused 1 p23″ is an unfused uop that can go to ports 2 or 3 (whichever happens to be free). I’m using stats for the i7-2600K in my machine (Intel Sandy Bridge); newer CPUs have slightly different (but still similar) breakdowns. Now without further ado, we have:

(yes, the pair of x86 instructions cmp+je combines into one fused uop.)

Fused uops are the currency the CPU frontend deals in. It can process at most 4 of these per cycle, under ideal conditions, although in practice (for various reasons) it’s generally hard to average much more than 3 fused uops/cycle unless the loop is relatively short (which, luckily, this one is). All the ports can accept one instruction per cycle.

So total, we have:

And of that total, the actual box pruning test (the first 5 x86 instructions) are 4 fused uops, 3 unfused p015 and 1 unfused p23 - a single cycle’s worth of work. In other words, we spend more than half of our execution bandwidth on loop overhead. That’s no good.

Hence, unroll 4x. With that, provided there *are* at least 4 boxes to test against in the current cluster, we end up with:

Our bottleneck are once again ports 0,1,5, but they now process 4 candidate pairs in 5.33 cycles worth of work, whereas they took 9.33 cycles worth of work before. So from that analysis, we expect something like a 42.8% reduction in execution time, theoretical. Actual observed reduction was 34.4% on my home i7-2600K (12038 Kcyc -> 7893 Kcyc) and 42.9% on my work i7-3770S (8990 Kcyc -> 5131 Kcyc). So the Sandy Bridge i7-2600K also runs into some other limits not accounted for in this (very simplistic!) analysis whereas the i7-3770S behaves *exactly* as predicted.

The other tweak I tried later was to switch things around so the box X coordinates are converted to integers. The issue is our 2-fused-uop COMISS, which we’d like to replace with a 1-fused-uop compare. Not only is the integer version fewer uops, the CMP uop is also p015 instead of the more constrained p0+p1 for COMISS.

What would we expect from that? Our new totals are:

From the back-of-the-envelope estimate, we now go from purely backend limited to simultaneously backend and frontend limited, and we’d expect to go from about 5.33 cycles/iter to 5 cycles/iter, for a 6.2% reduction.

And indeed, on my work i7-3770S, this change gets us from 5131 Kcyc -> 4762 Kcyc, reducing the cycle count by 7.2%. Close enough, and actually a bit better than expected!

This example happens to work out very nicely (since it has a lot of arithmetic and few branch mispredictions or cache misses), but the same general ideas apply elsewhere. Who says that out-of-order cores are so hard to predict?

—-

Right. Thank you Fabian. That was certainly a… rigorous explanation.

Here are a few comments that come to mind:

  • It is certainly true that manually pairing the assembly code for the U and V pipelines of the first Pentiums (which I did a lot in the past) was far more tedious than letting the out-of-order processors deal with it for me.
  • It boils down to operation counting indeed. That’s what we noticed in the previous posts: reducing the total number of instructions has a measurable impact on performance in this case.
  • I did try to restructure the loop to remove the jump from the hot path, but as you noticed as well it didn’t make any difference. But as a side-effect of another goal (reducing the size of the main loop), the hot path became jump-free in version 14a anyway.
  • Using integers is what we tried in version 7 already. While we did measure gains, they were too small to matter and we ignored them. That being said, version 7 took 40000+ KCycles… so the gains might have been small compared to the total cost at the time, but if we still get the same gains today it might be a different story. In other words, going from 5131 to 4762 K-Cycles is just a 369 K-Cycles gain: peanuts compared to 40000, but probably worth it compared to 4000. And yes, using integers for X’s only may also be a better idea than using them for everything. So we will revisit this and see what happens in the C++ version.

In any case, here are the timings for Ryg’s version on my machines:

Home PC

Timings (K-Cycles)

Delta (K-Cycles)

Speedup

Overall X factor

Version2 - base

98822

0

0%

1.0

(Version12 – 2nd)

(11731)

(~2600)

(~18%)

(~8.42)

Version14a - VERSION3

9012

~3200

~26%

~10.96

Version14b – Ryg/Unsafe

7600

~4100

~35%

~13.00

Office PC

Timings (K-Cycles)

Delta (K-Cycles)

Speedup

Overall X factor

Version2 - base

92885

0

0%

1.0

(Version12 – 2nd)

(10014)

(~2500)

(~20%)

(~9.27)

Version14a - VERSION3

8532

~1500

~15%

~10.88

Version14b – Ryg/Unsafe

7641

~2300

~23%

~12.15

The Delta and Speedup columns are computed between Ryg’s version and the previous best assembly version. The Timings and Overall X factor columns are absolute values that you can use to compare Ryg’s version to our initial C++ unrolled version (14a). The comparisons are not entirely apple-to-apple:

  • Versions 12 and 14b are “unsafe”, version 14a is “safe”.
  • Versions 12 and 14b are assembly, version 14a is C++.
  • Version 14b does more than unrolling the loop, it also switches some floats to integers.

So the comparisons might not be entirely “fair” but it doesn’t matter: they give a good idea of what kind of performance we can achieve in “ideal” conditions where the compiler doesn’t get in the way.

It gives a target performance number to reach.

And that’s perfect really because we just reached our previous performance target (10x!) in the previous post.

So we need a new one. Perfect timing to send me new timings.

Well, gentlemen, here it is: our new goal is to reach the same performance as Ryg’s unrolled assembly version, but using only C++ / intrinsics – to keep things somewhat portable.

This is what we will try to do next time.

Stay tuned!

What we learnt:

It is always a good idea to make your code public and publish your results. More often than not you get good feedback and learn new things in return. Sometimes you even learn how to make your code go faster.

Ex-scene coders rule. (Well we knew that already, didn’t we?)

GitHub code for part 14b

Box pruning revisited - Part 14a - loop unrolling

Friday, March 3rd, 2017

Part 14a – loop unrolling

The main codepath from version 13 had only 9 assembly instructions, and there wasn’t much room left for improvements in terms of reducing the instructions count. Nonetheless we saw that doing so had a measurable impact on performance. At the same time, we identified a problem with the code alignment of our main loop, which was apparently not properly handled by the compiler.

In that situation, a typical thing to do is unroll the loop. It reduces the amount of branch instructions needed to loop. And of course, misaligned loops are less of a problem if we loop less.

Loop unrolling is pretty much the oldest trick in the book, but it still has its place today. On the Atari ST for example, things were pretty simple: any removed instruction gave an immediate performance gain. There were no cache effects, the code always took a very predictable amount of time, and the more you unrolled, the faster it became. No surprises. This led to certain abuses where loops got unrolled thousands of times, creating huge blocks of code that quickly consumed all the available memory. In the old days of TASM on PC, the REPT / ENDM directives could be used to unroll the code and achieve similar speedups.

These days however, processors are a little bit more complicated, subtle, unpredictable. And loop unrolling is not always a win. In particular, if you unroll too much, the code can become slower. The previously mentioned Intel optimization manual has some details about this, in chapter 3.4.1.7. It became more difficult to foresee and explain, but the basic strategy remains as simple as in the past, as we will see just now.

Our code at the end of version 13 looked like this:

If we hide the safe/unsafe version business behind a trivial SIMD_OVERLAP_TEST macro, unrolling that loop twice can be as easy as simply copy-pasting the inner block, like this (VERSION1_UNROLL2 in the code):

This is as easy as it gets, and in this specific case it already provides performance gains! However, unroll it exactly the same way but three times (VERSION1_UNROLL3 in the code) and it becomes slower again. That’s a neat illustration of what we said just before: unrolling the loop too much can hurt performance. Here is a quick summary of timings for the office PC, safe version (new entries in bold letters):

Office PC

Timings (K-Cycles)

Delta (K-Cycles)

Speedup

Overall X factor

Version2 - base

92885

0

0%

1.0

Version13 - safe

10053

~2500

~20%

~9.23

Version14a - VERSION1_UNROLL2

9368

~685

~6%

~9.91

(Version14a - VERSION1_UNROLL3)

(10263)

-

-

-

So what is wrong with unrolling the loop three times here? Looking at the disassembly reveals a potential problem: the part that outputs the pair is replicated three times inside the loop, making the code quite large. The Intel optimization manual warns about this, for example stating that “unrolling can be harmful if the unrolled loop no longer fits in the trace cache”. Whether this is what actually happens here or not is not terribly relevant: the important point is that it gives us a new theory to test, and a new direction for our next attempt. Refactoring the code so that the block that outputs the pair moves out of the loop gives birth to version 2, here unrolled twice (VERSION2_UNROLL2):

As you can see it is very similar to the previous version: the only difference is that the part where an overlap is found has been clearly moved out of the way, before the loop even starts. When an overlap is found, we jump there and fallback to the start of the loop.

The presence of goto should not be frowned upon: as I’ve said before, you must see through the C++, and consider the generated assembly. Generally speaking the assembly is shock-full of jump instructions (i.e. goto) anyway. And after version 13 where we tried to mimic the desired assembly in C++, the appearance of goto here should not be a surprise.

In any case: it’s 2017, I don’t have the time or energy to discuss once again whether goto should be considered harmful. It’s a tool. It serves a purpose. If you can avoid it, do so, but if you cannot, don’t be ashamed. People for whom this is a deal-breaker can stick to version 13. Or maybe look out for a future version in a future post that would perhaps drop them – I don’t know yet.

Now, as we did with version 1, we can try to unroll the loop twice (VERSION2_UNROLL2) and three times (VERSION2_UNROLL3). The results for the office PC / safe version are:

Office PC

Timings (K-Cycles)

Delta (K-Cycles)

Speedup

Overall X factor

Version2 - base

92885

0

0%

1.0

Version13 - safe

10053

~2500

~20%

~9.23

Version14a - VERSION1_UNROLL2

9368

~680

~6%

~9.91

(Version14a - VERSION1_UNROLL3)

(10147)

-

-

-

Version14a - VERSION2_UNROLL2

9051

~1000

~10%

~10.26

Version14a - VERSION2_UNROLL3

8802

~1200

~12%

~10.55

(Version14a - VERSION2_UNROLL4)

(9638)

-

-

-

All “Version 14a” entries are compared to Version 13 here, not to each-other.

What we see seems to confirm the theory. Looking at the disassembly, the output-pair-related code did move out of the loop, making it tighter overall. As a result, unrolling version 2 three times still gives performance gains. But if we try to unroll it one more time (VERSION2_UNROLL4), we see the performance drop again. It’s the same pattern as with version 1: at some point the unrolled code apparently becomes too large and starts to be slower.

In any case, the assembly for VERSION2_UNROLL3 looks pretty good:

01103002 comiss xmm1,dword ptr [edi+esi]
01103006 jb LoopStart+96h (01103098h)
0110300C movaps xmm0,xmmword ptr [ecx+esi*2]
01103010 cmpnltps xmm0,xmm2
01103014 movmskps eax,xmm0
01103017 cmp eax,0Fh
0110301A je LoopStart+51h (01103053h)
0110301C add esi,8

0110301F comiss xmm1,dword ptr [edi+esi]
01103023 jb LoopStart+96h (01103098h)
01103025 movaps xmm0,xmmword ptr [ecx+esi*2]
01103029 cmpnltps xmm0,xmm2
0110302D movmskps eax,xmm0
01103030 cmp eax,0Fh
01103033 je LoopStart+51h (01103053h)
01103035 add esi,8

01103038 comiss xmm1,dword ptr [edi+esi]
0110303C jb LoopStart+96h (01103098h)
0110303E movaps xmm0,xmmword ptr [ecx+esi*2]
01103042 cmpnltps xmm0,xmm2
01103046 movmskps eax,xmm0
01103049 cmp eax,0Fh
0110304C je LoopStart+51h (01103053h)
0110304E add esi,8

01103051 jmp LoopStart (01103002h)

It is color-coded the same way as in previous posts, to show where all the bits and pieces moved. The compiler did exactly what we wanted, unrolling the code three times in the obvious way. As expected, the green block from version 13 (the part that outputs pairs) moved out of the loop. Perhaps as a result, the compiler finally stopped spilling our registers to the stack, and we now use 8 instructions per test (as opposed to 9 in version 13).

So the good news is that we found an easy way to make the code faster. And in fact, this version is a significant milestone: we just reached our 10X speedup target!

The bad news is that this code size limit might vary from one platform to the next, so without testing on each of them there is no guarantee that what we are doing here actually helps. It might very well be the case that VERSION2_UNROLL3 is faster on PC for example, but slower on some console – or on some other PC with another processor. In that respect, loop unrolling is a perhaps more fragile and debatable optimization than things like avoiding cache misses, etc, which are more universally beneficial. Still, we started the questionable optimizations with version 13 already, and this isn’t worse.

To test various amount of unrolling more easily on each platform, it is perhaps useful to capture the common bits in a macro. This is what VERSION2_UNROLL does, and it admittedly starts to look slightly evil:

The last goto has been added because the initial while directive disappeared, replaced with the if within the BLOCK macro. Amazingly, this “innocent change” is enough to significantly change the generated assembly. Fortunately it doesn’t change the timings - this form is perhaps even a tiny bit faster:

Office PC

Timings (K-Cycles)

Delta (K-Cycles)

Speedup

Overall X factor

Version2 - base

92885

0

0%

1.0

Version13 - safe

10053

~2500

~20%

~9.23

Version14a - VERSION1_UNROLL2

9368

~680

~6%

~9.91

(Version14a - VERSION1_UNROLL3)

(10147)

-

-

-

Version14a - VERSION2_UNROLL2

9051

~1000

~10%

~10.26

Version14a - VERSION2_UNROLL3

8802

~1200

~12%

~10.55

(Version14a - VERSION2_UNROLL4)

(9638)

-

-

-

Version14a - VERSION2_UNROLL

8738

~1300

~13%

~10.63

It looks quite fine.

However… running the same tests on the home PC exposes the main issue with loop unrolling:

Home PC

Timings (K-Cycles)

Delta (K-Cycles)

Speedup

Overall X factor

Version2 - base

98822

0

0%

1.0

Version13 - safe

12236

~2200

~15%

~8.07

Version14a - VERSION1_UNROLL2

10573

~1600

~13%

~9.34

Version14a - VERSION1_UNROLL3

11388

~840

~7%

~8.67

Version14a - VERSION2_UNROLL2

11290

~940

~7%

~8.75

Version14a - VERSION2_UNROLL3

10425

~1800

~14%

~9.47

Version14a - VERSION2_UNROLL4

9598

~2600

~21%

~10.29

Version14a - VERSION2_UNROLL

9691

~2500

~20%

~10.19

The results are similar…. yet so subtly different. In particular:

  • Unrolling version 2 four times (VERSION2_UNROLL4) still provides clear performance gains, while it made the code slower on the office PC.
  • VERSION2_UNROLL4 and VERSION2_UNROLL are almost the same speed… even though one is unrolled one more time! In theory VERSION2_UNROLL should be equivalent to VERSION2_UNROLL3 (as on the office PC). But for some unknown reason it is clearly faster here.

These observations gave birth to a new version, VERSION3, which copies VERSION2_UNROLL but unrolls the loop five times – the unrolling limit on the home PC.

And lo and behold…

Home PC

Timings (K-Cycles)

Delta (K-Cycles)

Speedup

Overall X factor

Version2 - base

98822

0

0%

1.0

Version13 - safe

12236

~2200

~15%

~8.07

Version14a - VERSION1_UNROLL2

10573

~1600

~13%

~9.34

Version14a - VERSION1_UNROLL3

11388

~840

~7%

~8.67

Version14a - VERSION2_UNROLL2

11290

~940

~7%

~8.75

Version14a - VERSION2_UNROLL3

10425

~1800

~14%

~9.47

Version14a - VERSION2_UNROLL4

9598

~2600

~21%

~10.29

Version14a - VERSION2_UNROLL

9691

~2500

~20%

~10.19

Version14a - VERSION3

9012

~3200

~26%

~10.96

That’s quite a speedup! I cannot explain it but I’ll take it. Running this new version on the office PC gives:

Office PC

Timings (K-Cycles)

Delta (K-Cycles)

Speedup

Overall X factor

Version2 - base

92885

0

0%

1.0

Version13 - safe

10053

~2500

~20%

~9.23

Version14a - VERSION1_UNROLL2

9368

~680

~6%

~9.91

(Version14a - VERSION1_UNROLL3)

(10147)

-

-

-

Version14a - VERSION2_UNROLL2

9051

~1000

~10%

~10.26

Version14a - VERSION2_UNROLL3

8802

~1200

~12%

~10.55

(Version14a - VERSION2_UNROLL4)

(9638)

-

-

-

Version14a - VERSION2_UNROLL

8738

~1300

~13%

~10.63

Version14a - VERSION3

8532

~1500

~15%

~10.88

So the gains for VERSION3 are minimal on this machine, but we got lucky: at least this version isn’t slower. Sometimes you end up with different “best” versions on different machines. That is probably the biggest problem with loop unrolling: until you test “everywhere”, selecting a “winner” is a bit like a shot in the dark.

What one can do in difficult cases is to look at the numbers: sometimes the best version on one machine is still slower (in absolute number of cycles) than sub-optimal versions on another machine. So one strategy here is to select the fastest version on the slowest machine, and accept whatever results it gives on the faster machine. That way you make sure that your choices don’t penalize the machines that need the optimizations the most.

In our case here, we got lucky anyway: VERSION3 wins, and this is the version we will continue to work with.

But what is there left to do anyway?

Well, looking at the previous disassembly, another obvious thing to try pops up: increasing esi for each box seems useless. We could use the relevant offsets in the address calculations, increase esi only once in the end, thus saving some extra instructions and making the code tighter.

That’s the plan for the next post. However, before tackling this in the C++ version, we will revisit the assembly version thanks to a special guest.

What we learnt:

Loop unrolling still has its uses in 2017.

But it is a lot trickier than in the past. The optimal amount of unrolling varies from one machine to the next, and unrolling too much can actually make the code run slower.

We reached our 10X speedup target!

GitHub code for part 14a

Box pruning revisited - part 14 (preview)

Friday, February 17th, 2017

Ski holiday for me!

Here’s a preview of part 14 to keep you occupied meanwhile.

Box pruning revisited - part 13 - code alignment

Friday, February 17th, 2017

Part 13 – code alignment

Last time we rewrote the main loop in assembly, and found a way to make it run faster.

Knowing what we know now, can we rewrite the C++ version and get the same performance gains? Can we make it generate what we want? That is what we will explore today.

In the assembly version our main new trick was to use the same offset to index both box arrays, and then reuse that same offset again to lazy-evaluate one of the box indices when an overlap occurs.

In the C++ version this looks like a rather nebulous affair since, well, there was only one index in the first place:

It’s hard to see how to “improve” this.

But as I often say, you should not focus on what the C++ code looks like. You should see through it, and think in terms of generated assembly. In that respect, that code can be improved.

We know from last time that this single “Index1++” does in fact generate three separate “adds” used for three different things. To avoid this, take a leap of faith: start from the assembly and translate it back to C++ in the most direct way.

————–

These lines from version 12:

Now become:

————–

These:

Now become:

————–

The single offset:

xor ecx, ecx

Becomes:

udword Offset = 0;

————–

The loop:

Is transformed back into a regular while loop, using the single offset:

————–

The SIMD overlap test:

Now uses _mm_cmplt_ps in the C++ code as well, with swapped arguments, and our single offset again:

————–

And finally the part that outputs the pair:

Is reproduced in C++ this way:

…with “outputPair” the same non-inlined function as in the assembly code.

————–

The resulting C++ code is thus:

It looks a bit strange, and I suppose some would say it’s not “safe”, but….

Does it make any difference?

Home PC:

Complete test (brute force): found 11811 intersections in 781701 K-cycles.
14495 K-cycles.
13701 K-cycles.
13611 K-cycles.
13597 K-cycles.
13611 K-cycles.
13600 K-cycles.
13597 K-cycles.
14091 K-cycles.
13718 K-cycles.
13599 K-cycles.
13728 K-cycles.
13635 K-cycles.
13596 K-cycles.
13598 K-cycles.
13638 K-cycles.
13628 K-cycles.
Complete test (box pruning): found 11715 intersections in 13596 K-cycles.

Office PC:

Complete test (brute force): found 11811 intersections in 807508 K-cycles.
13324 K-cycles.
12650 K-cycles.
14015 K-cycles.
12647 K-cycles.
12646 K-cycles.
13743 K-cycles.
12648 K-cycles.
12647 K-cycles.
13057 K-cycles.
13643 K-cycles.
12684 K-cycles.
12648 K-cycles.
13095 K-cycles.
13330 K-cycles.
12645 K-cycles.
12648 K-cycles.
Complete test (box pruning): found 11715 intersections in 12645 K-cycles.

Not really.

Let’s recap where we stand now:

On the home PC we see that version 13 is perhaps a tiny bit better than version 11 now, but not as good as the assembly version:

C++ (version 11, unsafe)

14372

Assembly (version 12, unsafe)

11731

C++ (version 13, unsafe)

13596

But on the office PC, it’s a wash:

C++ (version 11, unsafe)

12570

Assembly (version 12, unsafe)

10014

C++ (version 13, unsafe)

12645

Was it all for nothing then?

Wait, wait. Don’t forget to check the disassembly:

00C93020 cmpltps xmm1,xmmword ptr [ecx+esi*2]
00C93025 movmskps eax,xmm1
00C93028 cmp eax,0Ch
00C9302B jne CompleteBoxPruning+292h (0C93052h)
00C9302D push dword ptr [esp+20h]
00C93031 mov eax,edi
00C93033 push dword ptr [pairs]
00C93036 sub eax,edx
00C93038 add eax,esi
00C9303A sar eax,3
00C9303D push eax
00C9303E push dword ptr [esp+3Ch]
00C93042 call outputPair (0C92580h)
00C93047 mov edx,dword ptr [esp+24h]
00C9304B mov ecx,dword ptr [esp+44h]
00C9304F add esp,10h
00C93052 movss xmm0,dword ptr [esp+38h]
00C93058 movaps xmm1,xmmword ptr [esp+40h]
00C9305D add esi,8
00C93060 comiss xmm0,dword ptr [edi+esi]
00C93064 jae CompleteBoxPruning+260h (0C93020h)

Well at least it does look quite nicer and tighter than before.

The three blocks have been color-coded like in previous versions. The main codepath is only 9 instructions (against 13 in version 11 and 8 in version 12).

There is a single add (00C9305D) in the third block, instead of three before: that’s what we wanted. In that respect, the strategy worked: we did trick the compiler into generating a single add!

But unfortunately both xmm0 and xmm1 are constantly reloaded from the stack in the third block, and that is probably what kills it. Ignore these two instructions, and the code is pretty much what we wanted.

The assembly version (our target code) was like this:

movaps xmm3, xmm2
cmpltps xmm3, xmmword ptr [edx+ecx*2]
movmskps eax, xmm3

And if we move one line from the third block we see that we just got:

00C93058 movaps xmm1,xmmword ptr [esp+40h]
00C93020 cmpltps xmm1,xmmword ptr [ecx+esi*2]
00C93025 movmskps eax,xmm1

So close. But no cigar.

Now what? How do you tell the compiler to stop spilling xmm registers to the stack?

Well: I don’t know. There is probably no way.

But you can try the same strategy as before: when you’re stuck, try random changes. Mutate the code. In particular, you can replace this:

if(_mm_movemask_ps(_mm_cmplt_ps(b, _mm_load_ps(box)))==12)

With that:

if(_mm_movemask_ps(_mm_cmpnle_ps(_mm_load_ps(box), b))==12)

It looks like a rather innocent change, but amazingly, it generates the following assembly:

00AE2FA0 movaps xmm0,xmmword ptr [ecx+esi*2]
00AE2FA4 cmpnleps xmm0,xmm1
00AE2FA8 movmskps eax,xmm0
00AE2FAB cmp eax,0Ch
00AE2FAE jne CompleteBoxPruning+29Ah (0AE2FDAh)
00AE2FB0 push dword ptr [esp+20h]
00AE2FB4 mov eax,edi
00AE2FB6 push dword ptr [pairs]
00AE2FB9 sub eax,edx
00AE2FBB add eax,esi
00AE2FBD sar eax,3
00AE2FC0 push eax
00AE2FC1 push dword ptr [esp+3Ch]
00AE2FC5 call outputPair (0AE2500h)
00AE2FCA movaps xmm1,xmmword ptr [esp+50h]
00AE2FCF mov edx,dword ptr [esp+24h]
00AE2FD3 mov ecx,dword ptr [esp+44h]
00AE2FD7 add esp,10h
00AE2FDA movss xmm0,dword ptr [esp+38h]
00AE2FE0 add esi,8
00AE2FE3 comiss xmm0,dword ptr [edi+esi]
00AE2FE7 jae CompleteBoxPruning+260h (0AE2FA0h)

One of the loads from the stack vanished: it moved from 00C93058 in the previous version to 00AE2FCA now, i.e. it moved from the frequent codepath to the infrequent one. The arguments swap allowed the compiler to do so, because xmm1 is not destroyed by the comparison anymore. The resulting code is now almost exactly the same as our assembly version: the only difference is that xmm0 is reloaded from the stack in the third block (00AE2FDA), while it should instead be kept in a constant xmm register.

This doesn’t look like much, but this innocent change produces a measurable performance gain compared to the version just before. Our revisited C++ version, while still not as neat as the assembly version, is now clearly faster than what we got in version 11:

Home PC (unsafe version):

Complete test (brute force): found 11811 intersections in 782053 K-cycles.
12972 K-cycles.
12162 K-cycles.
12152 K-cycles.
12152 K-cycles.
12422 K-cycles.
12207 K-cycles.
12154 K-cycles.
12151 K-cycles.
12174 K-cycles.
12153 K-cycles.
12148 K-cycles.
12398 K-cycles.
12184 K-cycles.
12152 K-cycles.
12150 K-cycles.
13039 K-cycles.
Complete test (box pruning): found 11725 intersections in 12148 K-cycles.

Office PC (unsafe version):

Complete test (brute force): found 11811 intersections in 815031 K-cycles.
11409 K-cycles.
10029 K-cycles.
11243 K-cycles.
10726 K-cycles.
10055 K-cycles.
10784 K-cycles.
10588 K-cycles.
10548 K-cycles.
10290 K-cycles.
10029 K-cycles.
10408 K-cycles.
10464 K-cycles.
10030 K-cycles.
10475 K-cycles.
10028 K-cycles.
10028 K-cycles.
Complete test (box pruning): found 11725 intersections in 10028 K-cycles.

The gains are summarized here:

Home PC

Timings (K-Cycles)

Delta (K-Cycles)

Speedup

Overall X factor

(Version1)

(101662)

Version2 - base

98822

0

0%

1.0

Version3

93138

~5600

~5%

~1.06

Version4

81834

~11000

~12%

~1.20

Version5

78140

~3600

~4%

~1.26

Version6a

60579

~17000

~22%

~1.63

Version6b

41605

~18000

~31%

~2.37

(Version7)

(40906)

-

-

-

(Version8)

(31383)

(~10000)

(~24%)

(~3.14)

Version9a

34486

~7100

~17%

~2.86

Version9b - unsafe

32477

~2000

~5%

~3.04

Version9b - safe

32565

~1900

~5%

~3.03

Version9c - unsafe

16223

~16000

~50%

~6.09

Version9c - safe

14802

~17000

~54%

~6.67

(Version10)

(16667)

-

-

-

Version11 - unsafe

14372

~1800

~11%

~6.87

Version11 - safe

14512

~200

~2%

~6.80

(Version12 - 1st)

(14309)

-

-

~6.90

(Version12 – 2nd)

(11731)

(~2600)

(~18%)

(~8.42)

Version13 - unsafe

12148

~2200

~15%

~8.13

Office PC

Timings (K-Cycles)

Delta (K-Cycles)

Speedup

Overall X factor

(Version1)

(96203)

Version2 - base

92885

0

0%

1.0

Version3

88352

~4500

~5%

~1.05

Version4

77156

~11000

~12%

~1.20

Version5

73778

~3300

~4%

~1.25

Version6a

58451

~15000

~20%

~1.58

Version6b

45634

~12000

~21%

~2.03

(Version7)

(43987)

-

-

-

(Version8)

(29083)

(~16000)

(~36%)

(~3.19)

Version9a

31864

~13000

~30%

~2.91

Version9b - unsafe

15097

~16000

~52%

~6.15

Version9b - safe

15116

~16000

~52%

~6.14

Version9c - unsafe

12707

~2300

~15%

~7.30

Version9c - safe

12562

~2500

~16%

~7.39

(Version10)

(15648)

-

-

-

Version11 - unsafe

12570

~100

~1%

~7.38

Version11 - safe

12611

-

-

~7.36

(Version12 – 1st)

12917

-

-

~7.19

(Version12 – 2nd)

(10014)

(~2500)

(~20%)

(~9.27)

Version13 - unsafe

10028

~2500

~20%

~9.26

The gains are computed between version 13 and version 11. Version 12 is now ignored: we can do pretty much the same without the need for assembly.

But doing the assembly version allowed us to find this hidden optimization. And I had to present the assembly version first, to explain where this was coming from. If I would have switched from version 11 to version 13 directly, it would have looked like black magic.

Now, we only got an “unsafe” version 13, since it just got converted from version 12, for which we only had an “unsafe” version. Fine. So we complete the C++ code as we did before e.g. in version 11, for the safe version. The only change in the inner loop is this:

This generates the proper number of pairs, and the disassembly is exactly the same as for the unsafe version, except cmpnleps is replaced with cmpnltps. But it is otherwise the exact same assembly, same instructions, no extra spilling to the stack, all the same.

And yet…

Office PC (safe version):

Complete test (brute force): found 11811 intersections in 805609 K-cycles.
12524 K-cycles.
11874 K-cycles.
11726 K-cycles.
11726 K-cycles.
11941 K-cycles.
11884 K-cycles.
12073 K-cycles.
11725 K-cycles.
11756 K-cycles.
12752 K-cycles.
12267 K-cycles.
12534 K-cycles.
12274 K-cycles.
12588 K-cycles.
11726 K-cycles.
12184 K-cycles.
Complete test (box pruning): found 11811 intersections in 11725 K-cycles.

Madness. Why is this one slower?

I came back and forth between the assembly and the C++, looking for something I could have overlooked. And then I saw it.

The start of the most inner loop was not aligned on a 16-bytes boundary.

That…. was weird. I could have sworn that the compiler was always aligning the loops. I mean that’s where all these countless nops and lea esp,[esp] instructions come from. I see them all the time in the disassembly. That’s what you’re supposed to do, right? I had this rule in mind but was suddenly unsure where it was coming from. A quick bit of googling revealed it was still a perfectly valid rule, see for example the Intel optimization manual, chapter 3.4.1.5, it says right there:

All branch targets should be 16-byte aligned.

But the compiler apparently does not do that. Or at least: not always. I have no idea why. But that was my new theory anyway: the safe version is slower because the loop is not aligned.

Unfortunately I don’t know how to tell the compiler to align a loop in C++. So just for testing, I added back the same line as in the assembly version before the loop:

_asm align 16

“You’ll never believe what happened next!”

Office PC (safe version):

Complete test (brute force): found 11811 intersections in 846230 K-cycles.
11174 K-cycles.
10626 K-cycles.
10375 K-cycles.
10549 K-cycles.
10053 K-cycles.
10340 K-cycles.
10727 K-cycles.
10709 K-cycles.
10536 K-cycles.
10643 K-cycles.
10578 K-cycles.
10379 K-cycles.
11036 K-cycles.
10793 K-cycles.
10462 K-cycles.
10993 K-cycles.
Complete test (box pruning): found 11811 intersections in 10053 K-cycles.

At that point I was starting to sweat.

Because provided the analysis is correct and this is not just because of something entirely different that I cannot see, then:

  • I have no idea how to fix this without using at least one line of assembly.
  • I have no idea if this random alignment affected the previous versions, whose timings may all be questionable now.

In fact, we had a bit of an unsolved mystery at the end of version 9c, where for some reason the safe & unsafe versions at home showed a measurable performance difference. Rings a bell? That might have been because of this. The safe version adds two instructions in the code flow before the loop starts, so that’s enough to change its alignment…

With the enforced loop alignment, both safe and unsafe versions show the same timings both at home and in the office. Having to use one line of assembly is a bit unfortunate since I was trying to convert back everything to something more portable. That being said, this issue is obviously very compiler-dependent, so maybe that extra line doesn’t need porting – other compilers on other platforms might do the right thing immediately. Who knows?

Not a rhetorical question: seriously, who knows about that stuff? Email me and tell me.

In any case, after adding that line the timings become:

Home PC

Timings (K-Cycles)

Delta (K-Cycles)

Speedup

Overall X factor

(Version1)

(101662)

Version2 - base

98822

0

0%

1.0

Version3

93138

~5600

~5%

~1.06

Version4

81834

~11000

~12%

~1.20

Version5

78140

~3600

~4%

~1.26

Version6a

60579

~17000

~22%

~1.63

Version6b

41605

~18000

~31%

~2.37

(Version7)

(40906)

-

-

-

(Version8)

(31383)

(~10000)

(~24%)

(~3.14)

Version9a

34486

~7100

~17%

~2.86

Version9b - unsafe

32477

~2000

~5%

~3.04

Version9b - safe

32565

~1900

~5%

~3.03

Version9c - unsafe

16223

~16000

~50%

~6.09

Version9c - safe

14802

~17000

~54%

~6.67

(Version10)

(16667)

-

-

-

Version11 - unsafe

14372

~1800

~11%

~6.87

Version11 - safe

14512

~200

~2%

~6.80

(Version12 - 1st)

(14309)

-

-

~6.90

(Version12 – 2nd)

(11731)

(~2600)

(~18%)

(~8.42)

Version13 - unsafe

12296

~2000

~14%

~8.03

Version13 - safe

12236

~2200

~15%

~8.07

Office PC

Timings (K-Cycles)

Delta (K-Cycles)

Speedup

Overall X factor

(Version1)

(96203)

Version2 - base

92885

0

0%

1.0

Version3

88352

~4500

~5%

~1.05

Version4

77156

~11000

~12%

~1.20

Version5

73778

~3300

~4%

~1.25

Version6a

58451

~15000

~20%

~1.58

Version6b

45634

~12000

~21%

~2.03

(Version7)

(43987)

-

-

-

(Version8)

(29083)

(~16000)

(~36%)

(~3.19)

Version9a

31864

~13000

~30%

~2.91

Version9b - unsafe

15097

~16000

~52%

~6.15

Version9b - safe

15116

~16000

~52%

~6.14

Version9c - unsafe

12707

~2300

~15%

~7.30

Version9c - safe

12562

~2500

~16%

~7.39

(Version10)

(15648)

-

-

-

Version11 - unsafe

12570

~100

~1%

~7.38

Version11 - safe

12611

-

-

~7.36

(Version12 – 1st)

12917

-

-

~7.19

(Version12 – 2nd)

(10014)

(~2500)

(~20%)

(~9.27)

Version13 - unsafe

10168

~2400

~19%

~9.13

Version13 - safe

10053

~2500

~20%

~9.23

I will freely admit that for this post, I have no big confidence in my analysis. Forcing the compiler to align the loop seems to fix the issue and provide consistent results.

If nothing else, this whole post shows that one can sometimes tame the compiler and make it generate something closer to the ideal. Even though it’s difficult. And fragile. And not easily maintainable. And results will vary from compiler to compiler. Yeah, ok, fine: it’s as much luck as engineering at this point. All of this is true.

Still, converting the assembly back to C++ remains a fascinating experiment, and the results are there. You may not like the way the C++ code looks now, but this version is almost as fast as the hand-written assembly version.

Pheew.

Optimization: that’s a 24/7 job. Just look at how much time we can spend on one simple function…

What we learnt:

The assembly version showed us the way. While technically it still “wins”, the compiler-generated version stroke back and came very close in terms of performance. The generated code is almost the same as our hand-written assembly.

We reached a point where a single extra movaps in the loop has a measurable cost.

The compiler does not seem to always align loops, so some of them are effectively randomly aligned. Any extra instruction before the loop in the code flow can change the alignment and thus affect performance.

Thus there is no such thing as an “innocent change”.

Are we there yet, Papa Smurf?

No, I’m afraid we are not.

GitHub code for part 13.

Box pruning revisited - part 12 - ASM FTW

Wednesday, February 15th, 2017

Part 12 – ASM FTW

Fasten your seatbelts; this one is going to be a bit bumpy. Last time we were left with unanswered questions and doubts about the quality of the compiler-generated code.

Those who know me probably saw it coming from a mile away: this time we’ll rewrite the loop in assembly. Old-school, baby.

It is not always useful, but it is often informative and eye-opening.

To make things easier, we will limit ourselves to a small part of the loop. We are not going to rewrite the whole function. Just this snippet:

And we’re only going to do the “unsafe” version, since that’s (slightly) less work.

The easiest way to do the conversion is to use inline assembly in Visual Studio (you know now why this whole experiment is using a Win32 project - inline assembly is not supported anymore in Win64 builds).

You just comment out lines and start replacing them with blocks of _asm code. The assembly code there can directly access some of the C++ variables, it’s all rather simple. Sort of.

For example this:

Becomes:

Then this:

Becomes for example:

And once the loop is done and you double-checked it iterates the proper number of times (it’s easy to compare against the C code when both are right there in the same file), you just add the missing bits.

The SIMD overlap test:

…will need some code before the loop to compute the initial address:

And then the actual test inside the loop will be something like:

At this point you don’t immediately write the code to output the pair. First you just increase a counter (using a free register like maybe ebx) and you count the number of overlaps you get. Then you printf the results and compare it to the C code. Is the number correct? Good, then you can now complete the function and actually output the pair. Writing a push_back in assembly is tedious and pointless, so you don’t actually need to do it. Just move the corresponding code to a C function like this:

And then call it from the assembly block, like this:

So there are a couple of subtleties in there.

The called function must have a __cdecl signature, because we sent parameters to it via the stack. Alternatively, compile the whole project with the /Gd compile option (it’s the default anyway).

If you omit __cdecl in the function signature and compile the whole program with the __fastcall convention (/Gr compile option), then it will crash when trying to call “outputPair“.

The parameters are passed in reverse order, that’s the convention. Note how you can directly use the Remap, pairs and RIndex0 variables from the assembly code. They’re all C++ variables, the compiler knows what to do. It’s very convenient. After the call the “add esp, 16” fixes the stack pointer - it’s 16 because we pushed four 4-bytes elements to the stack before calling the function.

Now the function call is wrapped by a pushad/popad couple. It’s the lazy man’s way to save and restore all registers to/from the stack. This makes sure that whatever happens in outputPair, it’s not going to modify the values kept in our CPU registers.

And then the movaps on xmm1 and xmm2 do the same save & restore for the two xmm registers we care about. These ones are easy to forget because most of the time outputPair does not change the xmm registers… but it does when there is a re-allocation within the container. If you are not aware of that, this can be a hard-to-find bug.

Right.

If you put all this together you get this first, “naive” version:

Let’s see how it fares against the questions raised in part 11.

Question 1 was: why write the code in such a way that it destroys the constant box?

Answer: because we don’t have a choice. We do the same in our assembly version: xmm3 is also the constant box, and it’s also destroyed by the cmpltps instruction, exactly like in the compiler-generated version. It turns out that there is no alternative. We used this intrinsic:

_mm_cmpgt_ps

But what we got in the assembly was:

cmpltps

Do you spot the difference? It’s “greater than” in the intrinsic, and “lower than” in the actual assembly. It’s the intrinsic trap again. You can see it here for example. The intrinsic is marked as CMPLTPSr, which means there is no actual cmpgtps instruction: it does not exist. So the compiler uses the “reverse” instruction instead (CMPLTPS) and swaps the arguments. That gives birth to the code we saw.

Beyond that, even if cmpgtps would exist, one way or another the comparison instruction would overwrite the contents of one register. It can be either our constant box, or the freshly loaded data for the new box. And both alternatives come down to the same amount of instructions: one load, and one cmp. We cannot do less.

So, it’s a wash. The compiler was not to blame here, our question was simply naïve.

———-

Question 2 was: why reload the constant box from the stack? Why not keeping it in a register? The whole loop only uses two xmm registers (!) so it’s not like there’s a shortage of them.

=> We fixed this. The constant box is now kept in xmm2, which is only saved and restored when an overlap occurs. We win.

———-

Question 3 was: why do you reload it from the stack each time? It’s a constant for the whole loop. There are free available xmm registers. Just put it in xmm2, keep it there, and the “movss” instruction above just vanishes.

=> We fixed this. This was about MaxLimit, which is now kept in xmm1. The “movss” disappeared indeed. We win.

———-

Question 4 was about the BoxListX array being constantly written to the stack (and sometimes reloaded from it, with VC10).

=> We fixed this. It’s now kept in the esi register at all time. We win.

Question 5 was: why do you save edx to the stack all the time (address 003E3028)? Why don’t you just save and restore it only when an overlap actually occurs?

Question 6 was: same question for ecx.

=> We fixed this. We don’t write any CPU registers to the stack, unless an overlap occurs. We win.

———-

Sounds good!

The most often used codepath is now 10 instructions, compared to 13 before. Exciting!

But then you run the benchmark and…

Home PC:

Complete test (brute force): found 11811 intersections in 781774 K-cycles.
15193 K-cycles.
14574 K-cycles.
14399 K-cycles.
14310 K-cycles.
14619 K-cycles.
14366 K-cycles.
14310 K-cycles.
14322 K-cycles.
14310 K-cycles.
14310 K-cycles.
14309 K-cycles.
14740 K-cycles.
14346 K-cycles.
14311 K-cycles.
14375 K-cycles.
14374 K-cycles.
Complete test (box pruning): found 11715 intersections in 14309 K-cycles.

Office PC:

Complete test (brute force): found 11811 intersections in 812640 K-cycles.
13814 K-cycles.
13636 K-cycles.
13544 K-cycles.
13745 K-cycles.
13547 K-cycles.
13610 K-cycles.
13549 K-cycles.
13423 K-cycles.
12929 K-cycles.
13498 K-cycles.
13052 K-cycles.
13213 K-cycles.
12917 K-cycles.
13166 K-cycles.
13510 K-cycles.
13817 K-cycles.
Complete test (box pruning): found 11715 intersections in 12917 K-cycles.

The gains are summarized here:

Home PC

Timings (K-Cycles)

Delta (K-Cycles)

Speedup

Overall X factor

(Version1)

(101662)

Version2 - base

98822

0

0%

1.0

Version3

93138

~5600

~5%

~1.06

Version4

81834

~11000

~12%

~1.20

Version5

78140

~3600

~4%

~1.26

Version6a

60579

~17000

~22%

~1.63

Version6b

41605

~18000

~31%

~2.37

(Version7)

(40906)

-

-

-

(Version8)

(31383)

(~10000)

(~24%)

(~3.14)

Version9a

34486

~7100

~17%

~2.86

Version9b - unsafe

32477

~2000

~5%

~3.04

Version9b - safe

32565

~1900

~5%

~3.03

Version9c - unsafe

16223

~16000

~50%

~6.09

Version9c - safe

14802

~17000

~54%

~6.67

(Version10)

(16667)

-

-

-

Version11 - unsafe

14372

~1800

~11%

~6.87

Version11 - safe

14512

~200

~2%

~6.80

Version12

14309

-

-

~6.90

Office PC

Timings (K-Cycles)

Delta (K-Cycles)

Speedup

Overall X factor

(Version1)

(96203)

Version2 - base

92885

0

0%

1.0

Version3

88352

~4500

~5%

~1.05

Version4

77156

~11000

~12%

~1.20

Version5

73778

~3300

~4%

~1.25

Version6a

58451

~15000

~20%

~1.58

Version6b

45634

~12000

~21%

~2.03

(Version7)

(43987)

-

-

-

(Version8)

(29083)

(~16000)

(~36%)

(~3.19)

Version9a

31864

~13000

~30%

~2.91

Version9b - unsafe

15097

~16000

~52%

~6.15

Version9b - safe

15116

~16000

~52%

~6.14

Version9c - unsafe

12707

~2300

~15%

~7.30

Version9c - safe

12562

~2500

~16%

~7.39

(Version10)

(15648)

-

-

-

Version11 - unsafe

12570

~100

~1%

~7.38

Version11 - safe

12611

-

-

~7.36

Version12

12917

-

-

~7.19

…it’s pretty much exactly the same speed as before, or even marginally slower.

You didn’t think it would be that easy, did you?

It’s not.

But it was important to go through this process: it gives a better understanding of why the compiler made certain choices, and it shows why conventional wisdom says that it is hard to beat the compiler these days. Most of the time, the hand-written version is indeed slower than the compiler-generated version.

That would be the conclusion you would stop at, if this would be the conclusion you wanted to reach.

Well…

Old habits die hard and ex-demomakers cannot stop here I’m afraid.

The reality is that you cannot do worse than the compiler: just start from the compiler-generated version. How can you do worse if you copy what it did and improve from there?

The real benefit of the assembly version is that it opens the door for more tinkering. It’s pretty hard (if not impossible) to make the compiler generate exactly what you want, to create various experiments. So for example you cannot easily remove a specific instruction in the code flow to check its impact on performance. But the assembly version lets you do whatever you want, without interference.

And we’re certainly going to exploit this.

What we did see so far, fair enough, is that our naive ideas about this code were wrong: keeping things in registers and avoiding spilling to the stack is apparently not such a big deal. This much is very true: these days it’s very difficult to foresee how a change will affect performance. But that is exactly why the assembly version helps: it allows you to do very precise experiments, and see the impact of each line.

For example you can comment out the part that outputs the pairs, without the optimizing compiler removing the whole SIMD test when it discovers that it now can.

Do that and you discover that the timings barely change. These writes to output overlapping pairs are virtually free.

Continue doing this meticulously, and you will soon reach that one:

inc edi // Index1++

Comment it out, and…

Complete test (brute force): found 11811 intersections in 822828 K-cycles.
11727 K-cycles.
10996 K-cycles.
11431 K-cycles.
10930 K-cycles.
10984 K-cycles.
11271 K-cycles.
11632 K-cycles.
11412 K-cycles.
11621 K-cycles.
12076 K-cycles.
11480 K-cycles.
11624 K-cycles.
11244 K-cycles.
10963 K-cycles.
11492 K-cycles.
11074 K-cycles.
Complete test (box pruning): found 11715 intersections in 10930 K-cycles.

Wo-ah.

WO-AH.

This is now significantly faster, even though edi is only used when we output pairs. So it could be that not increasing edi removes some cache misses, because we use it as an index in the remap table, and if it’s constant we remap the same index all the time, so, less cache misses. Except we just saw that removing the whole block of code writing out the pairs barely had an impact on performance. Hmmm.

So, scratching head, observing, formulating a theory: is that because we use “inc” instead of “add” ? Testing: nope, replacing “inc edi” with “add edi, 1” doesn’t change anything.

More theories led to more tests that didn’t reveal anything. For some reason, increasing edi there is costly. I admit I still don’t know why.

But at the end of the day, the “why” doesn’t matter much. At this level it’s a bit like quantum mechanics: there are a lot of weird rules at play, weird things happen, and trying to explain “why” in an intuitive way doesn’t always work. What matters more is that the assembly version allowed us to discover a weakness in the code, and we would never have seen that in the C++ version, because it was only:

Index1++;

Which gave birth to three different adds:

003E301C add ecx,8
003E301F add edx,4
003E3022 add edi,10h

We don’t have any control over these assembly adds in the C++ code, because there’s only one line there, and we cannot just remove it without breaking the whole loop. The assembly version however gives us control over the atomic instructions, for more fine-grained experiments. Experimenting at C++ level is like playing the piano with boxing gloves: you end up creating noise rather than music.

Note also how that one was found by accident: random changes. Random mutations, as if we’d be running a genetic algorithm on the code sequence. Random jumps to get out of a local minimum. I remember a competition that once took place in the office between me and a co-worker. At some point I came up with a counter-intuitive piece of code, full of branches, that turned out to be faster than all the branchless versions we had tried before. That co-worker told me something I’d never forget: “I would never have tried that!“. It was said with a tone clearly expressing his disgust at what looked like a terrible idea and a terrible piece of code. And that’s the lesson here: assumptions are the mother of all fuckups, you know that quote. When you’re that close to the metal, sometimes it’s good to forget what you think you know, and just try things. Even stupid things. Especially stupid things.

Sometimes, like here, it works. Increasing edi is slow? Fine, I don’t need to know why: I just won’t do it.

With this clear goal in mind, writing a new version doesn’t take long.

The new strategy is to use a single offset (ecx) to address both the BoxListX and BoxListYZ arrays. Then, when an overlap occurs, we recompute the box index from that single offset, instead of increasing a counter each time. As a result, we can replace the inc / add / add from the previous version with just one add. The most used codepath is now only 8 instructions (compared to the 13 from the compiler-generated version), and most importantly we don’t touch edi anymore.

So, did that work this time?

Yep:

Home PC:

Complete test (brute force): found 11811 intersections in 781991 K-cycles.
15514 K-cycles.
11767 K-cycles.
11733 K-cycles.
11734 K-cycles.
12045 K-cycles.
11758 K-cycles.
11737 K-cycles.
11736 K-cycles.
11748 K-cycles.
11744 K-cycles.
11733 K-cycles.
11736 K-cycles.
11755 K-cycles.
11758 K-cycles.
11736 K-cycles.
11731 K-cycles.
Complete test (box pruning): found 11725 intersections in 11731 K-cycles.

Office PC:

Complete test (brute force): found 11811 intersections in 820108 K-cycles.
10815 K-cycles.
10923 K-cycles.
10528 K-cycles.
10509 K-cycles.
10804 K-cycles.
10524 K-cycles.
10921 K-cycles.
10027 K-cycles.
10815 K-cycles.
10792 K-cycles.
10019 K-cycles.
10016 K-cycles.
10983 K-cycles.
10016 K-cycles.
10495 K-cycles.
10014 K-cycles.
Complete test (box pruning): found 11725 intersections in 10014 K-cycles.

The gains are summarized here:

Home PC

Timings (K-Cycles)

Delta (K-Cycles)

Speedup

Overall X factor

(Version1)

(101662)

Version2 - base

98822

0

0%

1.0

Version3

93138

~5600

~5%

~1.06

Version4

81834

~11000

~12%

~1.20

Version5

78140

~3600

~4%

~1.26

Version6a

60579

~17000

~22%

~1.63

Version6b

41605

~18000

~31%

~2.37

(Version7)

(40906)

-

-

-

(Version8)

(31383)

(~10000)

(~24%)

(~3.14)

Version9a

34486

~7100

~17%

~2.86

Version9b - unsafe

32477

~2000

~5%

~3.04

Version9b - safe

32565

~1900

~5%

~3.03

Version9c - unsafe

16223

~16000

~50%

~6.09

Version9c - safe

14802

~17000

~54%

~6.67

(Version10)

(16667)

-

-

-

Version11 - unsafe

14372

~1800

~11%

~6.87

Version11 - safe

14512

~200

~2%

~6.80

Version12 - first

14309

-

-

~6.90

Version12 - second

11731

~2600

~18%

~8.42

Office PC

Timings (K-Cycles)

Delta (K-Cycles)

Speedup

Overall X factor

(Version1)

(96203)

Version2 - base

92885

0

0%

1.0

Version3

88352

~4500

~5%

~1.05

Version4

77156

~11000

~12%

~1.20

Version5

73778

~3300

~4%

~1.25

Version6a

58451

~15000

~20%

~1.58

Version6b

45634

~12000

~21%

~2.03

(Version7)

(43987)

-

-

-

(Version8)

(29083)

(~16000)

(~36%)

(~3.19)

Version9a

31864

~13000

~30%

~2.91

Version9b - unsafe

15097

~16000

~52%

~6.15

Version9b - safe

15116

~16000

~52%

~6.14

Version9c - unsafe

12707

~2300

~15%

~7.30

Version9c - safe

12562

~2500

~16%

~7.39

(Version10)

(15648)

-

-

-

Version11 - unsafe

12570

~100

~1%

~7.38

Version11 - safe

12611

-

-

~7.36

Version12 - first

12917

-

-

~7.19

Version12 - second

10014

~2500

~20%

~9.27

There you go! We’re getting close to a 10X speedup…

Who said hand-written assembly cannot beat the compiler?

Now it would be tempting to be smug about it, conclude that “assembly rules” or something, that compilers are “lame”, or any of the many things ex demo-coders are fond of saying.

But things in the real world are not that black-or-white, as I will demonstrate in the next post.

Stay tuned. We are not done.

What we learnt:

An assembly version is often useful to let us play with the code, experiment, and discover potential performance gains that wouldn’t have been easy to spot with the C++ version.

Hand-written assembly can still be faster than the compiler-generated version. Or so it seems.

GitHub code for part 12

shopfr.org cialis