Box pruning revisited - part 14 (preview)

February 17th, 2017

Ski holiday for me!

Here’s a preview of part 14 to keep you occupied meanwhile.

Box pruning revisited - part 13 - code alignment

February 17th, 2017

Part 13 – code alignment

Last time we rewrote the main loop in assembly, and found a way to make it run faster.

Knowing what we know now, can we rewrite the C++ version and get the same performance gains? Can we make it generate what we want? That is what we will explore today.

In the assembly version our main new trick was to use the same offset to index both box arrays, and then reuse that same offset again to lazy-evaluate one of the box indices when an overlap occurs.

In the C++ version this looks like a rather nebulous affair since, well, there was only one index in the first place:

It’s hard to see how to “improve” this.

But as I often say, you should not focus on what the C++ code looks like. You should see through it, and think in terms of generated assembly. In that respect, that code can be improved.

We know from last time that this single “Index1++” does in fact generate three separate “adds” used for three different things. To avoid this, take a leap of faith: start from the assembly and translate it back to C++ in the most direct way.

————–

These lines from version 12:

Now become:

————–

These:

Now become:

————–

The single offset:

xor ecx, ecx

Becomes:

udword Offset = 0;

————–

The loop:

Is transformed back into a regular while loop, using the single offset:

————–

The SIMD overlap test:

Now uses _mm_cmplt_ps in the C++ code as well, with swapped arguments, and our single offset again:

————–

And finally the part that outputs the pair:

Is reproduced in C++ this way:

…with “outputPair” the same non-inlined function as in the assembly code.

————–

The resulting C++ code is thus:

It looks a bit strange, and I suppose some would say it’s not “safe”, but….

Does it make any difference?

Home PC:

Complete test (brute force): found 11811 intersections in 781701 K-cycles.
14495 K-cycles.
13701 K-cycles.
13611 K-cycles.
13597 K-cycles.
13611 K-cycles.
13600 K-cycles.
13597 K-cycles.
14091 K-cycles.
13718 K-cycles.
13599 K-cycles.
13728 K-cycles.
13635 K-cycles.
13596 K-cycles.
13598 K-cycles.
13638 K-cycles.
13628 K-cycles.
Complete test (box pruning): found 11715 intersections in 13596 K-cycles.

Office PC:

Complete test (brute force): found 11811 intersections in 807508 K-cycles.
13324 K-cycles.
12650 K-cycles.
14015 K-cycles.
12647 K-cycles.
12646 K-cycles.
13743 K-cycles.
12648 K-cycles.
12647 K-cycles.
13057 K-cycles.
13643 K-cycles.
12684 K-cycles.
12648 K-cycles.
13095 K-cycles.
13330 K-cycles.
12645 K-cycles.
12648 K-cycles.
Complete test (box pruning): found 11715 intersections in 12645 K-cycles.

Not really.

Let’s recap where we stand now:

On the home PC we see that version 13 is perhaps a tiny bit better than version 11 now, but not as good as the assembly version:

C++ (version 11, unsafe)

14372

Assembly (version 12, unsafe)

11731

C++ (version 13, unsafe)

13596

But on the office PC, it’s a wash:

C++ (version 11, unsafe)

12570

Assembly (version 12, unsafe)

10014

C++ (version 13, unsafe)

12645

Was it all for nothing then?

Wait, wait. Don’t forget to check the disassembly:

00C93020 cmpltps xmm1,xmmword ptr [ecx+esi*2]
00C93025 movmskps eax,xmm1
00C93028 cmp eax,0Ch
00C9302B jne CompleteBoxPruning+292h (0C93052h)
00C9302D push dword ptr [esp+20h]
00C93031 mov eax,edi
00C93033 push dword ptr [pairs]
00C93036 sub eax,edx
00C93038 add eax,esi
00C9303A sar eax,3
00C9303D push eax
00C9303E push dword ptr [esp+3Ch]
00C93042 call outputPair (0C92580h)
00C93047 mov edx,dword ptr [esp+24h]
00C9304B mov ecx,dword ptr [esp+44h]
00C9304F add esp,10h
00C93052 movss xmm0,dword ptr [esp+38h]
00C93058 movaps xmm1,xmmword ptr [esp+40h]
00C9305D add esi,8
00C93060 comiss xmm0,dword ptr [edi+esi]
00C93064 jae CompleteBoxPruning+260h (0C93020h)

Well at least it does look quite nicer and tighter than before.

The three blocks have been color-coded like in previous versions. The main codepath is only 9 instructions (against 13 in version 11 and 8 in version 12).

There is a single add (00C9305D) in the third block, instead of three before: that’s what we wanted. In that respect, the strategy worked: we did trick the compiler into generating a single add!

But unfortunately both xmm0 and xmm1 are constantly reloaded from the stack in the third block, and that is probably what kills it. Ignore these two instructions, and the code is pretty much what we wanted.

The assembly version (our target code) was like this:

movaps xmm3, xmm2
cmpltps xmm3, xmmword ptr [edx+ecx*2]
movmskps eax, xmm3

And if we move one line from the third block we see that we just got:

00C93058 movaps xmm1,xmmword ptr [esp+40h]
00C93020 cmpltps xmm1,xmmword ptr [ecx+esi*2]
00C93025 movmskps eax,xmm1

So close. But no cigar.

Now what? How do you tell the compiler to stop spilling xmm registers to the stack?

Well: I don’t know. There is probably no way.

But you can try the same strategy as before: when you’re stuck, try random changes. Mutate the code. In particular, you can replace this:

if(_mm_movemask_ps(_mm_cmplt_ps(b, _mm_load_ps(box)))==12)

With that:

if(_mm_movemask_ps(_mm_cmpnle_ps(_mm_load_ps(box), b))==12)

It looks like a rather innocent change, but amazingly, it generates the following assembly:

00AE2FA0 movaps xmm0,xmmword ptr [ecx+esi*2]
00AE2FA4 cmpnleps xmm0,xmm1
00AE2FA8 movmskps eax,xmm0
00AE2FAB cmp eax,0Ch
00AE2FAE jne CompleteBoxPruning+29Ah (0AE2FDAh)
00AE2FB0 push dword ptr [esp+20h]
00AE2FB4 mov eax,edi
00AE2FB6 push dword ptr [pairs]
00AE2FB9 sub eax,edx
00AE2FBB add eax,esi
00AE2FBD sar eax,3
00AE2FC0 push eax
00AE2FC1 push dword ptr [esp+3Ch]
00AE2FC5 call outputPair (0AE2500h)
00AE2FCA movaps xmm1,xmmword ptr [esp+50h]
00AE2FCF mov edx,dword ptr [esp+24h]
00AE2FD3 mov ecx,dword ptr [esp+44h]
00AE2FD7 add esp,10h
00AE2FDA movss xmm0,dword ptr [esp+38h]
00AE2FE0 add esi,8
00AE2FE3 comiss xmm0,dword ptr [edi+esi]
00AE2FE7 jae CompleteBoxPruning+260h (0AE2FA0h)

One of the loads from the stack vanished: it moved from 00C93058 in the previous version to 00AE2FCA now, i.e. it moved from the frequent codepath to the infrequent one. The arguments swap allowed the compiler to do so, because xmm1 is not destroyed by the comparison anymore. The resulting code is now almost exactly the same as our assembly version: the only difference is that xmm0 is reloaded from the stack in the third block (00AE2FDA), while it should instead be kept in a constant xmm register.

This doesn’t look like much, but this innocent change produces a measurable performance gain compared to the version just before. Our revisited C++ version, while still not as neat as the assembly version, is now clearly faster than what we got in version 11:

Home PC (unsafe version):

Complete test (brute force): found 11811 intersections in 782053 K-cycles.
12972 K-cycles.
12162 K-cycles.
12152 K-cycles.
12152 K-cycles.
12422 K-cycles.
12207 K-cycles.
12154 K-cycles.
12151 K-cycles.
12174 K-cycles.
12153 K-cycles.
12148 K-cycles.
12398 K-cycles.
12184 K-cycles.
12152 K-cycles.
12150 K-cycles.
13039 K-cycles.
Complete test (box pruning): found 11725 intersections in 12148 K-cycles.

Office PC (unsafe version):

Complete test (brute force): found 11811 intersections in 815031 K-cycles.
11409 K-cycles.
10029 K-cycles.
11243 K-cycles.
10726 K-cycles.
10055 K-cycles.
10784 K-cycles.
10588 K-cycles.
10548 K-cycles.
10290 K-cycles.
10029 K-cycles.
10408 K-cycles.
10464 K-cycles.
10030 K-cycles.
10475 K-cycles.
10028 K-cycles.
10028 K-cycles.
Complete test (box pruning): found 11725 intersections in 10028 K-cycles.

The gains are summarized here:

Home PC

Timings (K-Cycles)

Delta (K-Cycles)

Speedup

Overall X factor

(Version1)

(101662)

Version2 - base

98822

0

0%

1.0

Version3

93138

~5600

~5%

~1.06

Version4

81834

~11000

~12%

~1.20

Version5

78140

~3600

~4%

~1.26

Version6a

60579

~17000

~22%

~1.63

Version6b

41605

~18000

~31%

~2.37

(Version7)

(40906)

-

-

-

(Version8)

(31383)

(~10000)

(~24%)

(~3.14)

Version9a

34486

~7100

~17%

~2.86

Version9b - unsafe

32477

~2000

~5%

~3.04

Version9b - safe

32565

~1900

~5%

~3.03

Version9c - unsafe

16223

~16000

~50%

~6.09

Version9c - safe

14802

~17000

~54%

~6.67

(Version10)

(16667)

-

-

-

Version11 - unsafe

14372

~1800

~11%

~6.87

Version11 - safe

14512

~200

~2%

~6.80

(Version12 - 1st)

(14309)

-

-

~6.90

(Version12 – 2nd)

(11731)

(~2600)

(~18%)

(~8.42)

Version13 - unsafe

12148

~2200

~15%

~8.13

Office PC

Timings (K-Cycles)

Delta (K-Cycles)

Speedup

Overall X factor

(Version1)

(96203)

Version2 - base

92885

0

0%

1.0

Version3

88352

~4500

~5%

~1.05

Version4

77156

~11000

~12%

~1.20

Version5

73778

~3300

~4%

~1.25

Version6a

58451

~15000

~20%

~1.58

Version6b

45634

~12000

~21%

~2.03

(Version7)

(43987)

-

-

-

(Version8)

(29083)

(~16000)

(~36%)

(~3.19)

Version9a

31864

~13000

~30%

~2.91

Version9b - unsafe

15097

~16000

~52%

~6.15

Version9b - safe

15116

~16000

~52%

~6.14

Version9c - unsafe

12707

~2300

~15%

~7.30

Version9c - safe

12562

~2500

~16%

~7.39

(Version10)

(15648)

-

-

-

Version11 - unsafe

12570

~100

~1%

~7.38

Version11 - safe

12611

-

-

~7.36

(Version12 – 1st)

12917

-

-

~7.19

(Version12 – 2nd)

(10014)

(~2500)

(~20%)

(~9.27)

Version13 - unsafe

10028

~2500

~20%

~9.26

The gains are computed between version 13 and version 11. Version 12 is now ignored: we can do pretty much the same without the need for assembly.

But doing the assembly version allowed us to find this hidden optimization. And I had to present the assembly version first, to explain where this was coming from. If I would have switched from version 11 to version 13 directly, it would have looked like black magic.

Now, we only got an “unsafe” version 13, since it just got converted from version 12, for which we only had an “unsafe” version. Fine. So we complete the C++ code as we did before e.g. in version 11, for the safe version. The only change in the inner loop is this:

This generates the proper number of pairs, and the disassembly is exactly the same as for the unsafe version, except cmpnleps is replaced with cmpnltps. But it is otherwise the exact same assembly, same instructions, no extra spilling to the stack, all the same.

And yet…

Office PC (safe version):

Complete test (brute force): found 11811 intersections in 805609 K-cycles.
12524 K-cycles.
11874 K-cycles.
11726 K-cycles.
11726 K-cycles.
11941 K-cycles.
11884 K-cycles.
12073 K-cycles.
11725 K-cycles.
11756 K-cycles.
12752 K-cycles.
12267 K-cycles.
12534 K-cycles.
12274 K-cycles.
12588 K-cycles.
11726 K-cycles.
12184 K-cycles.
Complete test (box pruning): found 11811 intersections in 11725 K-cycles.

Madness. Why is this one slower?

I came back and forth between the assembly and the C++, looking for something I could have overlooked. And then I saw it.

The start of the most inner loop was not aligned on a 16-bytes boundary.

That…. was weird. I could have sworn that the compiler was always aligning the loops. I mean that’s where all these countless nops and lea esp,[esp] instructions come from. I see them all the time in the disassembly. That’s what you’re supposed to do, right? I had this rule in mind but was suddenly unsure where it was coming from. A quick bit of googling revealed it was still a perfectly valid rule, see for example the Intel optimization manual, chapter 3.4.1.5, it says right there:

All branch targets should be 16-byte aligned.

But the compiler apparently does not do that. Or at least: not always. I have no idea why. But that was my new theory anyway: the safe version is slower because the loop is not aligned.

Unfortunately I don’t know how to tell the compiler to align a loop in C++. So just for testing, I added back the same line as in the assembly version before the loop:

_asm align 16

“You’ll never believe what happened next!”

Office PC (safe version):

Complete test (brute force): found 11811 intersections in 846230 K-cycles.
11174 K-cycles.
10626 K-cycles.
10375 K-cycles.
10549 K-cycles.
10053 K-cycles.
10340 K-cycles.
10727 K-cycles.
10709 K-cycles.
10536 K-cycles.
10643 K-cycles.
10578 K-cycles.
10379 K-cycles.
11036 K-cycles.
10793 K-cycles.
10462 K-cycles.
10993 K-cycles.
Complete test (box pruning): found 11811 intersections in 10053 K-cycles.

At that point I was starting to sweat.

Because provided the analysis is correct and this is not just because of something entirely different that I cannot see, then:

  • I have no idea how to fix this without using at least one line of assembly.
  • I have no idea if this random alignment affected the previous versions, whose timings may all be questionable now.

In fact, we had a bit of an unsolved mystery at the end of version 9c, where for some reason the safe & unsafe versions at home showed a measurable performance difference. Rings a bell? That might have been because of this. The safe version adds two instructions in the code flow before the loop starts, so that’s enough to change its alignment…

With the enforced loop alignment, both safe and unsafe versions show the same timings both at home and in the office. Having to use one line of assembly is a bit unfortunate since I was trying to convert back everything to something more portable. That being said, this issue is obviously very compiler-dependent, so maybe that extra line doesn’t need porting – other compilers on other platforms might do the right thing immediately. Who knows?

Not a rhetorical question: seriously, who knows about that stuff? Email me and tell me.

In any case, after adding that line the timings become:

Home PC

Timings (K-Cycles)

Delta (K-Cycles)

Speedup

Overall X factor

(Version1)

(101662)

Version2 - base

98822

0

0%

1.0

Version3

93138

~5600

~5%

~1.06

Version4

81834

~11000

~12%

~1.20

Version5

78140

~3600

~4%

~1.26

Version6a

60579

~17000

~22%

~1.63

Version6b

41605

~18000

~31%

~2.37

(Version7)

(40906)

-

-

-

(Version8)

(31383)

(~10000)

(~24%)

(~3.14)

Version9a

34486

~7100

~17%

~2.86

Version9b - unsafe

32477

~2000

~5%

~3.04

Version9b - safe

32565

~1900

~5%

~3.03

Version9c - unsafe

16223

~16000

~50%

~6.09

Version9c - safe

14802

~17000

~54%

~6.67

(Version10)

(16667)

-

-

-

Version11 - unsafe

14372

~1800

~11%

~6.87

Version11 - safe

14512

~200

~2%

~6.80

(Version12 - 1st)

(14309)

-

-

~6.90

(Version12 – 2nd)

(11731)

(~2600)

(~18%)

(~8.42)

Version13 - unsafe

12296

~2000

~14%

~8.03

Version13 - safe

12236

~2200

~15%

~8.07

Office PC

Timings (K-Cycles)

Delta (K-Cycles)

Speedup

Overall X factor

(Version1)

(96203)

Version2 - base

92885

0

0%

1.0

Version3

88352

~4500

~5%

~1.05

Version4

77156

~11000

~12%

~1.20

Version5

73778

~3300

~4%

~1.25

Version6a

58451

~15000

~20%

~1.58

Version6b

45634

~12000

~21%

~2.03

(Version7)

(43987)

-

-

-

(Version8)

(29083)

(~16000)

(~36%)

(~3.19)

Version9a

31864

~13000

~30%

~2.91

Version9b - unsafe

15097

~16000

~52%

~6.15

Version9b - safe

15116

~16000

~52%

~6.14

Version9c - unsafe

12707

~2300

~15%

~7.30

Version9c - safe

12562

~2500

~16%

~7.39

(Version10)

(15648)

-

-

-

Version11 - unsafe

12570

~100

~1%

~7.38

Version11 - safe

12611

-

-

~7.36

(Version12 – 1st)

12917

-

-

~7.19

(Version12 – 2nd)

(10014)

(~2500)

(~20%)

(~9.27)

Version13 - unsafe

10168

~2400

~19%

~9.13

Version13 - safe

10053

~2500

~20%

~9.23

I will freely admit that for this post, I have no big confidence in my analysis. Forcing the compiler to align the loop seems to fix the issue and provide consistent results.

If nothing else, this whole post shows that one can sometimes tame the compiler and make it generate something closer to the ideal. Even though it’s difficult. And fragile. And not easily maintainable. And results will vary from compiler to compiler. Yeah, ok, fine: it’s as much luck as engineering at this point. All of this is true.

Still, converting the assembly back to C++ remains a fascinating experiment, and the results are there. You may not like the way the C++ code looks now, but this version is almost as fast as the hand-written assembly version.

Pheew.

Optimization: that’s a 24/7 job. Just look at how much time we can spend on one simple function…

What we learnt:

The assembly version showed us the way. While technically it still “wins”, the compiler-generated version stroke back and came very close in terms of performance. The generated code is almost the same as our hand-written assembly.

We reached a point where a single extra movaps in the loop has a measurable cost.

The compiler does not seem to always align loops, so some of them are effectively randomly aligned. Any extra instruction before the loop in the code flow can change the alignment and thus affect performance.

Thus there is no such thing as an “innocent change”.

Are we there yet, Papa Smurf?

No, I’m afraid we are not.

GitHub code for part 13.

Box pruning revisited - part 12 - ASM FTW

February 15th, 2017

Part 12 – ASM FTW

Fasten your seatbelts; this one is going to be a bit bumpy. Last time we were left with unanswered questions and doubts about the quality of the compiler-generated code.

Those who know me probably saw it coming from a mile away: this time we’ll rewrite the loop in assembly. Old-school, baby.

It is not always useful, but it is often informative and eye-opening.

To make things easier, we will limit ourselves to a small part of the loop. We are not going to rewrite the whole function. Just this snippet:

And we’re only going to do the “unsafe” version, since that’s (slightly) less work.

The easiest way to do the conversion is to use inline assembly in Visual Studio (you know now why this whole experiment is using a Win32 project - inline assembly is not supported anymore in Win64 builds).

You just comment out lines and start replacing them with blocks of _asm code. The assembly code there can directly access some of the C++ variables, it’s all rather simple. Sort of.

For example this:

Becomes:

Then this:

Becomes for example:

And once the loop is done and you double-checked it iterates the proper number of times (it’s easy to compare against the C code when both are right there in the same file), you just add the missing bits.

The SIMD overlap test:

…will need some code before the loop to compute the initial address:

And then the actual test inside the loop will be something like:

At this point you don’t immediately write the code to output the pair. First you just increase a counter (using a free register like maybe ebx) and you count the number of overlaps you get. Then you printf the results and compare it to the C code. Is the number correct? Good, then you can now complete the function and actually output the pair. Writing a push_back in assembly is tedious and pointless, so you don’t actually need to do it. Just move the corresponding code to a C function like this:

And then call it from the assembly block, like this:

So there are a couple of subtleties in there.

The called function must have a __cdecl signature, because we sent parameters to it via the stack. Alternatively, compile the whole project with the /Gd compile option (it’s the default anyway).

If you omit __cdecl in the function signature and compile the whole program with the __fastcall convention (/Gr compile option), then it will crash when trying to call “outputPair“.

The parameters are passed in reverse order, that’s the convention. Note how you can directly use the Remap, pairs and RIndex0 variables from the assembly code. They’re all C++ variables, the compiler knows what to do. It’s very convenient. After the call the “add esp, 16” fixes the stack pointer - it’s 16 because we pushed four 4-bytes elements to the stack before calling the function.

Now the function call is wrapped by a pushad/popad couple. It’s the lazy man’s way to save and restore all registers to/from the stack. This makes sure that whatever happens in outputPair, it’s not going to modify the values kept in our CPU registers.

And then the movaps on xmm1 and xmm2 do the same save & restore for the two xmm registers we care about. These ones are easy to forget because most of the time outputPair does not change the xmm registers… but it does when there is a re-allocation within the container. If you are not aware of that, this can be a hard-to-find bug.

Right.

If you put all this together you get this first, “naive” version:

Let’s see how it fares against the questions raised in part 11.

Question 1 was: why write the code in such a way that it destroys the constant box?

Answer: because we don’t have a choice. We do the same in our assembly version: xmm3 is also the constant box, and it’s also destroyed by the cmpltps instruction, exactly like in the compiler-generated version. It turns out that there is no alternative. We used this intrinsic:

_mm_cmpgt_ps

But what we got in the assembly was:

cmpltps

Do you spot the difference? It’s “greater than” in the intrinsic, and “lower than” in the actual assembly. It’s the intrinsic trap again. You can see it here for example. The intrinsic is marked as CMPLTPSr, which means there is no actual cmpgtps instruction: it does not exist. So the compiler uses the “reverse” instruction instead (CMPLTPS) and swaps the arguments. That gives birth to the code we saw.

Beyond that, even if cmpgtps would exist, one way or another the comparison instruction would overwrite the contents of one register. It can be either our constant box, or the freshly loaded data for the new box. And both alternatives come down to the same amount of instructions: one load, and one cmp. We cannot do less.

So, it’s a wash. The compiler was not to blame here, our question was simply naïve.

———-

Question 2 was: why reload the constant box from the stack? Why not keeping it in a register? The whole loop only uses two xmm registers (!) so it’s not like there’s a shortage of them.

=> We fixed this. The constant box is now kept in xmm2, which is only saved and restored when an overlap occurs. We win.

———-

Question 3 was: why do you reload it from the stack each time? It’s a constant for the whole loop. There are free available xmm registers. Just put it in xmm2, keep it there, and the “movss” instruction above just vanishes.

=> We fixed this. This was about MaxLimit, which is now kept in xmm1. The “movss” disappeared indeed. We win.

———-

Question 4 was about the BoxListX array being constantly written to the stack (and sometimes reloaded from it, with VC10).

=> We fixed this. It’s now kept in the esi register at all time. We win.

Question 5 was: why do you save edx to the stack all the time (address 003E3028)? Why don’t you just save and restore it only when an overlap actually occurs?

Question 6 was: same question for ecx.

=> We fixed this. We don’t write any CPU registers to the stack, unless an overlap occurs. We win.

———-

Sounds good!

The most often used codepath is now 10 instructions, compared to 13 before. Exciting!

But then you run the benchmark and…

Home PC:

Complete test (brute force): found 11811 intersections in 781774 K-cycles.
15193 K-cycles.
14574 K-cycles.
14399 K-cycles.
14310 K-cycles.
14619 K-cycles.
14366 K-cycles.
14310 K-cycles.
14322 K-cycles.
14310 K-cycles.
14310 K-cycles.
14309 K-cycles.
14740 K-cycles.
14346 K-cycles.
14311 K-cycles.
14375 K-cycles.
14374 K-cycles.
Complete test (box pruning): found 11715 intersections in 14309 K-cycles.

Office PC:

Complete test (brute force): found 11811 intersections in 812640 K-cycles.
13814 K-cycles.
13636 K-cycles.
13544 K-cycles.
13745 K-cycles.
13547 K-cycles.
13610 K-cycles.
13549 K-cycles.
13423 K-cycles.
12929 K-cycles.
13498 K-cycles.
13052 K-cycles.
13213 K-cycles.
12917 K-cycles.
13166 K-cycles.
13510 K-cycles.
13817 K-cycles.
Complete test (box pruning): found 11715 intersections in 12917 K-cycles.

The gains are summarized here:

Home PC

Timings (K-Cycles)

Delta (K-Cycles)

Speedup

Overall X factor

(Version1)

(101662)

Version2 - base

98822

0

0%

1.0

Version3

93138

~5600

~5%

~1.06

Version4

81834

~11000

~12%

~1.20

Version5

78140

~3600

~4%

~1.26

Version6a

60579

~17000

~22%

~1.63

Version6b

41605

~18000

~31%

~2.37

(Version7)

(40906)

-

-

-

(Version8)

(31383)

(~10000)

(~24%)

(~3.14)

Version9a

34486

~7100

~17%

~2.86

Version9b - unsafe

32477

~2000

~5%

~3.04

Version9b - safe

32565

~1900

~5%

~3.03

Version9c - unsafe

16223

~16000

~50%

~6.09

Version9c - safe

14802

~17000

~54%

~6.67

(Version10)

(16667)

-

-

-

Version11 - unsafe

14372

~1800

~11%

~6.87

Version11 - safe

14512

~200

~2%

~6.80

Version12

14309

-

-

~6.90

Office PC

Timings (K-Cycles)

Delta (K-Cycles)

Speedup

Overall X factor

(Version1)

(96203)

Version2 - base

92885

0

0%

1.0

Version3

88352

~4500

~5%

~1.05

Version4

77156

~11000

~12%

~1.20

Version5

73778

~3300

~4%

~1.25

Version6a

58451

~15000

~20%

~1.58

Version6b

45634

~12000

~21%

~2.03

(Version7)

(43987)

-

-

-

(Version8)

(29083)

(~16000)

(~36%)

(~3.19)

Version9a

31864

~13000

~30%

~2.91

Version9b - unsafe

15097

~16000

~52%

~6.15

Version9b - safe

15116

~16000

~52%

~6.14

Version9c - unsafe

12707

~2300

~15%

~7.30

Version9c - safe

12562

~2500

~16%

~7.39

(Version10)

(15648)

-

-

-

Version11 - unsafe

12570

~100

~1%

~7.38

Version11 - safe

12611

-

-

~7.36

Version12

12917

-

-

~7.19

…it’s pretty much exactly the same speed as before, or even marginally slower.

You didn’t think it would be that easy, did you?

It’s not.

But it was important to go through this process: it gives a better understanding of why the compiler made certain choices, and it shows why conventional wisdom says that it is hard to beat the compiler these days. Most of the time, the hand-written version is indeed slower than the compiler-generated version.

That would be the conclusion you would stop at, if this would be the conclusion you wanted to reach.

Well…

Old habits die hard and ex-demomakers cannot stop here I’m afraid.

The reality is that you cannot do worse than the compiler: just start from the compiler-generated version. How can you do worse if you copy what it did and improve from there?

The real benefit of the assembly version is that it opens the door for more tinkering. It’s pretty hard (if not impossible) to make the compiler generate exactly what you want, to create various experiments. So for example you cannot easily remove a specific instruction in the code flow to check its impact on performance. But the assembly version lets you do whatever you want, without interference.

And we’re certainly going to exploit this.

What we did see so far, fair enough, is that our naive ideas about this code were wrong: keeping things in registers and avoiding spilling to the stack is apparently not such a big deal. This much is very true: these days it’s very difficult to foresee how a change will affect performance. But that is exactly why the assembly version helps: it allows you to do very precise experiments, and see the impact of each line.

For example you can comment out the part that outputs the pairs, without the optimizing compiler removing the whole SIMD test when it discovers that it now can.

Do that and you discover that the timings barely change. These writes to output overlapping pairs are virtually free.

Continue doing this meticulously, and you will soon reach that one:

inc edi // Index1++

Comment it out, and…

Complete test (brute force): found 11811 intersections in 822828 K-cycles.
11727 K-cycles.
10996 K-cycles.
11431 K-cycles.
10930 K-cycles.
10984 K-cycles.
11271 K-cycles.
11632 K-cycles.
11412 K-cycles.
11621 K-cycles.
12076 K-cycles.
11480 K-cycles.
11624 K-cycles.
11244 K-cycles.
10963 K-cycles.
11492 K-cycles.
11074 K-cycles.
Complete test (box pruning): found 11715 intersections in 10930 K-cycles.

Wo-ah.

WO-AH.

This is now significantly faster, even though edi is only used when we output pairs. So it could be that not increasing edi removes some cache misses, because we use it as an index in the remap table, and if it’s constant we remap the same index all the time, so, less cache misses. Except we just saw that removing the whole block of code writing out the pairs barely had an impact on performance. Hmmm.

So, scratching head, observing, formulating a theory: is that because we use “inc” instead of “add” ? Testing: nope, replacing “inc edi” with “add edi, 1” doesn’t change anything.

More theories led to more tests that didn’t reveal anything. For some reason, increasing edi there is costly. I admit I still don’t know why.

But at the end of the day, the “why” doesn’t matter much. At this level it’s a bit like quantum mechanics: there are a lot of weird rules at play, weird things happen, and trying to explain “why” in an intuitive way doesn’t always work. What matters more is that the assembly version allowed us to discover a weakness in the code, and we would never have seen that in the C++ version, because it was only:

Index1++;

Which gave birth to three different adds:

003E301C add ecx,8
003E301F add edx,4
003E3022 add edi,10h

We don’t have any control over these assembly adds in the C++ code, because there’s only one line there, and we cannot just remove it without breaking the whole loop. The assembly version however gives us control over the atomic instructions, for more fine-grained experiments. Experimenting at C++ level is like playing the piano with boxing gloves: you end up creating noise rather than music.

Note also how that one was found by accident: random changes. Random mutations, as if we’d be running a genetic algorithm on the code sequence. Random jumps to get out of a local minimum. I remember a competition that once took place in the office between me and a co-worker. At some point I came up with a counter-intuitive piece of code, full of branches, that turned out to be faster than all the branchless versions we had tried before. That co-worker told me something I’d never forget: “I would never have tried that!“. It was said with a tone clearly expressing his disgust at what looked like a terrible idea and a terrible piece of code. And that’s the lesson here: assumptions are the mother of all fuckups, you know that quote. When you’re that close to the metal, sometimes it’s good to forget what you think you know, and just try things. Even stupid things. Especially stupid things.

Sometimes, like here, it works. Increasing edi is slow? Fine, I don’t need to know why: I just won’t do it.

With this clear goal in mind, writing a new version doesn’t take long.

The new strategy is to use a single offset (ecx) to address both the BoxListX and BoxListYZ arrays. Then, when an overlap occurs, we recompute the box index from that single offset, instead of increasing a counter each time. As a result, we can replace the inc / add / add from the previous version with just one add. The most used codepath is now only 8 instructions (compared to the 13 from the compiler-generated version), and most importantly we don’t touch edi anymore.

So, did that work this time?

Yep:

Home PC:

Complete test (brute force): found 11811 intersections in 781991 K-cycles.
15514 K-cycles.
11767 K-cycles.
11733 K-cycles.
11734 K-cycles.
12045 K-cycles.
11758 K-cycles.
11737 K-cycles.
11736 K-cycles.
11748 K-cycles.
11744 K-cycles.
11733 K-cycles.
11736 K-cycles.
11755 K-cycles.
11758 K-cycles.
11736 K-cycles.
11731 K-cycles.
Complete test (box pruning): found 11725 intersections in 11731 K-cycles.

Office PC:

Complete test (brute force): found 11811 intersections in 820108 K-cycles.
10815 K-cycles.
10923 K-cycles.
10528 K-cycles.
10509 K-cycles.
10804 K-cycles.
10524 K-cycles.
10921 K-cycles.
10027 K-cycles.
10815 K-cycles.
10792 K-cycles.
10019 K-cycles.
10016 K-cycles.
10983 K-cycles.
10016 K-cycles.
10495 K-cycles.
10014 K-cycles.
Complete test (box pruning): found 11725 intersections in 10014 K-cycles.

The gains are summarized here:

Home PC

Timings (K-Cycles)

Delta (K-Cycles)

Speedup

Overall X factor

(Version1)

(101662)

Version2 - base

98822

0

0%

1.0

Version3

93138

~5600

~5%

~1.06

Version4

81834

~11000

~12%

~1.20

Version5

78140

~3600

~4%

~1.26

Version6a

60579

~17000

~22%

~1.63

Version6b

41605

~18000

~31%

~2.37

(Version7)

(40906)

-

-

-

(Version8)

(31383)

(~10000)

(~24%)

(~3.14)

Version9a

34486

~7100

~17%

~2.86

Version9b - unsafe

32477

~2000

~5%

~3.04

Version9b - safe

32565

~1900

~5%

~3.03

Version9c - unsafe

16223

~16000

~50%

~6.09

Version9c - safe

14802

~17000

~54%

~6.67

(Version10)

(16667)

-

-

-

Version11 - unsafe

14372

~1800

~11%

~6.87

Version11 - safe

14512

~200

~2%

~6.80

Version12 - first

14309

-

-

~6.90

Version12 - second

11731

~2600

~18%

~8.42

Office PC

Timings (K-Cycles)

Delta (K-Cycles)

Speedup

Overall X factor

(Version1)

(96203)

Version2 - base

92885

0

0%

1.0

Version3

88352

~4500

~5%

~1.05

Version4

77156

~11000

~12%

~1.20

Version5

73778

~3300

~4%

~1.25

Version6a

58451

~15000

~20%

~1.58

Version6b

45634

~12000

~21%

~2.03

(Version7)

(43987)

-

-

-

(Version8)

(29083)

(~16000)

(~36%)

(~3.19)

Version9a

31864

~13000

~30%

~2.91

Version9b - unsafe

15097

~16000

~52%

~6.15

Version9b - safe

15116

~16000

~52%

~6.14

Version9c - unsafe

12707

~2300

~15%

~7.30

Version9c - safe

12562

~2500

~16%

~7.39

(Version10)

(15648)

-

-

-

Version11 - unsafe

12570

~100

~1%

~7.38

Version11 - safe

12611

-

-

~7.36

Version12 - first

12917

-

-

~7.19

Version12 - second

10014

~2500

~20%

~9.27

There you go! We’re getting close to a 10X speedup…

Who said hand-written assembly cannot beat the compiler?

Now it would be tempting to be smug about it, conclude that “assembly rules” or something, that compilers are “lame”, or any of the many things ex demo-coders are fond of saying.

But things in the real world are not that black-or-white, as I will demonstrate in the next post.

Stay tuned. We are not done.

What we learnt:

An assembly version is often useful to let us play with the code, experiment, and discover potential performance gains that wouldn’t have been easy to spot with the C++ version.

Hand-written assembly can still be faster than the compiler-generated version. Or so it seems.

GitHub code for part 12

Box pruning revisited - part 11 - the last branch

February 14th, 2017

Part 11 – the last branch

The SIMD version gave us a good speedup. The best scalar version was also quite successful, thanks to the removal of several hard-to-predict branches.

Looking at the remaining code with this in mind, one more branch jumps to the eyes, just before the SIMD overlap test:

This ensures that we never test a box against itself, but I don’t quite remember why it’s there. The way the code is written, it cannot happen. So this test can just be removed. Since it never happens, the branch is likely to be always correctly predicted, and thus the expected gains are small, if any.

But still, worth doing - it’s just one line.

And the results are actually interesting:

Home PC:

Unsafe version:

Complete test (brute force): found 11811 intersections in 781447 K-cycles.
15224 K-cycles.
14386 K-cycles.
14633 K-cycles.
14387 K-cycles.
14376 K-cycles.
14392 K-cycles.
14376 K-cycles.
14373 K-cycles.
14579 K-cycles.
14400 K-cycles.
14374 K-cycles.
14388 K-cycles.
14373 K-cycles.
14372 K-cycles.
14471 K-cycles.
14404 K-cycles.
Complete test (box pruning): found 11715 intersections in 14372 K-cycles.

Safe version:

Complete test (brute force): found 11811 intersections in 781701 K-cycles.
15306 K-cycles.
14533 K-cycles.
14513 K-cycles.
14731 K-cycles.
14552 K-cycles.
14512 K-cycles.
14528 K-cycles.
14514 K-cycles.
14514 K-cycles.
14528 K-cycles.
14535 K-cycles.
14513 K-cycles.
14526 K-cycles.
14515 K-cycles.
14515 K-cycles.
14621 K-cycles.
Complete test (box pruning): found 11811 intersections in 14512 K-cycles.

Office PC:

Unsafe version:

Complete test (brute force): found 11811 intersections in 824867 K-cycles.
13618 K-cycles.
12900 K-cycles.
12573 K-cycles.
12957 K-cycles.
12570 K-cycles.
12911 K-cycles.
12572 K-cycles.
12573 K-cycles.
12588 K-cycles.
13153 K-cycles.
13447 K-cycles.
13212 K-cycles.
13429 K-cycles.
13214 K-cycles.
13527 K-cycles.
13229 K-cycles.
Complete test (box pruning): found 11715 intersections in 12570 K-cycles.

Safe version:

Complete test (brute force): found 11811 intersections in 816240 K-cycles.
13227 K-cycles.
13013 K-cycles.
12621 K-cycles.
14329 K-cycles.
12624 K-cycles.
13009 K-cycles.
13533 K-cycles.
13171 K-cycles.
12881 K-cycles.
13181 K-cycles.
13265 K-cycles.
13248 K-cycles.
13245 K-cycles.
13242 K-cycles.
13267 K-cycles.
12611 K-cycles.
Complete test (box pruning): found 11811 intersections in 12611 K-cycles.

The gains are summarized here:

Home PC

Timings (K-Cycles)

Delta (K-Cycles)

Speedup

Overall X factor

(Version1)

(101662)

Version2 - base

98822

0

0%

1.0

Version3

93138

~5600

~5%

~1.06

Version4

81834

~11000

~12%

~1.20

Version5

78140

~3600

~4%

~1.26

Version6a

60579

~17000

~22%

~1.63

Version6b

41605

~18000

~31%

~2.37

(Version7)

(40906)

-

-

-

(Version8)

(31383)

(~10000)

(~24%)

(~3.14)

Version9a

34486

~7100

~17%

~2.86

Version9b - unsafe

32477

~2000

~5%

~3.04

Version9b - safe

32565

~1900

~5%

~3.03

Version9c - unsafe

16223

~16000

~50%

~6.09

Version9c - safe

14802

~17000

~54%

~6.67

(Version10)

(16667)

-

-

-

Version11 - unsafe

14372

~1800

~11%

~6.87

Version11 - safe

14512

~200

~2%

~6.80

Office PC

Timings (K-Cycles)

Delta (K-Cycles)

Speedup

Overall X factor

(Version1)

(96203)

Version2 - base

92885

0

0%

1.0

Version3

88352

~4500

~5%

~1.05

Version4

77156

~11000

~12%

~1.20

Version5

73778

~3300

~4%

~1.25

Version6a

58451

~15000

~20%

~1.58

Version6b

45634

~12000

~21%

~2.03

(Version7)

(43987)

-

-

-

(Version8)

(29083)

(~16000)

(~36%)

(~3.19)

Version9a

31864

~13000

~30%

~2.91

Version9b - unsafe

15097

~16000

~52%

~6.15

Version9b - safe

15116

~16000

~52%

~6.14

Version9c - unsafe

12707

~2300

~15%

~7.30

Version9c - safe

12562

~2500

~16%

~7.39

(Version10)

(15648)

-

-

-

Version11 - unsafe

12570

~100

~1%

~7.38

Version11 - safe

12611

-

-

~7.36

Version 11 is compared to version 9c here.

No difference on the office PC, but on the home PC the safe & unsafe versions are suddenly the same speed. Hmmm. We’ll never solve that particular mystery then. Oh well. All is good now at least.

Now what? Are we done?

We are about 7x faster than when we started, which is pretty good!

…but not quite the order of magnitude we wanted to reach.

At this point there isn’t much left in the inner loop, at least on the C++ side of things. So let’s check the new disassembly. The inner loop we’re interested in is:

And the corresponding disassembly is simple enough to isolate:

//SIMD_OVERLAP_TEST(BoxListYZ[Index1])
003E2FB3 cmpltps xmm1,xmmword ptr [edi]
003E2FB7 movmskps eax,xmm1
003E2FBA cmp eax,0Ch
003E2FBD jne CompleteBoxPruning+2D1h (03E3011h)
//pairs.Add(RIndex0).Add(Remap[Index1]);
003E2FBF mov eax,dword ptr [esi+4]
003E2FC2 cmp eax,dword ptr [esi]
003E2FC4 jne CompleteBoxPruning+28Fh (03E2FCFh)
003E2FC6 push 1
003E2FC8 mov ecx,esi
003E2FCA call IceCore::Container::Resize (03E1350h)
003E2FCF mov ecx,dword ptr [esi+4]
003E2FD2 mov eax,dword ptr [esi+8]
003E2FD5 mov edx,dword ptr [esp+34h]
003E2FD9 mov dword ptr [eax+ecx*4],edx
003E2FDC inc dword ptr [esi+4]
003E2FDF mov edx,dword ptr [esp+20h]
003E2FE3 mov eax,dword ptr [esi+4]
003E2FE6 mov ecx,dword ptr [edx]
003E2FE8 mov dword ptr [esp+30h],ecx
003E2FEC cmp eax,dword ptr [esi]
003E2FEE jne CompleteBoxPruning+2B9h (03E2FF9h)
003E2FF0 push 1
003E2FF2 mov ecx,esi
003E2FF4 call IceCore::Container::Resize (03E1350h)
003E2FF9 mov ecx,dword ptr [esi+4]
003E2FFC mov eax,dword ptr [esi+8]
003E2FFF mov edx,dword ptr [esp+30h]
003E3003 mov dword ptr [eax+ecx*4],edx
003E3006 inc dword ptr [esi+4]
003E3009 mov ecx,dword ptr [esp+28h]
003E300D mov edx,dword ptr [esp+20h]
// while(BoxListX[Index1].mMinX<=MaxLimit)
003E3011 movss xmm0,dword ptr [esp+38h]
003E3017 movaps xmm1,xmmword ptr [esp+40h]
003E301C add ecx,8
003E301F add edx,4
003E3022 add edi,10h
003E3025 comiss xmm0,dword ptr [ecx]
003E3028 mov dword ptr [esp+20h],edx
003E302C mov dword ptr [esp+28h],ecx
003E3030 jae CompleteBoxPruning+273h (03E2FB3h)

I added two blank lines and color-coded the whole thing to clearly delimitate three blocks of code.

Roughly, the first block is the SIMD overlap test, the second block is for writing the pair indices when an overlap is found, and the last block is the while loop (easily identified by the comiss instruction).

It’s… not great.

Where do I start?

———————

In the first two lines, since the xmmword ptr [edi] is the actual load of BoxListYZ[Index1], writing it this way means that xmm1 contains our constant box for the loop (the preloaded Box0YZ in the C++ code):

003E2FB3 cmpltps xmm1,xmmword ptr [edi]

003E2FB7 movmskps eax,xmm1

But the cmpltps instruction is always going to destroy the contents of xmm1. So this register will need to be reloaded for the next iteration. And indeed, that’s what happens in the third block:

003E3017 movaps xmm1,xmmword ptr [esp+40h]

We also confirm that edi is the BoxListYZ array, thanks to:

003E3022 add edi,10h

10h == 16 == sizeof(SIMD_AABB_YZ), makes sense.

Question 1: why write the code in such a way that it destroys the constant box?

Question 2: why reload the constant box from the stack? Why not keeping it in a register? The whole loop only uses two xmm registers (!) so it’s not like there’s a shortage of them.

———————

In the third block, we have this:

003E301C add ecx,8

And then:

003E3025 comiss xmm0,dword ptr [ecx]

Which means that ecx is the BoxListX array (with sizeof(SIMD_AABB_X)==8), and dword ptr [ecx] (a scalar load) is the assembly for “BoxListX[Index1].mMinX” in the C++ code.

Makes sense.

But then it also means that xmm0 is “MaxLimit“, which gets reloaded just before from the stack:

003E3011 movss xmm0,dword ptr [esp+38h]

Doesn’t make sense.

Question 3: Why do you reload it from the stack each time? It’s a constant for the whole loop. There are free available xmm registers. Just put it in xmm2, keep it there, and the “movss” instruction above just vanishes.

———————

In the third block again, we have this:

003E301F add edx,4

003E3028 mov dword ptr [esp+20h],edx

003E302C mov dword ptr [esp+28h],ecx

We saw ecx before, it’s the BoxListX array.

Question 4: why do you push it to the stack? Why don’t you keep it in a register?

———————

Then there’s edx, incremented by 4 and also pushed to the stack. If you look for [esp+20h] to check where we read it again, you find two places:

003E2FDF mov edx,dword ptr [esp+20h]

And:

003E300D mov edx,dword ptr [esp+20h]

The first place happens after the first “push back” call, and corresponds to this C++ code:

Add(Remap[Index1])

The second place happens after the second “push back” call, and simply restores the edx register, which is not actually used when the SIMD-overlap test doesn’t find an overlap.

You guessed it, edx is Index1, or rather Remap[Index1] as you can see from these lines in the second block:

003E2FDF mov edx,dword ptr [esp+20h]

003E2FE6 mov ecx,dword ptr [edx]

Right. Ok. But then…

Question 5: why do you save edx to the stack all the time (address 003E3028)? Why don’t you just save and restore it only when an overlap actually occurs?

Question 6: same question for ecx.

———————

As for the second block, it looks fine overall. Still, I know some people will ask this one:

Question 7: why do you use a custom vector class instead of std::vector?

Ok, let’s quickly deal with that one first, it won’t take long. Replace the custom container with std::vector and you get:

Home PC:

Unsafe:

Complete test (brute force): found 11811 intersections in 851569 K-cycles.
19727 K-cycles.
18913 K-cycles.
18881 K-cycles.
18879 K-cycles.
18908 K-cycles.
18901 K-cycles.
18891 K-cycles.
18882 K-cycles.
18878 K-cycles.
19110 K-cycles.
18902 K-cycles.
18901 K-cycles.
18881 K-cycles.
18895 K-cycles.
18904 K-cycles.
18882 K-cycles.
Complete test (box pruning): found 11715 intersections in 18878 K-cycles.

Office PC:

Unsafe:

Complete test (brute force): found 11811 intersections in 888075 K-cycles.
19200 K-cycles.
18307 K-cycles.
18584 K-cycles.
18729 K-cycles.
18306 K-cycles.
18647 K-cycles.
18571 K-cycles.
18306 K-cycles.
18767 K-cycles.
18551 K-cycles.
18420 K-cycles.
18659 K-cycles.
18530 K-cycles.
18365 K-cycles.
18491 K-cycles.
18310 K-cycles.
Complete test (box pruning): found 11715 intersections in 18306 K-cycles.

That is:

Home PC

Timings

Version11 - unsafe - ICE container

14372

Version11 - unsafe - std::vector

18878

Office PC

Timings

Version11 - unsafe - ICE container

12570

Version11 - unsafe - std::vector

18306

Just look at the numbers. That’s why I’m not using std::vector. It has never been able to implement a simple “push_back” correctly. It’s 2017 and std::vector still cannot match the first basic C++ class I wrote back in 1999.

There is no need to tell me about STLPort or EASTL or whatever: I’ve heard it all. If I cannot rely on the version included with Visual Studio, if I have to switch to some external library, I’d rather use my own. At least I know it’s properly implemented.

Just for “fun”, with the custom container the second block above is about 27 lines of code. Right?

Ok, now take a deep breath, here’s the equivalent second block with std::vector:

011836B0 mov ecx,dword ptr [esi+4]
011836B3 lea eax,[esp+28h]
011836B7 cmp eax,ecx
011836B9 jae CompleteBoxPruning+316h (01183736h)
011836BB mov edi,dword ptr [esi]
011836BD cmp edi,eax
011836BF ja CompleteBoxPruning+316h (01183736h)
011836C1 mov edx,dword ptr [esi+8]
011836C4 sub eax,edi
011836C6 sar eax,2
011836C9 mov dword ptr [esp+40h],eax
011836CD cmp ecx,edx
011836CF jne CompleteBoxPruning+302h (01183722h)
011836D1 mov eax,edx
011836D3 sub eax,ecx
011836D5 sar eax,2
011836D8 cmp eax,1
011836DB jae CompleteBoxPruning+302h (01183722h)
011836DD sub ecx,edi
011836DF sar ecx,2
011836E2 mov eax,3FFFFFFFh
011836E7 sub eax,ecx
011836E9 cmp eax,1
011836EC jb CompleteBoxPruning+4F9h (01183919h)
011836F2 inc ecx
011836F3 sub edx,edi
011836F5 sar edx,2
011836F8 mov dword ptr [esp+3Ch],ecx
011836FC mov ecx,edx
011836FE shr ecx,1
01183700 mov eax,3FFFFFFFh
01183705 sub eax,ecx
01183707 cmp eax,edx
01183709 jae CompleteBoxPruning+2EFh (0118370Fh)
0118370B xor edx,edx
0118370D jmp CompleteBoxPruning+2F1h (01183711h)
0118370F add edx,ecx
01183711 cmp edx,dword ptr [esp+3Ch]
01183715 mov ecx,esi
01183717 cmovb edx,dword ptr [esp+3Ch]
0118371C push edx
0118371D call std::vector<unsigned int,std::allocator<unsigned int> >::_Reallocate (011828F0h)
01183722 mov edx,dword ptr [esi+4]
01183725 test edx,edx
01183727 je CompleteBoxPruning+37Dh (0118379Dh)
01183729 mov eax,dword ptr [esi]
0118372B mov ecx,dword ptr [esp+40h]
0118372F mov eax,dword ptr [eax+ecx*4]
01183732 mov dword ptr [edx],eax
01183734 jmp CompleteBoxPruning+37Dh (0118379Dh)
01183736 mov edx,dword ptr [esi+8]
01183739 cmp ecx,edx
0118373B jne CompleteBoxPruning+370h (01183790h)
0118373D mov eax,edx
0118373F sub eax,ecx
01183741 sar eax,2
01183744 cmp eax,1
01183747 jae CompleteBoxPruning+370h (01183790h)
01183749 mov edi,dword ptr [esi]
0118374B sub ecx,edi
0118374D sar ecx,2
01183750 mov eax,3FFFFFFFh
01183755 sub eax,ecx
01183757 cmp eax,1
0118375A jb CompleteBoxPruning+4F9h (01183919h)
01183760 inc ecx
01183761 sub edx,edi
01183763 sar edx,2
01183766 mov dword ptr [esp+40h],ecx
0118376A mov ecx,edx
0118376C shr ecx,1
0118376E mov eax,3FFFFFFFh
01183773 sub eax,ecx
01183775 cmp eax,edx
01183777 jae CompleteBoxPruning+35Dh (0118377Dh)
01183779 xor edx,edx
0118377B jmp CompleteBoxPruning+35Fh (0118377Fh)
0118377D add edx,ecx
0118377F cmp edx,dword ptr [esp+40h]
01183783 mov ecx,esi
01183785 cmovb edx,dword ptr [esp+40h]
0118378A push edx
0118378B call std::vector<unsigned int,std::allocator<unsigned int> >::_Reallocate (011828F0h)
01183790 mov eax,dword ptr [esi+4]
01183793 test eax,eax
01183795 je CompleteBoxPruning+37Dh (0118379Dh)
01183797 mov ecx,dword ptr [esp+44h]
0118379B mov dword ptr [eax],ecx
0118379D add dword ptr [esi+4],4
011837A1 mov ecx,dword ptr [esi+4]
011837A4 mov eax,dword ptr [esp+14h]
011837A8 cmp eax,ecx
011837AA jae CompleteBoxPruning+405h (01183825h)
011837AC mov edi,dword ptr [esi]
011837AE cmp edi,eax
011837B0 ja CompleteBoxPruning+405h (01183825h)
011837B2 mov edx,dword ptr [esi+8]
011837B5 sub eax,edi
011837B7 sar eax,2
011837BA mov dword ptr [esp+3Ch],eax
011837BE cmp ecx,edx
011837C0 jne CompleteBoxPruning+3F3h (01183813h)
011837C2 mov eax,edx
011837C4 sub eax,ecx
011837C6 sar eax,2
011837C9 cmp eax,1
011837CC jae CompleteBoxPruning+3F3h (01183813h)
011837CE sub ecx,edi
011837D0 sar ecx,2
011837D3 mov eax,3FFFFFFFh
011837D8 sub eax,ecx
011837DA cmp eax,1
011837DD jb CompleteBoxPruning+4F9h (01183919h)
011837E3 inc ecx
011837E4 sub edx,edi
011837E6 sar edx,2
011837E9 mov dword ptr [esp+40h],ecx
011837ED mov ecx,edx
011837EF shr ecx,1
011837F1 mov eax,3FFFFFFFh
011837F6 sub eax,ecx
011837F8 cmp eax,edx
011837FA jae CompleteBoxPruning+3E0h (01183800h)
011837FC xor edx,edx
011837FE jmp CompleteBoxPruning+3E2h (01183802h)
01183800 add edx,ecx
01183802 cmp edx,dword ptr [esp+40h]
01183806 mov ecx,esi
01183808 cmovb edx,dword ptr [esp+40h]
0118380D push edx
0118380E call std::vector<unsigned int,std::allocator<unsigned int> >::_Reallocate (011828F0h)
01183813 mov ecx,dword ptr [esi+4]
01183816 test ecx,ecx
01183818 je CompleteBoxPruning+46Eh (0118388Eh)
0118381A mov eax,dword ptr [esi]
0118381C mov edx,dword ptr [esp+3Ch]
01183820 mov eax,dword ptr [eax+edx*4]
01183823 jmp CompleteBoxPruning+46Ch (0118388Ch)
01183825 mov edx,dword ptr [esi+8]
01183828 cmp ecx,edx
0118382A jne CompleteBoxPruning+463h (01183883h)
0118382C mov eax,edx
0118382E sub eax,ecx
01183830 sar eax,2
01183833 cmp eax,1
01183836 jae CompleteBoxPruning+45Fh (0118387Fh)
01183838 mov edi,dword ptr [esi]
0118383A sub ecx,edi
0118383C sar ecx,2
0118383F mov eax,3FFFFFFFh
01183844 sub eax,ecx
01183846 cmp eax,1
01183849 jb CompleteBoxPruning+4F9h (01183919h)
0118384F inc ecx
01183850 sub edx,edi
01183852 sar edx,2
01183855 mov dword ptr [esp+40h],ecx
01183859 mov ecx,edx
0118385B shr ecx,1
0118385D mov eax,3FFFFFFFh
01183862 sub eax,ecx
01183864 cmp eax,edx
01183866 jae CompleteBoxPruning+44Ch (0118386Ch)
01183868 xor edx,edx
0118386A jmp CompleteBoxPruning+44Eh (0118386Eh)
0118386C add edx,ecx
0118386E cmp edx,dword ptr [esp+40h]
01183872 mov ecx,esi
01183874 cmovb edx,dword ptr [esp+40h]
01183879 push edx
0118387A call std::vector<unsigned int,std::allocator<unsigned int> >::_Reallocate (011828F0h)
0118387F mov eax,dword ptr [esp+14h]
01183883 mov ecx,dword ptr [esi+4]
01183886 test ecx,ecx
01183888 je CompleteBoxPruning+46Eh (0118388Eh)
0118388A mov eax,dword ptr [eax]
0118388C mov dword ptr [ecx],eax
0118388E mov edx,dword ptr [esp+2Ch]
01183892 mov ecx,dword ptr [esp+14h]
01183896 add dword ptr [esi+4],4

That’s 180 lines of code. Instead of 27. It’s just ridiculous.

That’s all I have to say about std::vector.

———————

Now that this interlude is over, let’s go back to the sane version presented before.

The most commonly executed codepath (when an overlap is not found) has 13 instructions (it corresponds to the first block + the last block). And for 13 instructions we raised 6 potential issues. That’s a lot of questions for such a small piece of code.

Of course, we might be missing something. The compiler might know more than us. Our analysis might be too naive.

Perhaps.

Perhaps not.

How do you answer that?

As before: the scientific way. We observed. Now we need to come up with theories and tests to discern between facts and fiction.

———————

Theory: the compiler is not recent enough. A more recent compiler would produce better code.

Yeah. I heard that one before…. for every version of Visual Studio since VC6… But fair enough, it’s easy to test. In fact let’s try both VC10 (the SSE code generation has changed quite a bit between VC10 and VC11), and then VC14. VC10 gives:

00BE30FA movaps xmm1,xmmword ptr [ebx]
00BE30FD movaps xmm2,xmmword ptr [esp+40h]
00BE3102 cmpltps xmm2,xmm1
00BE3106 movmskps eax,xmm2
00BE3109 cmp eax,0Ch
00BE310C jne CompleteBoxPruning+3FEh (0BE315Eh)
00BE310E mov ecx,dword ptr [esi+4]
00BE3111 cmp ecx,dword ptr [esi]
00BE3113 jne CompleteBoxPruning+3BEh (0BE311Eh)
00BE3115 push 1
00BE3117 mov ecx,esi
00BE3119 call IceCore::Container::Resize (0BE1240h)
00BE311E mov edx,dword ptr [esi+4]
00BE3121 mov eax,dword ptr [esi+8]
00BE3124 mov ecx,dword ptr [esp+30h]
00BE3128 mov dword ptr [eax+edx*4],ecx
00BE312B inc dword ptr [esi+4]
00BE312E mov eax,dword ptr [esi+4]
00BE3131 mov edx,dword ptr [edi]
00BE3133 mov dword ptr [esp+34h],edx
00BE3137 cmp eax,dword ptr [esi]
00BE3139 jne CompleteBoxPruning+3E4h (0BE3144h)
00BE313B push 1
00BE313D mov ecx,esi
00BE313F call IceCore::Container::Resize (0BE1240h)
00BE3144 mov eax,dword ptr [esi+4]
00BE3147 mov ecx,dword ptr [esi+8]
00BE314A mov edx,dword ptr [esp+34h]
00BE314E movss xmm0,dword ptr [esp+38h]
00BE3154 mov dword ptr [ecx+eax*4],edx
00BE3157 inc dword ptr [esi+4]
00BE315A mov ecx,dword ptr [esp+3Ch]
00BE315E mov eax,dword ptr [esp+28h]
00BE3162 add eax,8
00BE3165 add ebx,10h
00BE3168 add edi,4
00BE316B comiss xmm0,dword ptr [eax]
00BE316E mov dword ptr [esp+28h],eax
00BE3172 jae CompleteBoxPruning+39Ah (0BE30FAh)

There are still 13 instructions overall in the main codepath, with a few twists.

It seems to use three xmm registers instead of two, which looks better.

———————

This is the same as in VC11, but using two instructions instead of one:

00BE30FA movaps xmm1,xmmword ptr [ebx]

00BE3102 cmpltps xmm2,xmm1

xmm2 is still destroyed all the time, and reloaded from the stack:

00BE30FD movaps xmm2,xmmword ptr [esp+40h]

The only difference is that it’s reloaded in our first block, rather than in the third block before. But it’s really all the same otherwise - even the stack offset is the same.

Questions 1 and 2: same.

———————

VC10 seems to manage “MaxLimit” better. It’s kept in xmm0 and only reloaded when an overlap occurs:

00BE314E movss xmm0,dword ptr [esp+38h]

Well done.

Question 3: VC10 wins.

———————

In the third block, what used to be ecx is now eax:

00BE3162 add eax,8

00BE316B comiss xmm0,dword ptr [eax]

00BE316E mov dword ptr [esp+28h],eax

And it’s saved to the stack all the time, like in VC11.

As for edi (previously edx), it is not saved to the stack anymore. Yay!

But eax (previously ecx) is now reloaded from the stack all the time:

00BE315E mov eax,dword ptr [esp+28h]

…while it wasn’t before. Oh.

Question 4: VC11 wins.

Question 5 and 6: the same overall.

———————

So from just looking at this short snippet it is unclear which compiler does a better job. But then if you look at the code before that inner loop, you find this:

0BE2F1B fstp dword ptr [eax-58h]
00BE2F1E fld dword ptr [edx+8]
00BE2F21 fstp dword ptr [eax-54h]
00BE2F24 fld dword ptr [edx+10h]
00BE2F27 fstp dword ptr [eax-50h]
00BE2F2A fld dword ptr [edx+14h]
00BE2F2D fstp dword ptr [eax-4Ch]
00BE2F30 mov edx,dword ptr [esi-14h]
00BE2F33 mov dword ptr [edi-14h],edx
00BE2F36 lea edx,[edx+edx*2]
00BE2F39 fld dword ptr [ebx+edx*8]
00BE2F3C lea edx,[ebx+edx*8]
00BE2F3F fstp dword ptr [ecx-2Ch]
00BE2F42 fld dword ptr [edx+0Ch]
00BE2F45 fstp dword ptr [ecx-28h]
00BE2F48 fld dword ptr [edx+4]
00BE2F4B fstp dword ptr [eax-48h]
00BE2F4E fld dword ptr [edx+8]
00BE2F51 fstp dword ptr [eax-44h]
00BE2F54 fld dword ptr [edx+10h]
00BE2F57 fstp dword ptr [eax-40h]
00BE2F5A fld dword ptr [edx+14h]
00BE2F5D fstp dword ptr [eax-3Ch]
00BE2F60 mov edx,dword ptr [esi-10h]
00BE2F63 mov dword ptr [edi-10h],edx
00BE2F66 lea edx,[edx+edx*2]
00BE2F69 fld dword ptr [ebx+edx*8]
00BE2F6C lea edx,[ebx+edx*8]
00BE2F6F fstp dword ptr [ecx-24h]
00BE2F72 fld dword ptr [edx+0Ch]
00BE2F75 fstp dword ptr [ecx-20h]
00BE2F78 fld dword ptr [edx+4]
00BE2F7B fstp dword ptr [eax-38h]
00BE2F7E fld dword ptr [edx+8]
00BE2F81 fstp dword ptr [eax-34h]
00BE2F84 fld dword ptr [edx+10h]
00BE2F87 fstp dword ptr [eax-30h]
00BE2F8A fld dword ptr [edx+14h]
00BE2F8D fstp dword ptr [eax-2Ch]
00BE2F90 mov edx,dword ptr [esi-0Ch]
00BE2F93 mov dword ptr [edi-0Ch],edx
00BE2F96 lea edx,[edx+edx*2]
00BE2F99 fld dword ptr [ebx+edx*8]
00BE2F9C lea edx,[ebx+edx*8]
00BE2F9F fstp dword ptr [ecx-1Ch]
00BE2FA2 fld dword ptr [edx+0Ch]
00BE2FA5 fstp dword ptr [ecx-18h]
00BE2FA8 fld dword ptr [edx+4]
00BE2FAB fstp dword ptr [eax-28h]
00BE2FAE fld dword ptr [edx+8]
00BE2FB1 fstp dword ptr [eax-24h]
00BE2FB4 fld dword ptr [edx+10h]
00BE2FB7 fstp dword ptr [eax-20h]
00BE2FBA fld dword ptr [edx+14h]
00BE2FBD fstp dword ptr [eax-1Ch]

Yikes!!

Ok: VC10 still used x87 instructions here and there. VC11 did not.

Forget that inner loop analysis: advantage VC11, big time. The benchmark results confirm this, regardless of the instructions count:

Office PC / unsafe version:

Complete test (brute force): found 11811 intersections in 891760 K-cycles.
19379 K-cycles.
18483 K-cycles.
18943 K-cycles.
18093 K-cycles.
18321 K-cycles.
18417 K-cycles.
18078 K-cycles.
19035 K-cycles.
18500 K-cycles.
18092 K-cycles.
18329 K-cycles.
18092 K-cycles.
17899 K-cycles.
18524 K-cycles.
18085 K-cycles.
19308 K-cycles.
Complete test (box pruning): found 11715 intersections in 17899 K-cycles.

=> Much slower than what we got with VC11.

———————

So, VC14 then? Let’s have a look:

00F32F33 cmpltps xmm1,xmmword ptr [edi]
00F32F37 movmskps eax,xmm1
00F32F3A cmp eax,0Ch
00F32F3D jne CompleteBoxPruning+2D1h (0F32F91h)
00F32F3F mov eax,dword ptr [esi+4]
00F32F42 cmp eax,dword ptr [esi]
00F32F44 jne CompleteBoxPruning+28Fh (0F32F4Fh)
00F32F46 push 1
00F32F48 mov ecx,esi
00F32F4A call IceCore::Container::Resize (0F312D0h)
00F32F4F mov ecx,dword ptr [esi+4]
00F32F52 mov eax,dword ptr [esi+8]
00F32F55 mov edx,dword ptr [esp+34h]
00F32F59 mov dword ptr [eax+ecx*4],edx
00F32F5C mov edx,dword ptr [esp+20h]
00F32F60 inc dword ptr [esi+4]
00F32F63 mov eax,dword ptr [esi+4]
00F32F66 mov ecx,dword ptr [edx]
00F32F68 mov dword ptr [esp+30h],ecx
00F32F6C cmp eax,dword ptr [esi]
00F32F6E jne CompleteBoxPruning+2B9h (0F32F79h)
00F32F70 push 1
00F32F72 mov ecx,esi
00F32F74 call IceCore::Container::Resize (0F312D0h)
00F32F79 mov ecx,dword ptr [esi+4]
00F32F7C mov eax,dword ptr [esi+8]
00F32F7F mov edx,dword ptr [esp+30h]
00F32F83 mov dword ptr [eax+ecx*4],edx
00F32F86 inc dword ptr [esi+4]
00F32F89 mov ecx,dword ptr [esp+28h]
00F32F8D mov edx,dword ptr [esp+20h]
00F32F91 movss xmm0,dword ptr [esp+38h]
00F32F97 add ecx,8
00F32F9A movaps xmm1,xmmword ptr [esp+40h]
00F32F9F add edx,4
00F32FA2 add edi,10h
00F32FA5 mov dword ptr [esp+20h],edx
00F32FA9 mov dword ptr [esp+28h],ecx
00F32FAD comiss xmm0,dword ptr [ecx]
00F32FB0 jae CompleteBoxPruning+273h (0F32F33h)

This one will be quick: some instructions have been re-ordered in the third block, but other than that it’s exactly the same code as VC11’s. And pretty much exactly the same performance.

So, nope, a newer compiler doesn’t solve the issues.

But are they really issues? There’s more to performance than the instructions count, as we saw before. Are we chasing a red herring?

To find our answers, we must go closer to the metal. And that’s what we will do next time.

———————

What we learnt:

Different compilers will give different results.

Compilers are getting better all the time, but they’re still not perfect.

std::vector sucks.

GitHub code for part 11

Box pruning revisited - part 10 - integer SIMD

February 14th, 2017

Part 10 – integer SIMD

Interlude:

Similar to what we did in version 7, and in sake of completeness, I tried using integer comparisons in the SIMD version as well.

The changes are straightforward: encode floats as integers like we did in version 7, replace SIMD floating-point intrinsics with SIMD integer intrinsics.

There are no special traps here, and not much to report, because at the end of the day the results are slower:

Home PC:

Complete test (brute force): found 11811 intersections in 781780 K-cycles.
17454 K-cycles.
16681 K-cycles.
16669 K-cycles.
17145 K-cycles.
16703 K-cycles.
16683 K-cycles.
16668 K-cycles.
16667 K-cycles.
16862 K-cycles.
16707 K-cycles.
16670 K-cycles.
16689 K-cycles.
16668 K-cycles.
16949 K-cycles.
16710 K-cycles.
16667 K-cycles.
Complete test (box pruning): found 11715 intersections in 16667 K-cycles.

Office PC:

Complete test (brute force): found 11811 intersections in 810906 K-cycles.
16607 K-cycles.
15927 K-cycles.
15648 K-cycles.
15971 K-cycles.
15648 K-cycles.
15960 K-cycles.
15742 K-cycles.
15990 K-cycles.
15837 K-cycles.
15741 K-cycles.
15970 K-cycles.
15651 K-cycles.
16247 K-cycles.
15649 K-cycles.
15834 K-cycles.
15738 K-cycles.
Complete test (box pruning): found 11715 intersections in 15648 K-cycles.

The gains are summarized here:

Home PC

Timings (K-Cycles)

Delta (K-Cycles)

Speedup

Overall X factor

(Version1)

(101662)

Version2 - base

98822

0

0%

1.0

Version3

93138

~5600

~5%

~1.06

Version4

81834

~11000

~12%

~1.20

Version5

78140

~3600

~4%

~1.26

Version6a

60579

~17000

~22%

~1.63

Version6b

41605

~18000

~31%

~2.37

(Version7)

(40906)

-

-

-

(Version8)

(31383)

(~10000)

(~24%)

(~3.14)

Version9a

34486

~7100

~17%

~2.86

Version9b - unsafe

32477

~2000

~5%

~3.04

Version9b - safe

32565

~1900

~5%

~3.03

Version9c - unsafe

16223

~16000

~50%

~6.09

Version9c - safe

14802

~17000

~54%

~6.67

Version10

(16667)

-

-

-

Office PC

Timings (K-Cycles)

Delta (K-Cycles)

Speedup

Overall X factor

(Version1)

(96203)

Version2 - base

92885

0

0%

1.0

Version3

88352

~4500

~5%

~1.05

Version4

77156

~11000

~12%

~1.20

Version5

73778

~3300

~4%

~1.25

Version6a

58451

~15000

~20%

~1.58

Version6b

45634

~12000

~21%

~2.03

(Version7)

(43987)

-

-

-

(Version8)

(29083)

(~16000)

(~36%)

(~3.19)

Version9a

31864

~13000

~30%

~2.91

Version9b - unsafe

15097

~16000

~52%

~6.15

Version9b - safe

15116

~16000

~52%

~6.14

Version9c - unsafe

12707

~2300

~15%

~7.30

Version9c - safe

12562

~2500

~16%

~7.39

Version10

(15648)

-

-

-

What we learnt:

Don’t bother with integer comparisons anymore.

We are still using integer SIMD in PhysX for this, so it appears that the PhysX version is sub-optimal. Expect some performance gains in PhysX 3.5.

GitHub code for part 10

Box pruning revisited - part 9c - data alignment

February 14th, 2017

Part 9c – data alignment

Last time we were left with a bit of a mystery. Our optimizations produced consistent results so far, on both machines the code was benchmarked on.

However things changed when switching to SIMD.

We have a naive SIMD implementation (9a) producing these results:

Home PC: 34486 K-cycles.

Office PC: 31864 K-cycles.

And an optimized SIMD implementation (9b) producing these results (for the “safe” version, but timings are similar for the unsafe one):

Home PC: 32565 K-cycles.

Office PC: 15116 K-cycles.

In other words, the change from naive to optimized SIMD had almost no effect on one machine, but it made things run twice faster on the other machine.

By now you should know what to do next: check the disassembly. Recall that the naive version looked like this:

The optimized version (9b) however looks much better:

As expected the scalar loads (movss) and unpacking instructions (unpcklps) all vanished, replaced with a single “proper” SIMD load (movups). In both cases the last 4 instructions are similar, and implement the comparison / test / branch that didn’t change between the naive & optimized functions.

In other words, since the timings are similar, it looks like on the home PC the single movups load is as costly as the large block of 14 movss/unpcklps from the naive version.

I suppose that’s a lesson in itself: all instructions don’t have the same cost. The number of instructions is not a reliable way to evaluate performance.

At this point you could fire up a profiler like V-Tune which would probably reveal what’s going on (I didn’t try), or you could just observe, formulate a theory, and design an experiment to test the theory. You know, the scientific method.

Observe:

movups xmm0,xmmword ptr [eax+8]

There’s only one line to observe really so it doesn’t take a rocket scientist to spot the only potential issue here: that u is for unaligned, and it should ideally be an a, for aligned.

That is, we should have a movaps instead of a movups here. In intrinsics-speak, we’ve been using _mm_loadu_ps while we should ideally have used _mm_load_ps.

Could unaligned loads have such an impact? Well that’s our theory now. Let’s test it.

We used unaligned loads because we had no choice: the size of our AABB classes was 24 bytes (6 floats). To be able to use aligned loads, we’d need a multiple of 16 bytes.

So we could add some padding bytes to reach sizeof(AABB)==32, but that would consume more memory, which means also more cache misses.

Or we could just split our AABB in two separate classes, like this:

We would then have to allocate two internal arrays instead of one to store the bounds:

On one hand we would then have to compute two addresses within our inner loop instead of one, to load data from two different memory addresses instead of one before:

That’s likely to be slower. On the other hand we would then be able to use aligned loads for the YZ part, and that should be a win.

In some cases the quickest way to know is simply to try. The disassembly becomes:

No movaps!

But the compiler was able to merge the move directly in the cmp instruction, which is similar - it’s an aligned load.

And the results apparently confirm our theory:

Home PC:

Unsafe version:

Complete test (brute force): found 11811 intersections in 781500 K-cycles.
17426 K-cycles.
16272 K-cycles.
16223 K-cycles.
16239 K-cycles.
16224 K-cycles.
16226 K-cycles.
16289 K-cycles.
16248 K-cycles.
16448 K-cycles.
16240 K-cycles.
16224 K-cycles.
16398 K-cycles.
16251 K-cycles.
16485 K-cycles.
16239 K-cycles.
16225 K-cycles.
Complete test (box pruning): found 11715 intersections in 16223 K-cycles.

Safe version:

Complete test (brute force): found 11811 intersections in 781569 K-cycles.
15629 K-cycles.
14813 K-cycles.
14841 K-cycles.
14827 K-cycles.
14803 K-cycles.
14933 K-cycles.
14821 K-cycles.
14803 K-cycles.
15048 K-cycles.
14833 K-cycles.
14803 K-cycles.
14817 K-cycles.
14802 K-cycles.
14802 K-cycles.
14802 K-cycles.
14998 K-cycles.
Complete test (box pruning): found 11811 intersections in 14802 K-cycles.

Office PC:

Unsafe version:

Complete test (brute force): found 11811 intersections in 826531 K-cycles.
13634 K-cycles.
12709 K-cycles.
12707 K-cycles.
13063 K-cycles.
12858 K-cycles.
13824 K-cycles.
13357 K-cycles.
13502 K-cycles.
12863 K-cycles.
13041 K-cycles.
12711 K-cycles.
12932 K-cycles.
12711 K-cycles.
12709 K-cycles.
12928 K-cycles.
12851 K-cycles.
Complete test (box pruning): found 11715 intersections in 12707 K-cycles.

Safe version:

Complete test (brute force): found 11811 intersections in 813914 K-cycles.
13582 K-cycles.
12562 K-cycles.
12825 K-cycles.
13048 K-cycles.
12806 K-cycles.
12562 K-cycles.
12915 K-cycles.
12627 K-cycles.
13216 K-cycles.
12899 K-cycles.
13546 K-cycles.
13163 K-cycles.
12572 K-cycles.
12769 K-cycles.
12662 K-cycles.
12938 K-cycles.
Complete test (box pruning): found 11811 intersections in 12562 K-cycles.

The gains are summarized here:

Home PC

Timings (K-Cycles)

Delta (K-Cycles)

Speedup

Overall X factor

(Version1)

(101662)

Version2 - base

98822

0

0%

1.0

Version3

93138

~5600

~5%

~1.06

Version4

81834

~11000

~12%

~1.20

Version5

78140

~3600

~4%

~1.26

Version6a

60579

~17000

~22%

~1.63

Version6b

41605

~18000

~31%

~2.37

(Version7)

(40906)

-

-

-

(Version8)

(31383)

(~10000)

(~24%)

(~3.14)

Version9a

34486

~7100

~17%

~2.86

Version9b - unsafe

32477

~2000

~5%

~3.04

Version9b - safe

32565

~1900

~5%

~3.03

Version9c - unsafe

16223

~16000

~50%

~6.09

Version9c - safe

14802

~17000

~54%

~6.67

Office PC

Timings (K-Cycles)

Delta (K-Cycles)

Speedup

Overall X factor

(Version1)

(96203)

Version2 - base

92885

0

0%

1.0

Version3

88352

~4500

~5%

~1.05

Version4

77156

~11000

~12%

~1.20

Version5

73778

~3300

~4%

~1.25

Version6a

58451

~15000

~20%

~1.58

Version6b

45634

~12000

~21%

~2.03

(Version7)

(43987)

-

-

-

(Version8)

(29083)

(~16000)

(~36%)

(~3.19)

Version9a

31864

~13000

~30%

~2.91

Version9b - unsafe

15097

~16000

~52%

~6.15

Version9b - safe

15116

~16000

~52%

~6.14

Version9c - unsafe

12707

~2300

~15%

~7.30

Version9c - safe

12562

~2500

~16%

~7.39

So the code on the home PC became about twice as fast as before, which makes it roughly as fast as what we got on the office PC for the previous version.

On the office PC, we only saved about 2500 K-Cycles, which means the aligned loads are indeed faster but not that much faster.

In any case, the important point is that our SIMD version is finally always faster than our best scalar version so far (version 8). Pheeeew!

Mystery solved?

Maybe, maybe not.

One question that pops up now is: why is the “safe” version measurably faster than the “unsafe” version on the home PC (14802 vs 16223), even though the safe version does a little bit more work? What’s going on here?

I was ready to move on but that stuff caught my eyes. I couldn’t explain it so I stayed a bit longer. Observed. And designed a new test to confirm the theory that the mystery had indeed been solved.

If the unaligned loads were indeed responsible for making the code 2X faster, then switching back to movups in version 9c (while keeping the separate classes) should also bring back the performance of version 9b – on the home PC in particular.

But. It. Did. Not.

On the home PC, changing the _mm_load_ps to _mm_loadu_ps (while keeping the separate classes, i.e. while keeping the data buffers 16-byte aligned) gave these results:

Safe version:

Complete test (box pruning): found 11811 intersections in 16513 K-cycles.

Unsafe version:

Complete test (box pruning): found 11715 intersections in 19180 K-cycles.

i.e. the cost of unaligned loads for the safe version was 16513 - 14802 = 1711 K-Cycles.

And the cost of unaligned loads for the unsafe version was 19180 - 16223 = 2957 K-Cycles.

That last number is pretty similar to the cost of unaligned loads on the office PC (as we mentioned, ~2500 K-Cycles). So that number is pretty consistent, and seems to indicate the real cost of movups vs movaps: roughly 2000 to 3000 K-Cycles, on both machines.

But then…

why did we go from ~32000 to ~16000 K-Cycles between versions 9b and 9c on the home PC? Where is that coming from, if not from movups?

The mystery was not solved, after all.

New questions, new theories, new tests. The only other change we did was the switch from “new” to “_aligned_malloc“. The SIMD_AABB_YZ has a force-inlined empty ctor, so this couldn’t be a case of calling a slow compiler-generated ctor a large number of times (this things happen). Just to test the theory that it had something to do with this change though, I switched from “new” to “_aligned_malloc” in version 9b (without splitting the classes).

And lo and behold, this change alone made version 9b go from ~32000 K-Cycles to ~18000 K-Cycles at home (for both safe & unsafe versions).

So that’s where the big performance drop was coming from!

After this line was identified, figuring out the “why” was a walk in the park. In short: the performance of movups depends on the alignment of the returned address.

In version 9b, on the home PC:

  • If the bounds array is aligned on a 4-bytes boundary, the function takes ~33000 K-Cycles.
  • If the bounds array is aligned on an 8-bytes boundary, the function takes ~19000 K-Cycles.
  • If the bounds array is aligned on a 16-bytes boundary, the function takes ~18000 K-Cycles.

So there is a huge hit for the 4-bytes boundary case, and we were previously unknowingly hitting it. Switching to _aligned_malloc, or using __declspec(align(N)) on the SIMD_AABB class, fixes the issue in version 9b.

Let’s recap for the home PC:

Instruction

Alignment

Version

Timings

movups

data 4-bytes aligned

single class (version9b)

33000

movups

data 8-bytes aligned

single class (version9b)

19000

movups

data 16-bytes aligned

single class (version9b)

18000

movups

data 16-bytes aligned

separate classes (version9c safe)

16000

movups

data 16-bytes aligned

separate classes (version9c unsafe)

19000

movaps

data 16-bytes aligned

separate classes (version9c safe)

14000

movaps

data 16-bytes aligned

separate classes (version9c unsafe)

16000

What this all means is that the initial theory was actually correct (the performance cost was indeed because of unaligned loads), but there was a subtlety in there: there’s the cost of the movups instruction itself compared to movaps (which is constant at ~2000 / 3000 K-Cycles) and the cost of the data-loading with movups (which is variable and sometimes considerably more expensive).

We also see that version 9b 16-bytes aligned can be faster than version 9c using movups (18000 vs 19000). That is probably because as we mentioned, version 9c needs to compute two addresses now.

And then finally, with movaps, the “safe” version 9c is still faster than the “unsafe” version 9c.

Why?

I still don’t know.

I would have to try V-Tune at that point. For now, I will be happy to admit that I don’t have an answer, and I don’t see what test I could add to find one.

In any case the performance difference is not large, and since the fastest version is also the safest, there is little motivation to go to the bottom of it.

But the main reason for not bothering is that, spoiler: that difference will vanish in the next post anyway.

What we learnt:

SIMD is hard. It took 3 versions for it to become faster than our scalar code.

The cost of unaligned loads can vary greatly depending on data alignment.

Data alignment has an impact on performance even with unaligned loads.

GitHub code for part 9c

Box pruning revisited - part 9b - better SIMD

February 14th, 2017

Part 9b – better SIMD

Last time we wrote an initial SIMD version that ended up slower than the scalar code. We then identified the _mm_set_ps intrinsics as a potential source of issues.

So today we want to replace _mm_set_ps with “real” SIMD loads: _mm_load_ps (for aligned loads) or _mm_loadu_ps (for unaligned loads). These are the intrinsics that each map to a single assembly instruction (respectively movaps and movups). They load 4 contiguous floats into a SIMD register, so we will need to reorder the data. If we go back to the initial C++ code and rearrange it so that all elements of the same box are on the same side, we get:

i.e. the a’s are all on the left side, and the b’s are all on the right side.

This is going to be tricky since we have a mix of “greater than” and “lesser than” comparisons in the middle. Let’s ignore it for now and focus on the loads.

For a given box, we want to load the y’s and z’s (both min and max) with a single load. Unfortunately the AABB class so far looks like this:

That is, switching back to plain floats:

With this format we cannot load the y’s and z’s at the same time, since they’re interleaved with the x’s. So, we change the format, and introduce this new AABB class:

It’s the same data as before, we just put the x’s first. That way the 4 floats we are interested in are stored contiguously in memory, and we can load them to an SSE register with a single assembly instruction. Which gives us this in-register layout, for two boxes:

__m128 box0 (a)

mMinY

mMinZ

mMaxY

mMaxZ

__m128 box1 (b)

mMinY

mMinZ

mMaxY

mMaxZ

Go back to the C++ code and you’ll see that we need to compare minimums of (a) to maximums of (b). So we need to reorder the elements for one of the two boxes.

Since one of the box (say box0) is constant for the duration of the loop, it makes more sense to reorder the constant box (only once), instead of doing the reordering inside the loop. This is done with a single shuffle instruction (_mm_shuffle_ps), and we end up with:

__m128 box0 (a)

mMaxY

mMaxZ

mMinY

mMinZ

__m128 box1 (b)

mMinY

mMinZ

mMaxY

mMaxZ

Just to see where we stand, let’s do a SIMD comparison (_mm_cmpgt_ps) between (a) and (b). This computes:

a.mMaxY > b.mMinY

a.mMaxZ > b.mMinZ

a.mMinY > b.mMaxY

a.mMinZ > b.mMaxZ

The two last lines are actually directly what we need:

a.mMinY > b.mMaxY => that’s (1)

a.mMinZ > b.mMaxZ => that’s (3)

The two other lines compute almost what we want, but not quite.

a.mMaxY > b.mMinY => that’s the opposite of (2)

a.mMaxZ > b.mMinZ => that’s the opposite of (4)

Well the opposite is not bad, it also gives us the information we need, just by flipping the result. The flipping is virtual and free: we are not going to actually do it. It just means that we test the movemask result against a different expected value. That is, if the previous expected value was, say, 1111b in binary, we can just flip two of the bits for which we get the opposite results, and test against, say 1100b instead. It gives us the same information, and the proper overlap test results…

…except in the equality case.

If a.mMaxY==b.mMinY for example:

  • the C++ code tests “a.mMaxY < b.mMinY” and returns 0, which is the desired reference result.
  • the SIMD code tests “a.mMaxY > b.mMinY” and also returns 0, which gives us a 1 after flipping.

Replacing the “greater than” comparison with “greater or equal” wouldn’t work, because what we would really need is a SIMD comparison that performs the former on two of the components, and the later on the two others. So, the way it stands, the SIMD code is equivalent to this:

Almost what we wanted!

And in fact, we could very well accept the differences and stop here.

After all, the equality case is often a bit fuzzy. Don’t forget that we are dealing with floats here. It is bad form to use exact equality when comparing floats. When using the box pruning function for the broadphase in a physics engine, the equality case is meaningless since the bounding boxes are constantly evolving, recomputed each frame, and there is always a bit of noise in the results due to the nature of floating point arithmetic. The results will depend on how the compiler reorganized the instructions, whether x87 or SIMD is used, it will depend on the internal FPU accuracy of x87 registers, i.e. the state of the FPU control word, and so on. Basically the case where the resulting bounding boxes are exactly touching is so ill-defined that it really doesn’t make any difference if the SIMD code uses “>” or “>=” on some of the components.

If we accept this limitation, we can run the test again and see the results:

Home PC:

Complete test (brute force): found 11811 intersections in 781912 K-cycles.
33554 K-cycles.
32514 K-cycles.
32803 K-cycles.
32498 K-cycles.
32486 K-cycles.
32487 K-cycles.
32499 K-cycles.
32488 K-cycles.
32669 K-cycles.
32672 K-cycles.
32855 K-cycles.
32537 K-cycles.
32504 K-cycles.
32477 K-cycles.
32674 K-cycles.
32505 K-cycles.
Complete test (box pruning): found 11725 intersections in 32477 K-cycles.

Office PC:

Complete test (brute force): found 11811 intersections in 808659 K-cycles.
15775 K-cycles.
15266 K-cycles.
15097 K-cycles.
15302 K-cycles.
15259 K-cycles.
15554 K-cycles.
15560 K-cycles.
16137 K-cycles.
15709 K-cycles.
15109 K-cycles.
15317 K-cycles.
15149 K-cycles.
15238 K-cycles.
15239 K-cycles.
15525 K-cycles.
15097 K-cycles.
Complete test (box pruning): found 11725 intersections in 15097 K-cycles.

The gains are summarized here:

Home PC

Timings (K-Cycles)

Delta (K-Cycles)

Speedup

Overall X factor

(Version1)

(101662)

Version2 - base

98822

0

0%

1.0

Version3

93138

~5600

~5%

~1.06

Version4

81834

~11000

~12%

~1.20

Version5

78140

~3600

~4%

~1.26

Version6a

60579

~17000

~22%

~1.63

Version6b

41605

~18000

~31%

~2.37

(Version7)

(40906)

-

-

-

(Version8)

(31383)

(~10000)

(~24%)

(~3.14)

Version9a

34486

~7100

~17%

~2.86

Version9b

32477

~2000

~5%

~3.04

Office PC

Timings (K-Cycles)

Delta (K-Cycles)

Speedup

Overall X factor

(Version1)

(96203)

Version2 - base

92885

0

0%

1.0

Version3

88352

~4500