Archive for February 14th, 2017

Box pruning revisited - part 11 - the last branch

Tuesday, February 14th, 2017

Part 11 – the last branch

The SIMD version gave us a good speedup. The best scalar version was also quite successful, thanks to the removal of several hard-to-predict branches.

Looking at the remaining code with this in mind, one more branch jumps out, just before the SIMD overlap test:
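(A hypothetical reconstruction, with names borrowed from the disassembly shown further down - the actual snippet is an image in the original post:)

// Hypothetical sketch, not the verbatim code:
if(Index0!=Index1)  // never test a box against itself
{
    if(SIMD_OVERLAP_TEST(BoxListYZ[Index1]))
        pairs.Add(RIndex0).Add(Remap[Index1]);
}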

This ensures that we never test a box against itself, but I don't quite remember why it's there: the way the code is written, that case cannot happen. So this test can just be removed. And since the condition never triggers, the branch is likely to be correctly predicted every time, so the expected gains are small, if any.

But still, worth doing - it’s just one line.

And the results are actually interesting:

Home PC:

Unsafe version:

Complete test (brute force): found 11811 intersections in 781447 K-cycles.
15224 K-cycles.
14386 K-cycles.
14633 K-cycles.
14387 K-cycles.
14376 K-cycles.
14392 K-cycles.
14376 K-cycles.
14373 K-cycles.
14579 K-cycles.
14400 K-cycles.
14374 K-cycles.
14388 K-cycles.
14373 K-cycles.
14372 K-cycles.
14471 K-cycles.
14404 K-cycles.
Complete test (box pruning): found 11715 intersections in 14372 K-cycles.

Safe version:

Complete test (brute force): found 11811 intersections in 781701 K-cycles.
15306 K-cycles.
14533 K-cycles.
14513 K-cycles.
14731 K-cycles.
14552 K-cycles.
14512 K-cycles.
14528 K-cycles.
14514 K-cycles.
14514 K-cycles.
14528 K-cycles.
14535 K-cycles.
14513 K-cycles.
14526 K-cycles.
14515 K-cycles.
14515 K-cycles.
14621 K-cycles.
Complete test (box pruning): found 11811 intersections in 14512 K-cycles.

Office PC:

Unsafe version:

Complete test (brute force): found 11811 intersections in 824867 K-cycles.
13618 K-cycles.
12900 K-cycles.
12573 K-cycles.
12957 K-cycles.
12570 K-cycles.
12911 K-cycles.
12572 K-cycles.
12573 K-cycles.
12588 K-cycles.
13153 K-cycles.
13447 K-cycles.
13212 K-cycles.
13429 K-cycles.
13214 K-cycles.
13527 K-cycles.
13229 K-cycles.
Complete test (box pruning): found 11715 intersections in 12570 K-cycles.

Safe version:

Complete test (brute force): found 11811 intersections in 816240 K-cycles.
13227 K-cycles.
13013 K-cycles.
12621 K-cycles.
14329 K-cycles.
12624 K-cycles.
13009 K-cycles.
13533 K-cycles.
13171 K-cycles.
12881 K-cycles.
13181 K-cycles.
13265 K-cycles.
13248 K-cycles.
13245 K-cycles.
13242 K-cycles.
13267 K-cycles.
12611 K-cycles.
Complete test (box pruning): found 11811 intersections in 12611 K-cycles.

The gains are summarized here:

Home PC            | Timings (K-Cycles) | Delta (K-Cycles) | Speedup | Overall X factor
(Version1)         | (101662)           |                  |         |
Version2 - base    | 98822              | 0                | 0%      | 1.0
Version3           | 93138              | ~5600            | ~5%     | ~1.06
Version4           | 81834              | ~11000           | ~12%    | ~1.20
Version5           | 78140              | ~3600            | ~4%     | ~1.26
Version6a          | 60579              | ~17000           | ~22%    | ~1.63
Version6b          | 41605              | ~18000           | ~31%    | ~2.37
(Version7)         | (40906)            | -                | -       | -
(Version8)         | (31383)            | (~10000)         | (~24%)  | (~3.14)
Version9a          | 34486              | ~7100            | ~17%    | ~2.86
Version9b - unsafe | 32477              | ~2000            | ~5%     | ~3.04
Version9b - safe   | 32565              | ~1900            | ~5%     | ~3.03
Version9c - unsafe | 16223              | ~16000           | ~50%    | ~6.09
Version9c - safe   | 14802              | ~17000           | ~54%    | ~6.67
(Version10)        | (16667)            | -                | -       | -
Version11 - unsafe | 14372              | ~1800            | ~11%    | ~6.87
Version11 - safe   | 14512              | ~200             | ~2%     | ~6.80

Office PC          | Timings (K-Cycles) | Delta (K-Cycles) | Speedup | Overall X factor
(Version1)         | (96203)            |                  |         |
Version2 - base    | 92885              | 0                | 0%      | 1.0
Version3           | 88352              | ~4500            | ~5%     | ~1.05
Version4           | 77156              | ~11000           | ~12%    | ~1.20
Version5           | 73778              | ~3300            | ~4%     | ~1.25
Version6a          | 58451              | ~15000           | ~20%    | ~1.58
Version6b          | 45634              | ~12000           | ~21%    | ~2.03
(Version7)         | (43987)            | -                | -       | -
(Version8)         | (29083)            | (~16000)         | (~36%)  | (~3.19)
Version9a          | 31864              | ~13000           | ~30%    | ~2.91
Version9b - unsafe | 15097              | ~16000           | ~52%    | ~6.15
Version9b - safe   | 15116              | ~16000           | ~52%    | ~6.14
Version9c - unsafe | 12707              | ~2300            | ~15%    | ~7.30
Version9c - safe   | 12562              | ~2500            | ~16%    | ~7.39
(Version10)        | (15648)            | -                | -       | -
Version11 - unsafe | 12570              | ~100             | ~1%     | ~7.38
Version11 - safe   | 12611              | -                | -       | ~7.36

Version 11 is compared to version 9c here.

No difference on the office PC, but on the home PC the safe & unsafe versions are suddenly the same speed. Hmmm. We’ll never solve that particular mystery then. Oh well. All is good now at least.

Now what? Are we done?

We are about 7x faster than when we started, which is pretty good!

…but not quite the order of magnitude we wanted to reach.

At this point there isn’t much left in the inner loop, at least on the C++ side of things. So let’s check the new disassembly. The inner loop we’re interested in is:
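(The loop itself is an image in the original post; a sketch reconstructed from the source-line comments embedded in the disassembly below - SIMD_OVERLAP_TEST, pairs, Remap, BoxListX and BoxListYZ all come from those comments:)

// Reconstructed sketch, not the verbatim code:
while(BoxListX[Index1].mMinX <= MaxLimit)
{
    if(SIMD_OVERLAP_TEST(BoxListYZ[Index1]))
        pairs.Add(RIndex0).Add(Remap[Index1]);
    Index1++;
}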

And the corresponding disassembly is simple enough to isolate:

//SIMD_OVERLAP_TEST(BoxListYZ[Index1])
003E2FB3 cmpltps xmm1,xmmword ptr [edi]
003E2FB7 movmskps eax,xmm1
003E2FBA cmp eax,0Ch
003E2FBD jne CompleteBoxPruning+2D1h (03E3011h)
//pairs.Add(RIndex0).Add(Remap[Index1]);
003E2FBF mov eax,dword ptr [esi+4]
003E2FC2 cmp eax,dword ptr [esi]
003E2FC4 jne CompleteBoxPruning+28Fh (03E2FCFh)
003E2FC6 push 1
003E2FC8 mov ecx,esi
003E2FCA call IceCore::Container::Resize (03E1350h)
003E2FCF mov ecx,dword ptr [esi+4]
003E2FD2 mov eax,dword ptr [esi+8]
003E2FD5 mov edx,dword ptr [esp+34h]
003E2FD9 mov dword ptr [eax+ecx*4],edx
003E2FDC inc dword ptr [esi+4]
003E2FDF mov edx,dword ptr [esp+20h]
003E2FE3 mov eax,dword ptr [esi+4]
003E2FE6 mov ecx,dword ptr [edx]
003E2FE8 mov dword ptr [esp+30h],ecx
003E2FEC cmp eax,dword ptr [esi]
003E2FEE jne CompleteBoxPruning+2B9h (03E2FF9h)
003E2FF0 push 1
003E2FF2 mov ecx,esi
003E2FF4 call IceCore::Container::Resize (03E1350h)
003E2FF9 mov ecx,dword ptr [esi+4]
003E2FFC mov eax,dword ptr [esi+8]
003E2FFF mov edx,dword ptr [esp+30h]
003E3003 mov dword ptr [eax+ecx*4],edx
003E3006 inc dword ptr [esi+4]
003E3009 mov ecx,dword ptr [esp+28h]
003E300D mov edx,dword ptr [esp+20h]
// while(BoxListX[Index1].mMinX<=MaxLimit)
003E3011 movss xmm0,dword ptr [esp+38h]
003E3017 movaps xmm1,xmmword ptr [esp+40h]
003E301C add ecx,8
003E301F add edx,4
003E3022 add edi,10h
003E3025 comiss xmm0,dword ptr [ecx]
003E3028 mov dword ptr [esp+20h],edx
003E302C mov dword ptr [esp+28h],ecx
003E3030 jae CompleteBoxPruning+273h (03E2FB3h)

I added two blank lines and color-coded the whole thing to clearly delimit three blocks of code.

Roughly, the first block is the SIMD overlap test, the second block is for writing the pair indices when an overlap is found, and the last block is the while loop (easily identified by the comiss instruction).

It’s… not great.

Where do I start?

———————

In the first two lines, since the xmmword ptr [edi] is the actual load of BoxListYZ[Index1], writing it this way means that xmm1 contains our constant box for the loop (the preloaded Box0YZ in the C++ code):

003E2FB3 cmpltps xmm1,xmmword ptr [edi]

003E2FB7 movmskps eax,xmm1

But the cmpltps instruction is always going to destroy the contents of xmm1. So this register will need to be reloaded for the next iteration. And indeed, that’s what happens in the third block:

003E3017 movaps xmm1,xmmword ptr [esp+40h]

We also confirm that edi is the BoxListYZ array, thanks to:

003E3022 add edi,10h

10h == 16 == sizeof(SIMD_AABB_YZ), makes sense.

Question 1: why write the code in such a way that it destroys the constant box?

Question 2: why reload the constant box from the stack? Why not keep it in a register? The whole loop only uses two xmm registers (!) so it's not like there's a shortage of them.

———————

In the third block, we have this:

003E301C add ecx,8

And then:

003E3025 comiss xmm0,dword ptr [ecx]

Which means that ecx is the BoxListX array (with sizeof(SIMD_AABB_X)==8), and dword ptr [ecx] (a scalar load) is the assembly for “BoxListX[Index1].mMinX” in the C++ code.

Makes sense.

But then it also means that xmm0 is "MaxLimit", which gets reloaded from the stack just before:

003E3011 movss xmm0,dword ptr [esp+38h]

Doesn’t make sense.

Question 3: Why do you reload it from the stack each time? It’s a constant for the whole loop. There are free available xmm registers. Just put it in xmm2, keep it there, and the “movss” instruction above just vanishes.

———————

In the third block again, we have this:

003E301F add edx,4

003E3028 mov dword ptr [esp+20h],edx

003E302C mov dword ptr [esp+28h],ecx

We saw ecx before, it’s the BoxListX array.

Question 4: why do you push it to the stack? Why don’t you keep it in a register?

———————

Then there’s edx, incremented by 4 and also pushed to the stack. If you look for [esp+20h] to check where we read it again, you find two places:

003E2FDF mov edx,dword ptr [esp+20h]

And:

003E300D mov edx,dword ptr [esp+20h]

The first place happens after the first “push back” call, and corresponds to this C++ code:

Add(Remap[Index1])

The second place happens after the second “push back” call, and simply restores the edx register, which is not actually used when the SIMD-overlap test doesn’t find an overlap.

You guessed it, edx tracks Index1 - or more precisely it holds the address of Remap[Index1], as you can see from these lines in the second block:

003E2FDF mov edx,dword ptr [esp+20h]

003E2FE6 mov ecx,dword ptr [edx]

Right. Ok. But then…

Question 5: why do you save edx to the stack all the time (address 003E3028)? Why don’t you just save and restore it only when an overlap actually occurs?

Question 6: same question for ecx.

———————

As for the second block, it looks fine overall. Still, I know some people will ask this one:

Question 7: why do you use a custom vector class instead of std::vector?

Ok, let’s quickly deal with that one first, it won’t take long. Replace the custom container with std::vector and you get:

Home PC:

Unsafe:

Complete test (brute force): found 11811 intersections in 851569 K-cycles.
19727 K-cycles.
18913 K-cycles.
18881 K-cycles.
18879 K-cycles.
18908 K-cycles.
18901 K-cycles.
18891 K-cycles.
18882 K-cycles.
18878 K-cycles.
19110 K-cycles.
18902 K-cycles.
18901 K-cycles.
18881 K-cycles.
18895 K-cycles.
18904 K-cycles.
18882 K-cycles.
Complete test (box pruning): found 11715 intersections in 18878 K-cycles.

Office PC:

Unsafe:

Complete test (brute force): found 11811 intersections in 888075 K-cycles.
19200 K-cycles.
18307 K-cycles.
18584 K-cycles.
18729 K-cycles.
18306 K-cycles.
18647 K-cycles.
18571 K-cycles.
18306 K-cycles.
18767 K-cycles.
18551 K-cycles.
18420 K-cycles.
18659 K-cycles.
18530 K-cycles.
18365 K-cycles.
18491 K-cycles.
18310 K-cycles.
Complete test (box pruning): found 11715 intersections in 18306 K-cycles.

That is:

Home PC                            | Timings
Version11 - unsafe - ICE container | 14372
Version11 - unsafe - std::vector   | 18878

Office PC                          | Timings
Version11 - unsafe - ICE container | 12570
Version11 - unsafe - std::vector   | 18306

Just look at the numbers. That’s why I’m not using std::vector. It has never been able to implement a simple “push_back” correctly. It’s 2017 and std::vector still cannot match the first basic C++ class I wrote back in 1999.

There is no need to tell me about STLPort or EASTL or whatever: I’ve heard it all. If I cannot rely on the version included with Visual Studio, if I have to switch to some external library, I’d rather use my own. At least I know it’s properly implemented.

Just for “fun”, with the custom container the second block above is about 27 lines of code. Right?

Ok, now take a deep breath, here’s the equivalent second block with std::vector:

011836B0 mov ecx,dword ptr [esi+4]
011836B3 lea eax,[esp+28h]
011836B7 cmp eax,ecx
011836B9 jae CompleteBoxPruning+316h (01183736h)
011836BB mov edi,dword ptr [esi]
011836BD cmp edi,eax
011836BF ja CompleteBoxPruning+316h (01183736h)
011836C1 mov edx,dword ptr [esi+8]
011836C4 sub eax,edi
011836C6 sar eax,2
011836C9 mov dword ptr [esp+40h],eax
011836CD cmp ecx,edx
011836CF jne CompleteBoxPruning+302h (01183722h)
011836D1 mov eax,edx
011836D3 sub eax,ecx
011836D5 sar eax,2
011836D8 cmp eax,1
011836DB jae CompleteBoxPruning+302h (01183722h)
011836DD sub ecx,edi
011836DF sar ecx,2
011836E2 mov eax,3FFFFFFFh
011836E7 sub eax,ecx
011836E9 cmp eax,1
011836EC jb CompleteBoxPruning+4F9h (01183919h)
011836F2 inc ecx
011836F3 sub edx,edi
011836F5 sar edx,2
011836F8 mov dword ptr [esp+3Ch],ecx
011836FC mov ecx,edx
011836FE shr ecx,1
01183700 mov eax,3FFFFFFFh
01183705 sub eax,ecx
01183707 cmp eax,edx
01183709 jae CompleteBoxPruning+2EFh (0118370Fh)
0118370B xor edx,edx
0118370D jmp CompleteBoxPruning+2F1h (01183711h)
0118370F add edx,ecx
01183711 cmp edx,dword ptr [esp+3Ch]
01183715 mov ecx,esi
01183717 cmovb edx,dword ptr [esp+3Ch]
0118371C push edx
0118371D call std::vector<unsigned int,std::allocator<unsigned int> >::_Reallocate (011828F0h)
01183722 mov edx,dword ptr [esi+4]
01183725 test edx,edx
01183727 je CompleteBoxPruning+37Dh (0118379Dh)
01183729 mov eax,dword ptr [esi]
0118372B mov ecx,dword ptr [esp+40h]
0118372F mov eax,dword ptr [eax+ecx*4]
01183732 mov dword ptr [edx],eax
01183734 jmp CompleteBoxPruning+37Dh (0118379Dh)
01183736 mov edx,dword ptr [esi+8]
01183739 cmp ecx,edx
0118373B jne CompleteBoxPruning+370h (01183790h)
0118373D mov eax,edx
0118373F sub eax,ecx
01183741 sar eax,2
01183744 cmp eax,1
01183747 jae CompleteBoxPruning+370h (01183790h)
01183749 mov edi,dword ptr [esi]
0118374B sub ecx,edi
0118374D sar ecx,2
01183750 mov eax,3FFFFFFFh
01183755 sub eax,ecx
01183757 cmp eax,1
0118375A jb CompleteBoxPruning+4F9h (01183919h)
01183760 inc ecx
01183761 sub edx,edi
01183763 sar edx,2
01183766 mov dword ptr [esp+40h],ecx
0118376A mov ecx,edx
0118376C shr ecx,1
0118376E mov eax,3FFFFFFFh
01183773 sub eax,ecx
01183775 cmp eax,edx
01183777 jae CompleteBoxPruning+35Dh (0118377Dh)
01183779 xor edx,edx
0118377B jmp CompleteBoxPruning+35Fh (0118377Fh)
0118377D add edx,ecx
0118377F cmp edx,dword ptr [esp+40h]
01183783 mov ecx,esi
01183785 cmovb edx,dword ptr [esp+40h]
0118378A push edx
0118378B call std::vector<unsigned int,std::allocator<unsigned int> >::_Reallocate (011828F0h)
01183790 mov eax,dword ptr [esi+4]
01183793 test eax,eax
01183795 je CompleteBoxPruning+37Dh (0118379Dh)
01183797 mov ecx,dword ptr [esp+44h]
0118379B mov dword ptr [eax],ecx
0118379D add dword ptr [esi+4],4
011837A1 mov ecx,dword ptr [esi+4]
011837A4 mov eax,dword ptr [esp+14h]
011837A8 cmp eax,ecx
011837AA jae CompleteBoxPruning+405h (01183825h)
011837AC mov edi,dword ptr [esi]
011837AE cmp edi,eax
011837B0 ja CompleteBoxPruning+405h (01183825h)
011837B2 mov edx,dword ptr [esi+8]
011837B5 sub eax,edi
011837B7 sar eax,2
011837BA mov dword ptr [esp+3Ch],eax
011837BE cmp ecx,edx
011837C0 jne CompleteBoxPruning+3F3h (01183813h)
011837C2 mov eax,edx
011837C4 sub eax,ecx
011837C6 sar eax,2
011837C9 cmp eax,1
011837CC jae CompleteBoxPruning+3F3h (01183813h)
011837CE sub ecx,edi
011837D0 sar ecx,2
011837D3 mov eax,3FFFFFFFh
011837D8 sub eax,ecx
011837DA cmp eax,1
011837DD jb CompleteBoxPruning+4F9h (01183919h)
011837E3 inc ecx
011837E4 sub edx,edi
011837E6 sar edx,2
011837E9 mov dword ptr [esp+40h],ecx
011837ED mov ecx,edx
011837EF shr ecx,1
011837F1 mov eax,3FFFFFFFh
011837F6 sub eax,ecx
011837F8 cmp eax,edx
011837FA jae CompleteBoxPruning+3E0h (01183800h)
011837FC xor edx,edx
011837FE jmp CompleteBoxPruning+3E2h (01183802h)
01183800 add edx,ecx
01183802 cmp edx,dword ptr [esp+40h]
01183806 mov ecx,esi
01183808 cmovb edx,dword ptr [esp+40h]
0118380D push edx
0118380E call std::vector<unsigned int,std::allocator<unsigned int> >::_Reallocate (011828F0h)
01183813 mov ecx,dword ptr [esi+4]
01183816 test ecx,ecx
01183818 je CompleteBoxPruning+46Eh (0118388Eh)
0118381A mov eax,dword ptr [esi]
0118381C mov edx,dword ptr [esp+3Ch]
01183820 mov eax,dword ptr [eax+edx*4]
01183823 jmp CompleteBoxPruning+46Ch (0118388Ch)
01183825 mov edx,dword ptr [esi+8]
01183828 cmp ecx,edx
0118382A jne CompleteBoxPruning+463h (01183883h)
0118382C mov eax,edx
0118382E sub eax,ecx
01183830 sar eax,2
01183833 cmp eax,1
01183836 jae CompleteBoxPruning+45Fh (0118387Fh)
01183838 mov edi,dword ptr [esi]
0118383A sub ecx,edi
0118383C sar ecx,2
0118383F mov eax,3FFFFFFFh
01183844 sub eax,ecx
01183846 cmp eax,1
01183849 jb CompleteBoxPruning+4F9h (01183919h)
0118384F inc ecx
01183850 sub edx,edi
01183852 sar edx,2
01183855 mov dword ptr [esp+40h],ecx
01183859 mov ecx,edx
0118385B shr ecx,1
0118385D mov eax,3FFFFFFFh
01183862 sub eax,ecx
01183864 cmp eax,edx
01183866 jae CompleteBoxPruning+44Ch (0118386Ch)
01183868 xor edx,edx
0118386A jmp CompleteBoxPruning+44Eh (0118386Eh)
0118386C add edx,ecx
0118386E cmp edx,dword ptr [esp+40h]
01183872 mov ecx,esi
01183874 cmovb edx,dword ptr [esp+40h]
01183879 push edx
0118387A call std::vector<unsigned int,std::allocator<unsigned int> >::_Reallocate (011828F0h)
0118387F mov eax,dword ptr [esp+14h]
01183883 mov ecx,dword ptr [esi+4]
01183886 test ecx,ecx
01183888 je CompleteBoxPruning+46Eh (0118388Eh)
0118388A mov eax,dword ptr [eax]
0118388C mov dword ptr [ecx],eax
0118388E mov edx,dword ptr [esp+2Ch]
01183892 mov ecx,dword ptr [esp+14h]
01183896 add dword ptr [esi+4],4

That’s 180 lines of code. Instead of 27. It’s just ridiculous.

That’s all I have to say about std::vector.

———————

Now that this interlude is over, let’s go back to the sane version presented before.

The most commonly executed codepath (when an overlap is not found) has 13 instructions (it corresponds to the first block + the last block). And for 13 instructions we raised 6 potential issues. That’s a lot of questions for such a small piece of code.

Of course, we might be missing something. The compiler might know more than us. Our analysis might be too naive.

Perhaps.

Perhaps not.

How do you answer that?

As before: the scientific way. We observed. Now we need to come up with theories and tests to separate fact from fiction.

———————

Theory: the compiler is not recent enough. A more recent compiler would produce better code.

Yeah. I heard that one before… for every version of Visual Studio since VC6… But fair enough, it's easy to test. In fact, let's try both VC10 (the SSE code generation has changed quite a bit between VC10 and VC11), and then VC14. VC10 gives:

00BE30FA movaps xmm1,xmmword ptr [ebx]
00BE30FD movaps xmm2,xmmword ptr [esp+40h]
00BE3102 cmpltps xmm2,xmm1
00BE3106 movmskps eax,xmm2
00BE3109 cmp eax,0Ch
00BE310C jne CompleteBoxPruning+3FEh (0BE315Eh)
00BE310E mov ecx,dword ptr [esi+4]
00BE3111 cmp ecx,dword ptr [esi]
00BE3113 jne CompleteBoxPruning+3BEh (0BE311Eh)
00BE3115 push 1
00BE3117 mov ecx,esi
00BE3119 call IceCore::Container::Resize (0BE1240h)
00BE311E mov edx,dword ptr [esi+4]
00BE3121 mov eax,dword ptr [esi+8]
00BE3124 mov ecx,dword ptr [esp+30h]
00BE3128 mov dword ptr [eax+edx*4],ecx
00BE312B inc dword ptr [esi+4]
00BE312E mov eax,dword ptr [esi+4]
00BE3131 mov edx,dword ptr [edi]
00BE3133 mov dword ptr [esp+34h],edx
00BE3137 cmp eax,dword ptr [esi]
00BE3139 jne CompleteBoxPruning+3E4h (0BE3144h)
00BE313B push 1
00BE313D mov ecx,esi
00BE313F call IceCore::Container::Resize (0BE1240h)
00BE3144 mov eax,dword ptr [esi+4]
00BE3147 mov ecx,dword ptr [esi+8]
00BE314A mov edx,dword ptr [esp+34h]
00BE314E movss xmm0,dword ptr [esp+38h]
00BE3154 mov dword ptr [ecx+eax*4],edx
00BE3157 inc dword ptr [esi+4]
00BE315A mov ecx,dword ptr [esp+3Ch]
00BE315E mov eax,dword ptr [esp+28h]
00BE3162 add eax,8
00BE3165 add ebx,10h
00BE3168 add edi,4
00BE316B comiss xmm0,dword ptr [eax]
00BE316E mov dword ptr [esp+28h],eax
00BE3172 jae CompleteBoxPruning+39Ah (0BE30FAh)

There are still 13 instructions overall in the main codepath, with a few twists.

It seems to use three xmm registers instead of two, which looks better.

———————

This is the same as in VC11, but using two instructions instead of one:

00BE30FA movaps xmm1,xmmword ptr [ebx]

00BE3102 cmpltps xmm2,xmm1

xmm2 is still destroyed all the time, and reloaded from the stack:

00BE30FD movaps xmm2,xmmword ptr [esp+40h]

The only difference is that it’s reloaded in our first block, rather than in the third block before. But it’s really all the same otherwise - even the stack offset is the same.

Questions 1 and 2: same.

———————

VC10 seems to manage “MaxLimit” better. It’s kept in xmm0 and only reloaded when an overlap occurs:

00BE314E movss xmm0,dword ptr [esp+38h]

Well done.

Question 3: VC10 wins.

———————

In the third block, what used to be ecx is now eax:

00BE3162 add eax,8

00BE316B comiss xmm0,dword ptr [eax]

00BE316E mov dword ptr [esp+28h],eax

And it’s saved to the stack all the time, like in VC11.

As for edi (previously edx), it is not saved to the stack anymore. Yay!

But eax (previously ecx) is now reloaded from the stack all the time:

00BE315E mov eax,dword ptr [esp+28h]

…while it wasn’t before. Oh.

Question 4: VC11 wins.

Questions 5 and 6: the same overall.

———————

So from just looking at this short snippet it is unclear which compiler does a better job. But then if you look at the code before that inner loop, you find this:

00BE2F1B fstp dword ptr [eax-58h]
00BE2F1E fld dword ptr [edx+8]
00BE2F21 fstp dword ptr [eax-54h]
00BE2F24 fld dword ptr [edx+10h]
00BE2F27 fstp dword ptr [eax-50h]
00BE2F2A fld dword ptr [edx+14h]
00BE2F2D fstp dword ptr [eax-4Ch]
00BE2F30 mov edx,dword ptr [esi-14h]
00BE2F33 mov dword ptr [edi-14h],edx
00BE2F36 lea edx,[edx+edx*2]
00BE2F39 fld dword ptr [ebx+edx*8]
00BE2F3C lea edx,[ebx+edx*8]
00BE2F3F fstp dword ptr [ecx-2Ch]
00BE2F42 fld dword ptr [edx+0Ch]
00BE2F45 fstp dword ptr [ecx-28h]
00BE2F48 fld dword ptr [edx+4]
00BE2F4B fstp dword ptr [eax-48h]
00BE2F4E fld dword ptr [edx+8]
00BE2F51 fstp dword ptr [eax-44h]
00BE2F54 fld dword ptr [edx+10h]
00BE2F57 fstp dword ptr [eax-40h]
00BE2F5A fld dword ptr [edx+14h]
00BE2F5D fstp dword ptr [eax-3Ch]
00BE2F60 mov edx,dword ptr [esi-10h]
00BE2F63 mov dword ptr [edi-10h],edx
00BE2F66 lea edx,[edx+edx*2]
00BE2F69 fld dword ptr [ebx+edx*8]
00BE2F6C lea edx,[ebx+edx*8]
00BE2F6F fstp dword ptr [ecx-24h]
00BE2F72 fld dword ptr [edx+0Ch]
00BE2F75 fstp dword ptr [ecx-20h]
00BE2F78 fld dword ptr [edx+4]
00BE2F7B fstp dword ptr [eax-38h]
00BE2F7E fld dword ptr [edx+8]
00BE2F81 fstp dword ptr [eax-34h]
00BE2F84 fld dword ptr [edx+10h]
00BE2F87 fstp dword ptr [eax-30h]
00BE2F8A fld dword ptr [edx+14h]
00BE2F8D fstp dword ptr [eax-2Ch]
00BE2F90 mov edx,dword ptr [esi-0Ch]
00BE2F93 mov dword ptr [edi-0Ch],edx
00BE2F96 lea edx,[edx+edx*2]
00BE2F99 fld dword ptr [ebx+edx*8]
00BE2F9C lea edx,[ebx+edx*8]
00BE2F9F fstp dword ptr [ecx-1Ch]
00BE2FA2 fld dword ptr [edx+0Ch]
00BE2FA5 fstp dword ptr [ecx-18h]
00BE2FA8 fld dword ptr [edx+4]
00BE2FAB fstp dword ptr [eax-28h]
00BE2FAE fld dword ptr [edx+8]
00BE2FB1 fstp dword ptr [eax-24h]
00BE2FB4 fld dword ptr [edx+10h]
00BE2FB7 fstp dword ptr [eax-20h]
00BE2FBA fld dword ptr [edx+14h]
00BE2FBD fstp dword ptr [eax-1Ch]

Yikes!!

Ok: VC10 still used x87 instructions here and there. VC11 did not.

Forget that inner loop analysis: advantage VC11, big time. The benchmark results confirm this, regardless of the instruction count:

Office PC / unsafe version:

Complete test (brute force): found 11811 intersections in 891760 K-cycles.
19379 K-cycles.
18483 K-cycles.
18943 K-cycles.
18093 K-cycles.
18321 K-cycles.
18417 K-cycles.
18078 K-cycles.
19035 K-cycles.
18500 K-cycles.
18092 K-cycles.
18329 K-cycles.
18092 K-cycles.
17899 K-cycles.
18524 K-cycles.
18085 K-cycles.
19308 K-cycles.
Complete test (box pruning): found 11715 intersections in 17899 K-cycles.

=> Much slower than what we got with VC11.

———————

So, VC14 then? Let’s have a look:

00F32F33 cmpltps xmm1,xmmword ptr [edi]
00F32F37 movmskps eax,xmm1
00F32F3A cmp eax,0Ch
00F32F3D jne CompleteBoxPruning+2D1h (0F32F91h)
00F32F3F mov eax,dword ptr [esi+4]
00F32F42 cmp eax,dword ptr [esi]
00F32F44 jne CompleteBoxPruning+28Fh (0F32F4Fh)
00F32F46 push 1
00F32F48 mov ecx,esi
00F32F4A call IceCore::Container::Resize (0F312D0h)
00F32F4F mov ecx,dword ptr [esi+4]
00F32F52 mov eax,dword ptr [esi+8]
00F32F55 mov edx,dword ptr [esp+34h]
00F32F59 mov dword ptr [eax+ecx*4],edx
00F32F5C mov edx,dword ptr [esp+20h]
00F32F60 inc dword ptr [esi+4]
00F32F63 mov eax,dword ptr [esi+4]
00F32F66 mov ecx,dword ptr [edx]
00F32F68 mov dword ptr [esp+30h],ecx
00F32F6C cmp eax,dword ptr [esi]
00F32F6E jne CompleteBoxPruning+2B9h (0F32F79h)
00F32F70 push 1
00F32F72 mov ecx,esi
00F32F74 call IceCore::Container::Resize (0F312D0h)
00F32F79 mov ecx,dword ptr [esi+4]
00F32F7C mov eax,dword ptr [esi+8]
00F32F7F mov edx,dword ptr [esp+30h]
00F32F83 mov dword ptr [eax+ecx*4],edx
00F32F86 inc dword ptr [esi+4]
00F32F89 mov ecx,dword ptr [esp+28h]
00F32F8D mov edx,dword ptr [esp+20h]
00F32F91 movss xmm0,dword ptr [esp+38h]
00F32F97 add ecx,8
00F32F9A movaps xmm1,xmmword ptr [esp+40h]
00F32F9F add edx,4
00F32FA2 add edi,10h
00F32FA5 mov dword ptr [esp+20h],edx
00F32FA9 mov dword ptr [esp+28h],ecx
00F32FAD comiss xmm0,dword ptr [ecx]
00F32FB0 jae CompleteBoxPruning+273h (0F32F33h)

This one will be quick: some instructions have been re-ordered in the third block, but other than that it’s exactly the same code as VC11’s. And pretty much exactly the same performance.

So, nope, a newer compiler doesn’t solve the issues.

But are they really issues? There's more to performance than the instruction count, as we saw before. Are we chasing a red herring?

To find our answers, we must go closer to the metal. And that’s what we will do next time.

———————

What we learnt:

Different compilers will give different results.

Compilers are getting better all the time, but they’re still not perfect.

std::vector sucks.

GitHub code for part 11

Box pruning revisited - part 10 - integer SIMD

Tuesday, February 14th, 2017

Part 10 – integer SIMD

Interlude:

Similar to what we did in version 7, and for the sake of completeness, I tried using integer comparisons in the SIMD version as well.

The changes are straightforward: encode floats as integers like we did in version 7, replace SIMD floating-point intrinsics with SIMD integer intrinsics.
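For reference, a minimal sketch of the idea - an assumption of how it can be done, not the exact code from the repository - is shown below: the bounds are first converted with an order-preserving float-to-integer encoding, and the float compare / movemask pair is replaced with its SSE2 integer counterpart.

#include <emmintrin.h>  // SSE2
#include <cstring>

// One common order-preserving float -> signed int encoding (assumed; the
// exact version-7 encoding is not reproduced in this post). Positive floats
// keep their bit pattern, negative floats get their magnitude bits flipped
// so that integer order matches float order.
static inline int EncodeFloat(float f)
{
    int i;
    std::memcpy(&i, &f, sizeof(i));
    return i>=0 ? i : i ^ 0x7FFFFFFF;
}

// Integer counterpart of _mm_cmpgt_ps + _mm_movemask_ps, on pre-encoded bounds.
static inline int OverlapMask(const __m128i& box0, const __m128i& box1)
{
    return _mm_movemask_ps(_mm_castsi128_ps(_mm_cmpgt_epi32(box0, box1)));
}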

There are no special traps here, and not much to report, because at the end of the day the results are slower:

Home PC:

Complete test (brute force): found 11811 intersections in 781780 K-cycles.
17454 K-cycles.
16681 K-cycles.
16669 K-cycles.
17145 K-cycles.
16703 K-cycles.
16683 K-cycles.
16668 K-cycles.
16667 K-cycles.
16862 K-cycles.
16707 K-cycles.
16670 K-cycles.
16689 K-cycles.
16668 K-cycles.
16949 K-cycles.
16710 K-cycles.
16667 K-cycles.
Complete test (box pruning): found 11715 intersections in 16667 K-cycles.

Office PC:

Complete test (brute force): found 11811 intersections in 810906 K-cycles.
16607 K-cycles.
15927 K-cycles.
15648 K-cycles.
15971 K-cycles.
15648 K-cycles.
15960 K-cycles.
15742 K-cycles.
15990 K-cycles.
15837 K-cycles.
15741 K-cycles.
15970 K-cycles.
15651 K-cycles.
16247 K-cycles.
15649 K-cycles.
15834 K-cycles.
15738 K-cycles.
Complete test (box pruning): found 11715 intersections in 15648 K-cycles.

The gains are summarized here:

Home PC            | Timings (K-Cycles) | Delta (K-Cycles) | Speedup | Overall X factor
(Version1)         | (101662)           |                  |         |
Version2 - base    | 98822              | 0                | 0%      | 1.0
Version3           | 93138              | ~5600            | ~5%     | ~1.06
Version4           | 81834              | ~11000           | ~12%    | ~1.20
Version5           | 78140              | ~3600            | ~4%     | ~1.26
Version6a          | 60579              | ~17000           | ~22%    | ~1.63
Version6b          | 41605              | ~18000           | ~31%    | ~2.37
(Version7)         | (40906)            | -                | -       | -
(Version8)         | (31383)            | (~10000)         | (~24%)  | (~3.14)
Version9a          | 34486              | ~7100            | ~17%    | ~2.86
Version9b - unsafe | 32477              | ~2000            | ~5%     | ~3.04
Version9b - safe   | 32565              | ~1900            | ~5%     | ~3.03
Version9c - unsafe | 16223              | ~16000           | ~50%    | ~6.09
Version9c - safe   | 14802              | ~17000           | ~54%    | ~6.67
Version10          | (16667)            | -                | -       | -

Office PC          | Timings (K-Cycles) | Delta (K-Cycles) | Speedup | Overall X factor
(Version1)         | (96203)            |                  |         |
Version2 - base    | 92885              | 0                | 0%      | 1.0
Version3           | 88352              | ~4500            | ~5%     | ~1.05
Version4           | 77156              | ~11000           | ~12%    | ~1.20
Version5           | 73778              | ~3300            | ~4%     | ~1.25
Version6a          | 58451              | ~15000           | ~20%    | ~1.58
Version6b          | 45634              | ~12000           | ~21%    | ~2.03
(Version7)         | (43987)            | -                | -       | -
(Version8)         | (29083)            | (~16000)         | (~36%)  | (~3.19)
Version9a          | 31864              | ~13000           | ~30%    | ~2.91
Version9b - unsafe | 15097              | ~16000           | ~52%    | ~6.15
Version9b - safe   | 15116              | ~16000           | ~52%    | ~6.14
Version9c - unsafe | 12707              | ~2300            | ~15%    | ~7.30
Version9c - safe   | 12562              | ~2500            | ~16%    | ~7.39
Version10          | (15648)            | -                | -       | -

What we learnt:

Don’t bother with integer comparisons anymore.

We are still using integer SIMD in PhysX for this, so it appears that the PhysX version is sub-optimal. Expect some performance gains in PhysX 3.5.

GitHub code for part 10

Box pruning revisited - part 9c - data alignment

Tuesday, February 14th, 2017

Part 9c – data alignment

Last time we were left with a bit of a mystery. Our optimizations produced consistent results so far, on both machines the code was benchmarked on.

However things changed when switching to SIMD.

We have a naive SIMD implementation (9a) producing these results:

Home PC: 34486 K-cycles.

Office PC: 31864 K-cycles.

And an optimized SIMD implementation (9b) producing these results (for the “safe” version, but timings are similar for the unsafe one):

Home PC: 32565 K-cycles.

Office PC: 15116 K-cycles.

In other words, the change from naive to optimized SIMD had almost no effect on one machine, but it made things run twice as fast on the other machine.

By now you should know what to do next: check the disassembly. Recall that the naive version looked like this:

The optimized version (9b) however looks much better:

As expected the scalar loads (movss) and unpacking instructions (unpcklps) all vanished, replaced with a single “proper” SIMD load (movups). In both cases the last 4 instructions are similar, and implement the comparison / test / branch that didn’t change between the naive & optimized functions.

In other words, since the timings are similar, it looks like on the home PC the single movups load is as costly as the large block of 14 movss/unpcklps from the naive version.

I suppose that's a lesson in itself: not all instructions have the same cost. The number of instructions is not a reliable way to evaluate performance.

At this point you could fire up a profiler like V-Tune which would probably reveal what’s going on (I didn’t try), or you could just observe, formulate a theory, and design an experiment to test the theory. You know, the scientific method.

Observe:

movups xmm0,xmmword ptr [eax+8]

There’s only one line to observe really so it doesn’t take a rocket scientist to spot the only potential issue here: that u is for unaligned, and it should ideally be an a, for aligned.

That is, we should have a movaps instead of a movups here. In intrinsics-speak, we’ve been using _mm_loadu_ps while we should ideally have used _mm_load_ps.

Could unaligned loads have such an impact? Well that’s our theory now. Let’s test it.

We used unaligned loads because we had no choice: the size of our AABB classes was 24 bytes (6 floats). To be able to use aligned loads, we’d need a multiple of 16 bytes.

So we could add some padding bytes to reach sizeof(AABB)==32, but that would consume more memory, which also means more cache misses.

Or we could just split our AABB in two separate classes, like this:
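A sketch consistent with the sizes used in the rest of the series (sizeof(SIMD_AABB_X)==8, sizeof(SIMD_AABB_YZ)==16) - the member names are assumptions:

struct SIMD_AABB_X   // 8 bytes: the sorting axis
{
    float mMinX;
    float mMaxX;
};

struct SIMD_AABB_YZ  // 16 bytes: exactly one SIMD register
{
    float mMinY;
    float mMinZ;
    float mMaxY;
    float mMaxZ;
};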

We would then have to allocate two internal arrays instead of one to store the bounds:
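Again as a sketch (the variable names are assumptions; _aligned_malloc is the allocation discussed further down):

// Two internal arrays instead of one; the YZ buffer must be 16-byte aligned
// for aligned SIMD loads to be legal.
SIMD_AABB_X* BoxListX = (SIMD_AABB_X*)_aligned_malloc(sizeof(SIMD_AABB_X)*nb, 16);
SIMD_AABB_YZ* BoxListYZ = (SIMD_AABB_YZ*)_aligned_malloc(sizeof(SIMD_AABB_YZ)*nb, 16);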

On one hand we would then have to compute two addresses within our inner loop instead of one, to load data from two different memory addresses instead of one before:
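Something along these lines within the loop (a sketch, names assumed):

const SIMD_AABB_X& Box1X = BoxListX[Index1];    // address #1
const SIMD_AABB_YZ& Box1YZ = BoxListYZ[Index1]; // address #2, 16-byte aligned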

That’s likely to be slower. On the other hand we would then be able to use aligned loads for the YZ part, and that should be a win.

In some cases the quickest way to know is simply to try. The disassembly becomes:

No movaps!

But the compiler was able to merge the load directly into the compare instruction, which amounts to the same thing - it is still an aligned load.

And the results apparently confirm our theory:

Home PC:

Unsafe version:

Complete test (brute force): found 11811 intersections in 781500 K-cycles.
17426 K-cycles.
16272 K-cycles.
16223 K-cycles.
16239 K-cycles.
16224 K-cycles.
16226 K-cycles.
16289 K-cycles.
16248 K-cycles.
16448 K-cycles.
16240 K-cycles.
16224 K-cycles.
16398 K-cycles.
16251 K-cycles.
16485 K-cycles.
16239 K-cycles.
16225 K-cycles.
Complete test (box pruning): found 11715 intersections in 16223 K-cycles.

Safe version:

Complete test (brute force): found 11811 intersections in 781569 K-cycles.
15629 K-cycles.
14813 K-cycles.
14841 K-cycles.
14827 K-cycles.
14803 K-cycles.
14933 K-cycles.
14821 K-cycles.
14803 K-cycles.
15048 K-cycles.
14833 K-cycles.
14803 K-cycles.
14817 K-cycles.
14802 K-cycles.
14802 K-cycles.
14802 K-cycles.
14998 K-cycles.
Complete test (box pruning): found 11811 intersections in 14802 K-cycles.

Office PC:

Unsafe version:

Complete test (brute force): found 11811 intersections in 826531 K-cycles.
13634 K-cycles.
12709 K-cycles.
12707 K-cycles.
13063 K-cycles.
12858 K-cycles.
13824 K-cycles.
13357 K-cycles.
13502 K-cycles.
12863 K-cycles.
13041 K-cycles.
12711 K-cycles.
12932 K-cycles.
12711 K-cycles.
12709 K-cycles.
12928 K-cycles.
12851 K-cycles.
Complete test (box pruning): found 11715 intersections in 12707 K-cycles.

Safe version:

Complete test (brute force): found 11811 intersections in 813914 K-cycles.
13582 K-cycles.
12562 K-cycles.
12825 K-cycles.
13048 K-cycles.
12806 K-cycles.
12562 K-cycles.
12915 K-cycles.
12627 K-cycles.
13216 K-cycles.
12899 K-cycles.
13546 K-cycles.
13163 K-cycles.
12572 K-cycles.
12769 K-cycles.
12662 K-cycles.
12938 K-cycles.
Complete test (box pruning): found 11811 intersections in 12562 K-cycles.

The gains are summarized here:

Home PC            | Timings (K-Cycles) | Delta (K-Cycles) | Speedup | Overall X factor
(Version1)         | (101662)           |                  |         |
Version2 - base    | 98822              | 0                | 0%      | 1.0
Version3           | 93138              | ~5600            | ~5%     | ~1.06
Version4           | 81834              | ~11000           | ~12%    | ~1.20
Version5           | 78140              | ~3600            | ~4%     | ~1.26
Version6a          | 60579              | ~17000           | ~22%    | ~1.63
Version6b          | 41605              | ~18000           | ~31%    | ~2.37
(Version7)         | (40906)            | -                | -       | -
(Version8)         | (31383)            | (~10000)         | (~24%)  | (~3.14)
Version9a          | 34486              | ~7100            | ~17%    | ~2.86
Version9b - unsafe | 32477              | ~2000            | ~5%     | ~3.04
Version9b - safe   | 32565              | ~1900            | ~5%     | ~3.03
Version9c - unsafe | 16223              | ~16000           | ~50%    | ~6.09
Version9c - safe   | 14802              | ~17000           | ~54%    | ~6.67

Office PC          | Timings (K-Cycles) | Delta (K-Cycles) | Speedup | Overall X factor
(Version1)         | (96203)            |                  |         |
Version2 - base    | 92885              | 0                | 0%      | 1.0
Version3           | 88352              | ~4500            | ~5%     | ~1.05
Version4           | 77156              | ~11000           | ~12%    | ~1.20
Version5           | 73778              | ~3300            | ~4%     | ~1.25
Version6a          | 58451              | ~15000           | ~20%    | ~1.58
Version6b          | 45634              | ~12000           | ~21%    | ~2.03
(Version7)         | (43987)            | -                | -       | -
(Version8)         | (29083)            | (~16000)         | (~36%)  | (~3.19)
Version9a          | 31864              | ~13000           | ~30%    | ~2.91
Version9b - unsafe | 15097              | ~16000           | ~52%    | ~6.15
Version9b - safe   | 15116              | ~16000           | ~52%    | ~6.14
Version9c - unsafe | 12707              | ~2300            | ~15%    | ~7.30
Version9c - safe   | 12562              | ~2500            | ~16%    | ~7.39

So the code on the home PC became about twice as fast as before, which makes it roughly as fast as what we got on the office PC for the previous version.

On the office PC, we only saved about 2500 K-Cycles, which means the aligned loads are indeed faster but not that much faster.

In any case, the important point is that our SIMD version is finally always faster than our best scalar version so far (version 8). Pheeeew!

Mystery solved?

Maybe, maybe not.

One question that pops up now is: why is the “safe” version measurably faster than the “unsafe” version on the home PC (14802 vs 16223), even though the safe version does a little bit more work? What’s going on here?

I was ready to move on but that stuff caught my eye. I couldn't explain it so I stayed a bit longer. Observed. And designed a new test to confirm the theory that the mystery had indeed been solved.

If the unaligned loads were indeed responsible for making the code 2X faster, then switching back to movups in version 9c (while keeping the separate classes) should also bring back the performance of version 9b – on the home PC in particular.

But. It. Did. Not.

On the home PC, changing the _mm_load_ps to _mm_loadu_ps (while keeping the separate classes, i.e. while keeping the data buffers 16-byte aligned) gave these results:

Safe version:

Complete test (box pruning): found 11811 intersections in 16513 K-cycles.

Unsafe version:

Complete test (box pruning): found 11715 intersections in 19180 K-cycles.

i.e. the cost of unaligned loads for the safe version was 16513 - 14802 = 1711 K-Cycles.

And the cost of unaligned loads for the unsafe version was 19180 - 16223 = 2957 K-Cycles.

That last number is pretty similar to the cost of unaligned loads on the office PC (as we mentioned, ~2500 K-Cycles). So that number is pretty consistent, and seems to indicate the real cost of movups vs movaps: roughly 2000 to 3000 K-Cycles, on both machines.

But then…

why did we go from ~32000 to ~16000 K-Cycles between versions 9b and 9c on the home PC? Where is that coming from, if not from movups?

The mystery was not solved, after all.

New questions, new theories, new tests. The only other change we made was the switch from "new" to "_aligned_malloc". The SIMD_AABB_YZ class has a force-inlined empty ctor, so this couldn't be a case of calling a slow compiler-generated ctor a large number of times (these things happen). Just to test the theory that it had something to do with this change though, I switched from "new" to "_aligned_malloc" in version 9b (without splitting the classes).

And lo and behold, this change alone made version 9b go from ~32000 K-Cycles to ~18000 K-Cycles at home (for both safe & unsafe versions).

So that’s where the big performance drop was coming from!

After this line was identified, figuring out the “why” was a walk in the park. In short: the performance of movups depends on the alignment of the returned address.

In version 9b, on the home PC:

  • If the bounds array is aligned on a 4-byte boundary, the function takes ~33000 K-Cycles.
  • If the bounds array is aligned on an 8-byte boundary, the function takes ~19000 K-Cycles.
  • If the bounds array is aligned on a 16-byte boundary, the function takes ~18000 K-Cycles.

So there is a huge hit for the 4-byte boundary case, and we were previously unknowingly hitting it. Switching to _aligned_malloc, or using __declspec(align(N)) on the SIMD_AABB class, fixes the issue in version 9b.

Let’s recap for the home PC:

Instruction | Alignment             | Version                             | Timings
movups      | data 4-bytes aligned  | single class (version9b)            | 33000
movups      | data 8-bytes aligned  | single class (version9b)            | 19000
movups      | data 16-bytes aligned | single class (version9b)            | 18000
movups      | data 16-bytes aligned | separate classes (version9c safe)   | 16000
movups      | data 16-bytes aligned | separate classes (version9c unsafe) | 19000
movaps      | data 16-bytes aligned | separate classes (version9c safe)   | 14000
movaps      | data 16-bytes aligned | separate classes (version9c unsafe) | 16000

What this all means is that the initial theory was actually correct (the performance cost was indeed because of unaligned loads), but there was a subtlety in there: there’s the cost of the movups instruction itself compared to movaps (which is constant at ~2000 / 3000 K-Cycles) and the cost of the data-loading with movups (which is variable and sometimes considerably more expensive).

We also see that version 9b 16-bytes aligned can be faster than version 9c using movups (18000 vs 19000). That is probably because as we mentioned, version 9c needs to compute two addresses now.

And then finally, with movaps, the “safe” version 9c is still faster than the “unsafe” version 9c.

Why?

I still don’t know.

I would have to try V-Tune at that point. For now, I will be happy to admit that I don’t have an answer, and I don’t see what test I could add to find one.

In any case the performance difference is not large, and since the fastest version is also the safest, there is little motivation to go to the bottom of it.

But the main reason for not bothering is that, spoiler: that difference will vanish in the next post anyway.

What we learnt:

SIMD is hard. It took 3 versions for it to become faster than our scalar code.

The cost of unaligned loads can vary greatly depending on data alignment.

Data alignment has an impact on performance even with unaligned loads.

GitHub code for part 9c

Box pruning revisited - part 9b - better SIMD

Tuesday, February 14th, 2017

Part 9b – better SIMD

Last time we wrote an initial SIMD version that ended up slower than the scalar code. We then identified the _mm_set_ps intrinsics as a potential source of issues.

So today we want to replace _mm_set_ps with “real” SIMD loads: _mm_load_ps (for aligned loads) or _mm_loadu_ps (for unaligned loads). These are the intrinsics that each map to a single assembly instruction (respectively movaps and movups). They load 4 contiguous floats into a SIMD register, so we will need to reorder the data. If we go back to the initial C++ code and rearrange it so that all elements of the same box are on the same side, we get:
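The rearranged test is shown as an image in the original post; reconstructed from the numbered conditions (1) to (4) referenced below, it amounts to this sketch:

// (a) is one box, (b) the other; reject when any of the four conditions holds.
if(a.mMinY > b.mMaxY    // (1)
|| a.mMaxY < b.mMinY    // (2)
|| a.mMinZ > b.mMaxZ    // (3)
|| a.mMaxZ < b.mMinZ)   // (4)
    return FALSE;       // no overlap on YZ
return TRUE;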

i.e. the a’s are all on the left side, and the b’s are all on the right side.

This is going to be tricky since we have a mix of "greater than" and "less than" comparisons in the middle. Let's ignore it for now and focus on the loads.

For a given box, we want to load the y’s and z’s (both min and max) with a single load. Unfortunately the AABB class so far looks like this:

That is, switching back to plain floats:
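In plain-float form (a sketch of the same layout the post shows as an image):

struct AABB
{
    float mMinX, mMinY, mMinZ;
    float mMaxX, mMaxY, mMaxZ;
};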

With this format we cannot load the y’s and z’s at the same time, since they’re interleaved with the x’s. So, we change the format, and introduce this new AABB class:
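A sketch of the new layout described here - x's first, then the four y/z values stored contiguously:

struct AABB  // SIMD-friendly ordering
{
    float mMinX, mMaxX;                // the sorting axis
    float mMinY, mMinZ, mMaxY, mMaxZ;  // 4 contiguous floats: one SIMD load
};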

It’s the same data as before, we just put the x’s first. That way the 4 floats we are interested in are stored contiguously in memory, and we can load them to an SSE register with a single assembly instruction. Which gives us this in-register layout, for two boxes:

__m128 box0 (a) | mMinY | mMinZ | mMaxY | mMaxZ
__m128 box1 (b) | mMinY | mMinZ | mMaxY | mMaxZ

Go back to the C++ code and you’ll see that we need to compare minimums of (a) to maximums of (b). So we need to reorder the elements for one of the two boxes.

Since one of the boxes (say box0) is constant for the duration of the loop, it makes more sense to reorder the constant box (only once), instead of doing the reordering inside the loop. This is done with a single shuffle instruction (_mm_shuffle_ps), and we end up with:

__m128 box0 (a) | mMaxY | mMaxZ | mMinY | mMinZ
__m128 box1 (b) | mMinY | mMinZ | mMaxY | mMaxZ

Just to see where we stand, let’s do a SIMD comparison (_mm_cmpgt_ps) between (a) and (b). This computes:

a.mMaxY > b.mMinY

a.mMaxZ > b.mMinZ

a.mMinY > b.mMaxY

a.mMinZ > b.mMaxZ

The two last lines are actually directly what we need:

a.mMinY > b.mMaxY => that’s (1)

a.mMinZ > b.mMaxZ => that’s (3)

The two other lines compute almost what we want, but not quite.

a.mMaxY > b.mMinY => that’s the opposite of (2)

a.mMaxZ > b.mMinZ => that’s the opposite of (4)

Well the opposite is not bad, it also gives us the information we need, just by flipping the result. The flipping is virtual and free: we are not going to actually do it. It just means that we test the movemask result against a different expected value. That is, if the previous expected value was, say, 1111b in binary, we can just flip two of the bits for which we get the opposite results, and test against, say 1100b instead. It gives us the same information, and the proper overlap test results…

…except in the equality case.

If a.mMaxY==b.mMinY for example:

  • the C++ code tests “a.mMaxY < b.mMinY” and returns 0, which is the desired reference result.
  • the SIMD code tests “a.mMaxY > b.mMinY” and also returns 0, which gives us a 1 after flipping.

Replacing the "greater than" comparison with "greater or equal" wouldn't work, because what we would really need is a SIMD comparison that performs the former on two of the components, and the latter on the two others. So, the way it stands, the SIMD code is equivalent to this:
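As a sketch, here is what the SIMD path effectively computes - the only difference with the reference test is that (2) and (4) now also reject the exact-equality case:

if(a.mMinY > b.mMaxY     // (1)
|| a.mMaxY <= b.mMinY    // (2): "<" became "<="
|| a.mMinZ > b.mMaxZ     // (3)
|| a.mMaxZ <= b.mMinZ)   // (4): "<" became "<="
    return FALSE;
return TRUE;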

Almost what we wanted!

And in fact, we could very well accept the differences and stop here.

After all, the equality case is often a bit fuzzy. Don’t forget that we are dealing with floats here. It is bad form to use exact equality when comparing floats. When using the box pruning function for the broadphase in a physics engine, the equality case is meaningless since the bounding boxes are constantly evolving, recomputed each frame, and there is always a bit of noise in the results due to the nature of floating point arithmetic. The results will depend on how the compiler reorganized the instructions, whether x87 or SIMD is used, it will depend on the internal FPU accuracy of x87 registers, i.e. the state of the FPU control word, and so on. Basically the case where the resulting bounding boxes are exactly touching is so ill-defined that it really doesn’t make any difference if the SIMD code uses “>” or “>=” on some of the components.

If we accept this limitation, we can run the test again and see the results:

Home PC:

Complete test (brute force): found 11811 intersections in 781912 K-cycles.
33554 K-cycles.
32514 K-cycles.
32803 K-cycles.
32498 K-cycles.
32486 K-cycles.
32487 K-cycles.
32499 K-cycles.
32488 K-cycles.
32669 K-cycles.
32672 K-cycles.
32855 K-cycles.
32537 K-cycles.
32504 K-cycles.
32477 K-cycles.
32674 K-cycles.
32505 K-cycles.
Complete test (box pruning): found 11725 intersections in 32477 K-cycles.

Office PC:

Complete test (brute force): found 11811 intersections in 808659 K-cycles.
15775 K-cycles.
15266 K-cycles.
15097 K-cycles.
15302 K-cycles.
15259 K-cycles.
15554 K-cycles.
15560 K-cycles.
16137 K-cycles.
15709 K-cycles.
15109 K-cycles.
15317 K-cycles.
15149 K-cycles.
15238 K-cycles.
15239 K-cycles.
15525 K-cycles.
15097 K-cycles.
Complete test (box pruning): found 11725 intersections in 15097 K-cycles.

The gains are summarized here:

Home PC            | Timings (K-Cycles) | Delta (K-Cycles) | Speedup | Overall X factor
(Version1)         | (101662)           |                  |         |
Version2 - base    | 98822              | 0                | 0%      | 1.0
Version3           | 93138              | ~5600            | ~5%     | ~1.06
Version4           | 81834              | ~11000           | ~12%    | ~1.20
Version5           | 78140              | ~3600            | ~4%     | ~1.26
Version6a          | 60579              | ~17000           | ~22%    | ~1.63
Version6b          | 41605              | ~18000           | ~31%    | ~2.37
(Version7)         | (40906)            | -                | -       | -
(Version8)         | (31383)            | (~10000)         | (~24%)  | (~3.14)
Version9a          | 34486              | ~7100            | ~17%    | ~2.86
Version9b          | 32477              | ~2000            | ~5%     | ~3.04

Office PC          | Timings (K-Cycles) | Delta (K-Cycles) | Speedup | Overall X factor
(Version1)         | (96203)            |                  |         |
Version2 - base    | 92885              | 0                | 0%      | 1.0
Version3           | 88352              | ~4500            | ~5%     | ~1.05
Version4           | 77156              | ~11000           | ~12%    | ~1.20
Version5           | 73778              | ~3300            | ~4%     | ~1.25
Version6a          | 58451              | ~15000           | ~20%    | ~1.58
Version6b          | 45634              | ~12000           | ~21%    | ~2.03
(Version7)         | (43987)            | -                | -       | -
(Version8)         | (29083)            | (~16000)         | (~36%)  | (~3.19)
Version9a          | 31864              | ~13000           | ~30%    | ~2.91
Version9b          | 15097              | ~16000           | ~52%    | ~6.15

WO-AH.

Ok, pause, that’s too much weirdness at the same time.

First, the reported number of intersections is wrong now (11725 vs 11811). That code is broken!

Second, the code is not much faster than the naive version 9a at home. Why?!

Third, the code is now 2X faster than the naive version on the office PC. What’s going on?!

Well, that’s SIMD for you. Things can become a little bonkers.

Let’s talk about the number of overlapping pairs first. It should be expected that the numbers are going to be different between the brute-force and optimized versions now. The brute-force version still uses this:

While the optimized version effectively uses this:

It doesn’t compute the same thing, as we previously discussed, so, obviously, the reported number of overlapping pairs can be different. It simply means that some of the boxes in the set are exactly touching.

Beyond this, there is however a much more subtle reason at play here. If you change the AABB::Intersect() function so that it handles the equality case like the SIMD code, you still don’t get the same number of pairs. That’s because the SIMD code has another issue we didn’t mention so far: it is not symmetric anymore. That is, Intersect(A, B) != Intersect(B, A) in the equality case. Imagine two boxes for which a.mMaxY == b.mMinY. Since the code tests “a.mMaxY <= b.mMinY”, we get a positive result here. Now swap a and b. We now have b.mMaxY == a.mMinY, but the code tests “a.mMinY > b.mMaxY”, which returns a negative result. What this means is that the results will depend on the order in which the boxes are tested by the functions.

For the brute-force function we always use the same order, and who is “a” or “b” never changes. But the optimized code sorts the boxes according to X, so who is “a” or “b” changes depending on the box positions. So we cannot easily compare the two versions anymore, which makes testing and validating the code a bit problematic. One way to do it is to explicitly avoid the equality case in the input set, in which case the returned number of intersections becomes the same again for the two functions. Another way to do it is to pre-sort the boxes along X, to ensure that the brute-force code will process the boxes in the same order as the SIMD code. If you do all that (change the AABB::Intersect() function, pre-sort the boxes), the validity tests always pass and the number of pairs is always the same for the two functions. The code is not broken!

However, you may find these difficulties a bit too much to swallow. Fair enough. Since the SIMD function is harder to test and validate than before, we labeled it "unsafe". And before discussing the performance results, we're going to see how to make it "safe", i.e. how to make it compute exactly the same overlaps as the scalar code, in all cases.

Let’s go back to the initial scalar code:

Right, so the problem was that with a’s and b’s on the same side, we get a mix of “<” and “>” operators. To fix this, we could just multiply some of the components by -1. Which would give:
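As a sketch (the actual code is an image in the original post), negating the max components of (a) and the min components of (b) turns every comparison into a ">":

if( a.mMinY >  b.mMaxY   // (1) unchanged
|| -a.mMaxY > -b.mMinY   // (2): same test as a.mMaxY < b.mMinY
||  a.mMinZ >  b.mMaxZ   // (3) unchanged
|| -a.mMaxZ > -b.mMinZ)  // (4): same test as a.mMaxZ < b.mMinZ
    return FALSE;
return TRUE;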

And… that just works. It adds some overhead though, in particular one extra load and one extra multiply within the inner loop (since we need to fix both boxes).

But we can improve this. Since version 6b we are parsing internal bounds arrays in the main loop, rather than the user’s input array. So we are free to store the bounds in any form we want, and in particular we can pre-multiply mMin.y and mMin.z by -1.0. That way we don’t need to do the multiply on box (b) within the inner loop.

On the other hand we still need to flip the signs of mMax.y and mMax.z for box (a), and since we pre-multiplied the mins (which has an effect on box (a) as well, it comes from the same buffer), the net result is that we now need to multiply the whole box (a) by -1.0 at runtime. This makes the code simpler, and since box (a) is constant for the duration of the inner loop, the impact on performance is minimal. In the end, the fix to go from the unsafe version to the safe version is just two additional instructions (_mm_load1_ps + _mm_mul_ps) on the constant box. That’s all!
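In terms of intrinsics, the fix looks something like this (a sketch: only the two added intrinsics are stated above, the variable names are assumptions):

const float minus1 = -1.0f;
// Constant box (a): reorder once with the shuffle we already had...
__m128 Box0YZ = _mm_shuffle_ps(box0yz, box0yz, _MM_SHUFFLE(1, 0, 3, 2));
// ...then flip the whole box - these are the two extra instructions:
Box0YZ = _mm_mul_ps(Box0YZ, _mm_load1_ps(&minus1));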

And the impact on performance is invisible:

Home PC:

Complete test (brute force): found 11811 intersections in 781499 K-cycles.
34077 K-cycles.
32779 K-cycles.
32584 K-cycles.
33237 K-cycles.
32656 K-cycles.
32618 K-cycles.
33197 K-cycles.
32662 K-cycles.
32757 K-cycles.
32582 K-cycles.
32574 K-cycles.
32579 K-cycles.
32592 K-cycles.
32565 K-cycles.
32867 K-cycles.
32591 K-cycles.
Complete test (box pruning): found 11811 intersections in 32565 K-cycles.

Office PC:

Complete test (brute force): found 11811 intersections in 822472 K-cycles.
16558 K-cycles.
16051 K-cycles.
15127 K-cycles.
15324 K-cycles.
15667 K-cycles.
15169 K-cycles.
15117 K-cycles.
15324 K-cycles.
15117 K-cycles.
15902 K-cycles.
16729 K-cycles.
15860 K-cycles.
15937 K-cycles.
15218 K-cycles.
15356 K-cycles.
15116 K-cycles.
Complete test (box pruning): found 11811 intersections in 15116 K-cycles.

The gains are summarized here:

Home PC            | Timings (K-Cycles) | Delta (K-Cycles) | Speedup | Overall X factor
(Version1)         | (101662)           |                  |         |
Version2 - base    | 98822              | 0                | 0%      | 1.0
Version3           | 93138              | ~5600            | ~5%     | ~1.06
Version4           | 81834              | ~11000           | ~12%    | ~1.20
Version5           | 78140              | ~3600            | ~4%     | ~1.26
Version6a          | 60579              | ~17000           | ~22%    | ~1.63
Version6b          | 41605              | ~18000           | ~31%    | ~2.37
(Version7)         | (40906)            | -                | -       | -
(Version8)         | (31383)            | (~10000)         | (~24%)  | (~3.14)
Version9a          | 34486              | ~7100            | ~17%    | ~2.86
Version9b - unsafe | 32477              | ~2000            | ~5%     | ~3.04
Version9b - safe   | 32565              | ~1900            | ~5%     | ~3.03

Office PC          | Timings (K-Cycles) | Delta (K-Cycles) | Speedup | Overall X factor
(Version1)         | (96203)            |                  |         |
Version2 - base    | 92885              | 0                | 0%      | 1.0
Version3           | 88352              | ~4500            | ~5%     | ~1.05
Version4           | 77156              | ~11000           | ~12%    | ~1.20
Version5           | 73778              | ~3300            | ~4%     | ~1.25
Version6a          | 58451              | ~15000           | ~20%    | ~1.58
Version6b          | 45634              | ~12000           | ~21%    | ~2.03
(Version7)         | (43987)            | -                | -       | -
(Version8)         | (29083)            | (~16000)         | (~36%)  | (~3.19)
Version9a          | 31864              | ~13000           | ~30%    | ~2.91
Version9b - unsafe | 15097              | ~16000           | ~52%    | ~6.15
Version9b - safe   | 15116              | ~16000           | ~52%    | ~6.14

This is basically the same speed as the unsafe version. Yay!

We are still going to keep the two unsafe and safe versions around for a while, in case further optimizations work better with one of them.

Note how all of this came together thanks to previous optimizations:

  • We were able to use a SIMD comparison because we dropped the user-defined axes in version 5.
  • We were able to create a SIMD-friendly AABB class because we now parse internal buffers instead of the user’s buffers, after version 6b.
  • For the same reason we were able to go from the unsafe version to the safe version “for free” because we could store our boxes pre-flipped. This was made possible by version 6b as well.

That’s quite enough for one post so we will investigate the remaining mystery next time: why are the timings so different between the home PC and the office PC?

What we learnt:

The same SIMD code can have very different performance on different machines.

Therefore it is important to test the results on different machines. If you only tested this on the home PC, after writing two SIMD versions (9a / 9b) that end up slower than a scalar version (8), you might be tempted to give up. You need to see the results on the office PC to remain motivated.

There’s a ripple effect with optimizations: some of them may have value not because of their limited performance gains, but simply because they open the door for more optimizations further down the road.

GitHub code for part 9b

Box pruning revisited - part 9a - SIMD

Tuesday, February 14th, 2017

Part 9a - SIMD

For this post, we start again from version 6b, ignoring our experiments with integer comparisons and branches in versions 7 and 8. That is, we go back to this version:

The reason for doing so is simply that the above function is easier to work with: these 4 comparisons are a perfect fit for SIMD. So we will try to replace them with a single SIMD comparison, and thus, a single branch. We can skip the work we did in version 8 since SIMD will replicate this naturally.

After the modification we did in part 5, we now read y’s and z’s directly there, and the code is ripe for SIMDifying. It would have been much more difficult to see or consider doing with the initial implementation, where the axes came from a user-defined parameter. So this is the moment where we harvest the “great things coming from small beginnings”.

Doing so is not entirely trivial though. There is SIMD and SIMD. Converting something to SIMD does not guarantee performance gains. A naive conversion can in fact easily be slower than the scalar code you started from.

But let’s give it a try. We want to compute:

So we could just load these elements in two SSE registers, do a single SSE comparison, use movemask to get the results, and… done? The modifications are very simple, just replace this:

With this:
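The replacement is an image in the original post; a naive sketch matching the description (two _mm_set_ps calls, a single SIMD compare, a movemask - the exact form and names are assumptions) would be:

#include <xmmintrin.h>  // SSE

struct SIMD_AABB { float mMinX, mMaxX, mMinY, mMinZ, mMaxY, mMaxZ; };  // assumed layout

// One lane per scalar comparison: any lane set means "no overlap".
static inline int Intersect2D(const SIMD_AABB& a, const SIMD_AABB& b)
{
    const __m128 lhs = _mm_set_ps(a.mMinY, b.mMinY, a.mMinZ, b.mMinZ);
    const __m128 rhs = _mm_set_ps(b.mMaxY, a.mMaxY, b.mMaxZ, a.mMaxZ);
    return _mm_movemask_ps(_mm_cmpgt_ps(lhs, rhs))==0;
}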

It is straightforward and it works:

Home PC:

Complete test (brute force): found 11811 intersections in 781594 K-cycles.
38231 K-cycles.
34532 K-cycles.
34598 K-cycles.
34996 K-cycles.
34546 K-cycles.
34486 K-cycles.
34649 K-cycles.
34510 K-cycles.
34709 K-cycles.
34533 K-cycles.
34496 K-cycles.
34621 K-cycles.
34556 K-cycles.
34867 K-cycles.
35392 K-cycles.
34666 K-cycles.
Complete test (box pruning): found 11811 intersections in 34486 K-cycles.

Office PC:

Complete test (brute force): found 11811 intersections in 810655 K-cycles.
32847 K-cycles.
32021 K-cycles.
32046 K-cycles.
31932 K-cycles.
32283 K-cycles.
32139 K-cycles.
31864 K-cycles.
32325 K-cycles.
33420 K-cycles.
32087 K-cycles.
31969 K-cycles.
32018 K-cycles.
32879 K-cycles.
32588 K-cycles.
32214 K-cycles.
31875 K-cycles.
Complete test (box pruning): found 11811 intersections in 31864 K-cycles.

The gains are summarized here:

Home PC            | Timings (K-Cycles) | Delta (K-Cycles) | Speedup | Overall X factor
(Version1)         | (101662)           |                  |         |
Version2 - base    | 98822              | 0                | 0%      | 1.0
Version3           | 93138              | ~5600            | ~5%     | ~1.06
Version4           | 81834              | ~11000           | ~12%    | ~1.20
Version5           | 78140              | ~3600            | ~4%     | ~1.26
Version6a          | 60579              | ~17000           | ~22%    | ~1.63
Version6b          | 41605              | ~18000           | ~31%    | ~2.37
(Version7)         | (40906)            | -                | -       | -
(Version8)         | (31383)            | (~10000)         | (~24%)  | (~3.14)
Version9a          | 34486              | ~7100            | ~17%    | ~2.86

Office PC

Version            Timings (K-Cycles)   Delta (K-Cycles)   Speedup   Overall X factor
(Version1)         (96203)              -                  -         -
Version2 - base    92885                0                  0%        1.0
Version3           88352                ~4500              ~5%       ~1.05
Version4           77156                ~11000             ~12%      ~1.20
Version5           73778                ~3300              ~4%       ~1.25
Version6a          58451                ~15000             ~20%      ~1.58
Version6b          45634                ~12000             ~21%      ~2.03
(Version7)         (43987)              -                  -         -
(Version8)         (29083)              (~16000)           (~36%)    (~3.19)
Version9a          31864                ~13000             ~30%      ~2.91

The gains for version 9a are relative to version 6b (we ignored versions 7 and 8). And we see something interesting here: our SIMD version “worked”, it did provide clear gains compared to version 6b, but… it remains slower than our branchless scalar version, version 8.

Oops.

This is the source of never-ending myths and confusion about SIMD on PC: if you started from version 6b, you are likely to conclude that adding SIMD is easy and provides easy gains. If you started from version 8, you are likely to conclude that SIMD is slower than “optimized” scalar code, and not worth the trouble. And if you started from version 8, “optimized” the code with SIMD, and concluded without benchmarking that Version9a was faster, then you are a bloody fool.

In any case, you’d all be wrong. As previously mentioned, there is SIMD and SIMD. Our initial attempt here is just a little bit rough and, indeed, suboptimal.

How do we see it?

As usual: always check the disassembly!

If you do, you will discover that the above SIMD code, which is what I would call the naive version, gave birth to this monstrosity:

Soooo… errrr… fair enough: that’s only one comparison (cmpltps) and one branch (test/jne) indeed. But that’s still quite a bit more than what we asked for.

In particular, one may wonder where the unpack instructions (unpcklps) are coming from - we certainly didn’t use _mm_unpacklo_ps in our code!

So what’s going on?

Well, what’s going on is that we fell into the intrinsics trap. There is usually a one-to-one mapping between intrinsics and assembly instructions, but not always. Some of them are “composite” intrinsics that can generate an arbitrary number of assembly instructions - whatever it takes to implement the desired action. This is the case for _mm_set_ps, and this is where all the movss and unpcklps are coming from. The two _mm_set_ps intrinsics generated 14 assembly instructions here. A lot of people assume that using intrinsics is “the same” as writing in assembly directly: it is, until it isn’t. Sometimes it hides a lot of important details behind syntactic sugar, and that can bite you hard if you are not aware of it.

So here’s our lesson: if your SIMD code uses _mm_set_ps, it is probably not optimal.
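One way out is to lay out the data so that the values we want in a SIMD register already sit next to each other in memory, so they can be fetched with a single load instead of being assembled piecemeal. A minimal sketch of the idea (the layout and names are illustrative):

```cpp
#include <xmmintrin.h>

// Illustrative layout: the four floats we want live next to each other.
struct PackedYZ
{
    float mMinY, mMinZ, mMaxY, mMaxZ;
};

static inline __m128 LoadBoxYZ(const PackedYZ& box)
{
    // One movups, instead of the movss/unpcklps sequence generated by _mm_set_ps.
    return _mm_loadu_ps(&box.mMinY);
}
```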

While we’re at it, there is another lesson to learn here: the compiler won’t do the SIMD job for you. We did not have to reorder the data here, we did not have to align structures on 16-byte boundaries, we did not have to do complicated things, the change was as straightforward as it gets.

And yet, even in this simple case, with the /SSE2 flag enabled, the compiler still didn’t generate the naive SIMD version on its own. So once again: people claiming that using the /SSE2 flag will automatically make the code “4x faster” (as I read online a few times) are tragically, comically wrong. For that kind of speedup, you need to work hard.

And work hard, we will.

Back to the drawing board.

What we learnt:

Writing something in SIMD does not guarantee better performance than a scalar version.

There’s no 1:1 mapping between intrinsics and assembly instructions when you use “composite” intrinsics.

Don’t use _mm_set_ps.

The compiler will not even write naive SIMD for you.

GitHub code for part 9a

Box pruning revisited - part 8 - branchless overlap test

Tuesday, February 14th, 2017

Part 8 – branchless overlap test

For this post, we start again from version 6b, ignoring our experiments with integer comparisons in version 7. We are going to stay focused on comparisons though, and looking at this function:

It contains four comparisons, and four branches. The box values are going to be arbitrary, so these branches are the worst you can get: branch prediction will essentially never work for them. If we check the disassembly, it looks like this:

Pretty much what you’d expect from the C++ code, and all the comparisons and branches are there indeed.

Comparisons and branches. We often talk about these two as if they always went hand in hand, but they are very different things and it’s important to distinguish between them. In terms of performance, one of them is more expensive than the other.

Let me demonstrate. We can rewrite the overlap function this way:

Compare it to the previous function and you can easily see that the comparisons are still there in the C++ code. We still compare max floats to min floats, four times; none of that has changed. We just wrote the code slightly differently.

Compile it, run it, and… You get exactly the same timings as before. In fact, you get exactly the same disassembly. Thus, the branches are also all there. Well really, that makes sense.

Now, however, for something that perhaps does not make as much sense, remove three characters from this function and write it this way instead:
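In sketch form (illustrative struct layout and names), the two variants differ by nothing more than ‘||’ versus ‘|’:

```cpp
struct Point { float x, y, z; };
struct AABB  { Point mMin, mMax; };

// With '||' the comparisons short-circuit: the compiler emits one branch per test,
// and sometimes only one comparison out of four is actually performed.
static inline bool Intersect2D_Branchy(const AABB& a, const AABB& b)
{
    if(b.mMax.y < a.mMin.y || a.mMax.y < b.mMin.y
    || b.mMax.z < a.mMin.z || a.mMax.z < b.mMin.z)
        return false;
    return true;
}

// Remove three characters ('||' -> '|') and all four comparisons are evaluated
// and combined; the compiler can use setcc-style instructions and the function
// body becomes branchless.
static inline bool Intersect2D_Branchless(const AABB& a, const AABB& b)
{
    if(b.mMax.y < a.mMin.y | a.mMax.y < b.mMin.y
    | b.mMax.z < a.mMin.z | a.mMax.z < b.mMin.z)
        return false;
    return true;
}
```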

Pretty much the same, right?

Well, not quite. The compiler begs to differ, and produces the following disassembly:

The comparisons are still here (the “comiss”) but the branches are mostly gone: there’s only one branch instead of four. That’s all thanks to the “seta” instructions, which can store the results of our comparisons directly in a register, in a branchless way. Technically the final branch is not even in the overlap function itself, it’s just part of the calling code that uses the returned value. The overlap function itself became branchless.

Using | instead of || told the compiler it was necessary to perform all comparisons to create the final bool, while the previous version had early-exits (and sometimes only performed one comparison instead of four). The new code is also larger, since it needs to concatenate the bools to create the final return value. But using more instructions does not always mean slower results:

Home PC

Complete test (brute force): found 11811 intersections in 778666 K-cycles.
32497 K-cycles.
31411 K-cycles.
31402 K-cycles.
31637 K-cycles.
31415 K-cycles.
31440 K-cycles.
31404 K-cycles.
31412 K-cycles.
31388 K-cycles.
31596 K-cycles.
31411 K-cycles.
31383 K-cycles.
31595 K-cycles.
31405 K-cycles.
31383 K-cycles.
31392 K-cycles.
Complete test (box pruning): found 11811 intersections in 31383 K-cycles.

Office PC:

Complete test (brute force): found 11811 intersections in 819311 K-cycles.
29897 K-cycles.
29563 K-cycles.
29620 K-cycles.
29430 K-cycles.
29641 K-cycles.
29352 K-cycles.
29363 K-cycles.
29305 K-cycles.
30214 K-cycles.
29538 K-cycles.
31417 K-cycles.
30416 K-cycles.
30112 K-cycles.
30443 K-cycles.
30105 K-cycles.
29083 K-cycles.
Complete test (box pruning): found 11811 intersections in 29083 K-cycles.

The gains are summarized here:

Home PC

Version            Timings (K-Cycles)   Delta (K-Cycles)   Speedup   Overall X factor
(Version1)         (101662)             -                  -         -
Version2 - base    98822                0                  0%        1.0
Version3           93138                ~5600              ~5%       ~1.06
Version4           81834                ~11000             ~12%      ~1.20
Version5           78140                ~3600              ~4%       ~1.26
Version6a          60579                ~17000             ~22%      ~1.63
Version6b          41605                ~18000             ~31%      ~2.37
(Version7)         (40906)              -                  -         -
Version8           31383                ~10000             ~24%      ~3.14

Office PC

Version            Timings (K-Cycles)   Delta (K-Cycles)   Speedup   Overall X factor
(Version1)         (96203)              -                  -         -
Version2 - base    92885                0                  0%        1.0
Version3           88352                ~4500              ~5%       ~1.05
Version4           77156                ~11000             ~12%      ~1.20
Version5           73778                ~3300              ~4%       ~1.25
Version6a          58451                ~15000             ~20%      ~1.58
Version6b          45634                ~12000             ~21%      ~2.03
(Version7)         (43987)              -                  -         -
Version8           29083                ~16000             ~36%      ~3.19

Well, we are now 3 times faster than when we started. That’s another milestone I suppose.

It is important to note that going branchless does not always pay off. Branches are expensive if mispredicted, but cheap if correctly predicted. Also, a branch that early-exits and skips an expensive part of the code can still be cheaper overall than running the expensive part of the code. It’s obvious, but multiple times I’ve seen people rewrite functions in a branchless way, and just assume that the result was better - without benchmarking. Unfortunately the results were often slower. There is absolutely no guarantee here. If you blindly follow a rule or a recipe, removing branches just for the sake of removing branches, you won’t go very far. This is a case where you need to test, profile, use your brain.

What we learnt:

Comparisons and branches are two different things.

Mispredicted branches are costly, rewriting in a branchless way can help here.

Don’t assume the results will be faster. Profile the difference.

The C++ versions can look very similar (only 3 extra characters) but have a very different disassembly, and very different performance. You need to look past the C++; what matters is the generated code.

Are we done?

No!

Next time we will finally attack the SIMD version.

GitHub code for part 8

Box pruning revisited - part 7 - integer comparisons

Tuesday, February 14th, 2017

Part 7 – integer comparisons

I spent a lot of time using PIX on the Xbox360. This is a profiling tool, which incidentally was just released on PC (I haven’t tried it yet, so I don’t know how it compares to the Xbox version).

On the Xbox it typically identified and reported 3 main problems: L2 cache misses, FCMPs, and LHS.

Well, granted: it reported a lot more than this, but these guys were the typical issues one had to pay attention to.

We just took care of L2 cache misses, or data cache misses, in the previous posts.

LHS (for Load-Hit-Store) were a nuisance specific to the Xbox360 that we will ignore for now. They do exist on PC, but their performance penalty is not as severe as on the previous-gen Xbox.

So that leaves the FCMPs, i.e. the float comparisons. Weird thing, the FCMPs. They used to be very costly on old Pentiums IIRC. And then I think for a while they were ok. And then with the Xbox360 suddenly they were bad again. These things come and go.

This gave birth to a whole bunch of optimizations where FCMPs got replaced with regular integer comparisons, which were faster. It made a huge difference on the 360, and I kept doing this kind of stuff as a habit. But to be fair, this may not be needed anymore on modern PCs. And that’s what we are going to investigate today.

We have to do it now, before starting the SIMD work, because it might also affect the SIMD code. There is floating-point SIMD and integer SIMD, and with SSE2 the instruction sets are quite different. It is unclear whether there would be a benefit to using one or the other at this point, so we will investigate, and maybe try both.

But SIMD will bring some complications to the table, so it is easier to start slowly and first play with integer comparisons while the code is still scalar.

I already wrote about this issue in the already mentioned SAP document. Go read Appendix A this time, page 22. I’ll wait.

Back? Good.

As mentioned in the document, the changes are quite straightforward. We replace our AABB with its integer counterpart:
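In sketch form (member names are illustrative):

```cpp
typedef unsigned int udword;

// Same layout as the float AABB, but storing encoded (sortable) integers.
struct IAABB
{
    udword mMinX, mMinY, mMinZ;
    udword mMaxX, mMaxY, mMaxZ;
};
```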

Then we encode its min & max values using the function from the SAP document:
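For reference, here is a sketch of the standard trick that function relies on (the SAP document has the exact implementation):

```cpp
#include <cstring>

typedef unsigned int udword;

// Maps an IEEE-754 float to an unsigned integer with the same ordering,
// so that float comparisons can be replaced with integer comparisons.
static udword EncodeFloat(float value)
{
    udword ir;
    std::memcpy(&ir, &value, sizeof(ir));   // reinterpret the bits
    if(ir & 0x80000000)                     // negative float?
        ir = ~ir;                           // reverse the ordering of negative numbers
    else
        ir |= 0x80000000;                   // move positive numbers above all negatives
    return ir;
}
```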

And then we just replace some “float” with “udword” in the main loop, float comparisons silently become integer comparisons, and that’s pretty much it really. Once you understand the encoding function, the modifications are trivial. Well, as long as we’re not using SIMD at least.

The results are as follows:

Home PC:

Complete test (brute force): found 11811 intersections in 782125 K-cycles.
41727 K-cycles.
41126 K-cycles.
40956 K-cycles.
41431 K-cycles.
41042 K-cycles.
41169 K-cycles.
40928 K-cycles.
40916 K-cycles.
40912 K-cycles.
40906 K-cycles.
41116 K-cycles.
40927 K-cycles.
40911 K-cycles.
40969 K-cycles.
41143 K-cycles.
41251 K-cycles.
Complete test (box pruning): found 11811 intersections in 40906 K-cycles.

Office PC:

Complete test (brute force): found 11811 intersections in 823453 K-cycles.
44674 K-cycles.
45127 K-cycles.
45929 K-cycles.
44652 K-cycles.
44192 K-cycles.
44502 K-cycles.
45403 K-cycles.
44062 K-cycles.
44859 K-cycles.
44060 K-cycles.
44679 K-cycles.
43987 K-cycles.
44537 K-cycles.
44417 K-cycles.
44842 K-cycles.
44395 K-cycles.
Complete test (box pruning): found 11811 intersections in 43987 K-cycles.

The gains are summarized here:

Home PC

Version            Timings (K-Cycles)   Delta (K-Cycles)   Speedup   Overall X factor
(Version1)         (101662)             -                  -         -
Version2 - base    98822                0                  0%        1.0
Version3           93138                ~5600              ~5%       ~1.06
Version4           81834                ~11000             ~12%      ~1.20
Version5           78140                ~3600              ~4%       ~1.26
Version6a          60579                ~17000             ~22%      ~1.63
Version6b          41605                ~18000             ~31%      ~2.37
Version7           40906                -                  -         -

Office PC

Version            Timings (K-Cycles)   Delta (K-Cycles)   Speedup   Overall X factor
(Version1)         (96203)              -                  -         -
Version2 - base    92885                0                  0%        1.0
Version3           88352                ~4500              ~5%       ~1.05
Version4           77156                ~11000             ~12%      ~1.20
Version5           73778                ~3300              ~4%       ~1.25
Version6a          58451                ~15000             ~20%      ~1.58
Version6b          45634                ~12000             ~21%      ~2.03
Version7           43987                -                  -         -

Sooooo….. That one is not very conclusive. It appears to be a tiny bit faster, but this is far from the speedups we used to get on the Xbox.

I would say that the jury is still out on this one, but either way, the speedups are too small to justify the trouble at this point. So we will ignore integer comparisons for now and perhaps revisit this later with SIMD.

The results are curious though, because this stuff used to matter a lot. What happened?

Again, you can design a small experiment to reveal the truth. The only difference between version 6b and version 7 is the switch from floating-point to integer comparisons.

So what we can do is compile both without the SSE2 flag, i.e. with the explicit /arch:IA32 flag to ensure that x87 code is used.

Here are the results on the home PC:

Version 6b with /SSE2 flag    41605 K-cycles
Version 6b with x87 code      52163 K-cycles
Version 7 with /SSE2 flag     40906 K-cycles
Version 7 with x87 code       40897 K-cycles

Ah-ah! So that was it, look at this!

With the x87 code, switching from float to integer comparisons has a huge impact.

With the SSE2 code, switching from float to integer comparisons has almost no impact.

Using integer comparisons with the x87 code gave the same speedup as switching to SSE2 code with float comparisons.

So that’s what happened: using integer comparisons gave big speedups in the past with x87 code (and even more so on the Xbox 360), but compiling with the /SSE2 flag gives the same speedups out-of-the-box, making this optimization obsolete.

And that’s our lesson of the day: don’t assume anything. Question everything. Constantly revisit your previous conclusions.

We will now ignore version 7 entirely: the minor gains are not worth the trouble.

What we learnt:

Replacing float comparisons with integer comparisons in scalar code might still provide some minor gains, but not as much as in the past. Using the SSE2 flag now gives essentially the same speedups out-of-the-box.

One great optimization one day on one platform might become worthless (or even detrimental) another day, on another platform.

You can learn something even from a “failed” optimization. Things that did not work are also worth reporting and talking about.

Ok! With this question out of the way, we’re ready to tackle the SIMD version.

See you next time.

GitHub code for part 7

Box pruning revisited - part 6b - less cache misses

Tuesday, February 14th, 2017

Part 6b – less cache misses

Last time we took care of our input data, and packed it in a nice flat linear array. This is necessary but not sufficient to guarantee high performance: it only helps if you actually read that input array sequentially. But if you read random boxes within the input array, you still get cache misses.

But do we still get cache misses? We got great speedups with our previous change so aren’t we done? How do we know if there is more to gain from “the same” optimization?

One way is simply to look at the code. With some experience you can relatively easily spot the potential problems and “see” if the code is sound or not.

Another way would be to use a profiler like V-Tune, and check how many cache misses you get in that function.

And sometimes you can simply design small experiments, small tests that artificially remove the potential problems. If the performance doesn’t change, then you know you’re good, nothing left to do. If the performance does change, then you know you have some more work ahead of you.

This probably sounds a bit abstract so let me show you what I mean. The code is like this now:

So we’re going to read from this badly-named “list” array. We look for “list” in the code. We first find this:

This reads the array sequentially, which is theoretically optimal, so no problem here. Granted: we don’t use all the data in the cache line we just fetched, but let’s ignore this. For now we are just checking whether we read the array sequentially or in random order.

Next we find this:

With:

And “Sorted” is the output of the radix sort, which sorted boxes along the X axis.

Well, this one is not good. The radix implementation is a “rank sorter”, i.e. it returns the sorted indices of the input bounds. But it does not sort the input array itself (contrary to what std::sort would do, for example). This means that we keep reading random boxes in the middle of the box-pruning loop. Oops.

And the same problem occurs for Index1 later on:

So this is what I mentioned above: you read the code, and you can easily spot theoretical problems.

And then you can sometimes design tests to artificially remove the problem and see if performance changes. In this case we’re lucky because it’s easy to do: we can just pre-sort the input array (the bounds) before calling the box-pruning function. That way the sorting inside the function will have no effect, the radix sort will return an identity permutation, the code will read the input array sequentially, and we will see how much time we waste in cache misses.

Quickly said, quickly done, the change is just like this in Main.cpp:
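Something along these lines (a sketch; apart from gPresortBounds, which is mentioned below, the names and types are illustrative):

```cpp
#include <algorithm>

typedef unsigned int udword;

struct Point { float x, y, z; };
struct AABB  { Point mMin, mMax; };

static bool gPresortBounds = false;

// Pre-sort the input bounds along X so that the radix sort inside the
// box-pruning function returns an identity permutation, and the main loop
// then reads the input array sequentially.
static void PresortBounds(AABB* bounds, udword nb)
{
    std::sort(bounds, bounds + nb,
              [](const AABB& a, const AABB& b){ return a.mMin.x < b.mMin.x; });
}
```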

Change gPresortBounds from false to true, run the test again… and….

Home PC:

Complete test (brute force): found 11811 intersections in 355897 K-cycles.

Complete test (box pruning): found 11811 intersections in 46586 K-cycles.

Office PC:

Complete test (brute force): found 11811 intersections in 312868 K-cycles.

Complete test (box pruning): found 11811 intersections in 49315 K-cycles.

Wo-ah.

Now that’s interesting. The function became faster again. This means that we are still suffering from cache misses out there, and that there is more to do to address this issue.

It is also a great example of why data-oriented design matters: the code did not change at all. It is the exact same function as before, same number of instructions, same position in the executable. It computes the same thing, returns the same number of overlapping pairs.

But somehow it’s 15% / 20% faster, just because we reorganized the input data. Ignore this, focus only on the code, and that’s the amount of performance you leave on the table in this case.

On a side note, we have seen this kind of thing in the past, in different contexts. For example, triangle lists reorganized to take advantage of the vertex cache were faster to render - sometimes faster than strips. Even on older machines: making the code run faster by just changing the data is very reminiscent of what Leonard / Oxygene reported when working on his 16*16 “sprite record” demos on Atari ST. Several times in these experiments the Atari ST code didn’t change at all, but he was able to display more and more sprites by improving his “data builder” (a tool running on a PC). This had nothing to do with cache misses, but it was the same observation: you get the best results by optimizing both the code and the data. In his case, the data was not even optimized on the same machine/architecture.

So, anyway: we identified the problem. Now how do we solve it?

Well, the obvious way: we just create a secondary bounds array, sorted in the desired order, and we parse that sorted array instead of the original one. Basically we simply replicate what we did above to presort bounds, inside the box pruning function.

So, yes, we need more memory. And yes, the setup code becomes more complex (but it was not the bottleneck at all, remember? So it doesn’t matter).

On the other hand, now that the array itself is sorted, there is no need to work with the sorted indices anymore: they will always be the identity permutation. So the setup code becomes more complex but the inner loop’s code can actually be further simplified.

One complication is that we need to introduce a remapping between the user’s box indices and our now entirely sequential internal box indices. Fortunately we already have the remap table: it is the same as the sorted indices returned by the radix sort! So it all fits perfectly together and the only thing we need to do is remap the box indices after an overlap has been found, before writing them to the output buffer. In other words: we removed indirections in the frequent case (for each parsed box), and added indirections in the infrequent case (when an overlap is actually found).
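In sketch form (names are illustrative), the remapping only kicks in when a pair is actually reported:

```cpp
#include <vector>

typedef unsigned int udword;

// 'remap' is the permutation returned by the radix sort, reused to map
// internal (sorted) box indices back to the user's box indices.
static void ReportPair(std::vector<udword>& pairs, const udword* remap,
                       udword internalIndex0, udword internalIndex1)
{
    // Indirections only happen here, in the infrequent "overlap found" path.
    pairs.push_back(remap[internalIndex0]);
    pairs.push_back(remap[internalIndex1]);
}
```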

The results speak for themselves.

Home PC:

Complete test (brute force): found 11811 intersections in 781350 K-cycles.
45584 K-cycles.
44734 K-cycles.
41679 K-cycles.
41810 K-cycles.
41654 K-cycles.
41623 K-cycles.
41634 K-cycles.
41605 K-cycles.
41779 K-cycles.
41633 K-cycles.
42037 K-cycles.
41752 K-cycles.
41836 K-cycles.
41639 K-cycles.
41618 K-cycles.
41636 K-cycles.
Complete test (box pruning): found 11811 intersections in 41605 K-cycles.

Office PC:

Complete test (brute force): found 11811 intersections in 813615 K-cycles.
46985 K-cycles.
46282 K-cycles.
47889 K-cycles.
47757 K-cycles.
55044 K-cycles.
52660 K-cycles.
47923 K-cycles.
53199 K-cycles.
47410 K-cycles.
46192 K-cycles.
45860 K-cycles.
46140 K-cycles.
45634 K-cycles.
46274 K-cycles.
47077 K-cycles.
46284 K-cycles.
Complete test (box pruning): found 11811 intersections in 45634 K-cycles.

The gains are summarized here:

Home PC

Version            Timings (K-Cycles)   Delta (K-Cycles)   Speedup   Overall X factor
(Version1)         (101662)             -                  -         -
Version2 - base    98822                0                  0%        1.0
Version3           93138                ~5600              ~5%       ~1.06
Version4           81834                ~11000             ~12%      ~1.20
Version5           78140                ~3600              ~4%       ~1.26
Version6a          60579                ~17000             ~22%      ~1.63
Version6b          41605                ~18000             ~31%      ~2.37

Office PC

Version            Timings (K-Cycles)   Delta (K-Cycles)   Speedup   Overall X factor
(Version1)         (96203)              -                  -         -
Version2 - base    92885                0                  0%        1.0
Version3           88352                ~4500              ~5%       ~1.05
Version4           77156                ~11000             ~12%      ~1.20
Version5           73778                ~3300              ~4%       ~1.25
Version6a          58451                ~15000             ~20%      ~1.58
Version6b          45634                ~12000             ~21%      ~2.03

This is even faster than what we got during our previous experiment with pre-sorted bounds. Which means that not only did we take care of the cache misses, but the extra simplifications of the main loop also provided additional gains. Bonus!

And with that, we became 2X faster than the version we started from. That’s a milestone.

What we learnt:

The same as in the previous post, but it’s so important it’s worth repeating: cache misses are more costly than anything else. If you can only optimize one thing, that’s what you should focus on.

Before we wrap this one up, let’s come back to the loss of genericity we mentioned before. At the end of part 5 we said that we could get back our generic version for no additional cost. That is, we could put back the user-defined axes and don’t pay a performance price for it.

It should be clear why this is the case. We are now parsing internal buffers in the main loop, instead of the user-provided source array. So we could very easily flip the box coordinates while filling up these internal buffers, in the setup code. The main inner loop would remain untouched, so it would give back the user-defined axes virtually for free.

So that’s one loose end taken care of. But there is another big one left, also mentioned in part 5: SIMD.

We will slowly move towards it in next post.

GitHub code for part 6b

Box pruning revisited - part 6a - data-oriented design

Tuesday, February 14th, 2017

Part 6a – data-oriented design

There have been a lot of talks and discussions in the past about “data-oriented programming” and how it leads to faster code. Just Google the term along with “Mike Acton” and you will find plenty of them. I don’t need to repeat everything here: Mike is right. Listen to Mike.

In practical terms: we need to think about cache misses now. What we do with the data is one thing, but how we access the data is another equally important thing.

As far as our box pruning function is concerned, the data is pretty simple: we read an array of AABB pointers in input, and we write an array of pairs in output.

Reading from and writing to linear arrays is the most cache-friendly thing one can do, so we’re not starting from a bad place (there are no linked lists or complicated structures here). However we are reading an array of AABB pointers, not an array of AABBs. And that extra indirection is probably costly.

I remember using this design for practical reasons: I wanted the function to be user-friendly. At the time I was using a classical “object-oriented” design for my game objects, and I had something like this:
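Roughly something like this - a sketch, with illustrative class and member names:

```cpp
struct Point     { float x, y, z; };
struct AABB      { Point mMin, mMax; };
struct Matrix4x4 { float m[4][4]; };
class  Mesh;

// Classic "object-oriented" layout: each game object owns its bounds.
class GameObject
{
public:
    AABB        mBounds;      // world-space bounds, embedded in the object
    Matrix4x4   mTransform;
    Mesh*       mMesh;
    // ...plenty of other members...
};
```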

The AABB of each object was embedded within the game object class, vanilla C++ design. And I didn’t want the library to force users to first gather and copy the bounds in a separate bounds array just to be able to call the function. So instead, the function accepts an array of AABB pointers, so that it can read the AABBs directly from the game objects. The AABBs can neatly stay as a member of the objects, there is no need to duplicate them in a separate array and use more memory, no need to break up the neat objects into disjoint classes, it all looked quite nice.

And I suppose it was. Nice.

But fast, no, that, it was not.

Reading the bounds directly from the game objects is an extraordinarily bad idea: it makes you read some scattered data from some random location for each box, it’s the perfect recipe for cache misses and basically the worst you can do.

You can solve it with prefetch calls, to some extent. But prefetching is a fragile solution that tends to break when code gets refactored, and making it work in a cross-platform way is tedious (different platforms may not even have the same cache line size, so you can easily imagine the complications).

A better idea is to switch to “data-oriented design” instead of “object-oriented design”. In our case it simply means that the bounds are extracted from the game objects and gathered in an array of bounds, typically called “bounds manager”. Something along these lines:
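Here is a sketch of that idea (illustrative names, not the exact code):

```cpp
#include <vector>

struct Point { float x, y, z; };
struct AABB  { Point mMin, mMax; };
class  GameObject;   // only referenced through pointers here

// "Bounds manager": all bounds packed in one flat, contiguous array,
// with a parallel array pointing back to the owning objects.
class BoundsManager
{
public:
    std::vector<AABB>        mBounds;
    std::vector<GameObject*> mOwners;   // mOwners[i] owns mBounds[i]
};
```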

Vanilla data-oriented design, that one.

There are pros & cons to both approaches, which are beyond the scope of this blog post. All I am concerned about here is the performance of the box pruning function, and in this respect the data-oriented design wins big time.

The modifications to the code are minimal. The function signature simply becomes:
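In sketch form (parameter and type names are illustrative, the exact signature lives in the repository), the extra indirection simply drops out of the parameter list:

```cpp
typedef unsigned int udword;
struct Point { float x, y, z; };
struct AABB  { Point mMin, mMax; };
class  Container;

// Before (version 5): an array of AABB pointers, one indirection per box.
bool CompleteBoxPruning(udword nb, const AABB** array, Container& pairs);

// After (version 6a): a flat array of AABBs, read directly.
bool CompleteBoxPruning(udword nb, const AABB* array, Container& pairs);
```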

And the corresponding indirections are removed from the code. That’s it. That’s the only difference. We don’t “optimize” the code otherwise. But the effect on performance is still dramatic:

Home PC:

Complete test (brute force): found 11811 intersections in 782034 K-cycles.
64260 K-cycles.
60884 K-cycles.
60645 K-cycles.
60586 K-cycles.
60814 K-cycles.
62650 K-cycles.
60591 K-cycles.
60697 K-cycles.
60833 K-cycles.
60579 K-cycles.
60601 K-cycles.
60905 K-cycles.
60596 K-cycles.
60836 K-cycles.
60673 K-cycles.
60585 K-cycles.
Complete test (box pruning): found 11811 intersections in 60579 K-cycles.

Office PC:

Complete test (brute force): found 11811 intersections in 810333 K-cycles.
59420 K-cycles.
58451 K-cycles.
59005 K-cycles.
65448 K-cycles.
64930 K-cycles.
62007 K-cycles.
59005 K-cycles.
58847 K-cycles.
58904 K-cycles.
58462 K-cycles.
58843 K-cycles.
58655 K-cycles.
58583 K-cycles.
59056 K-cycles.
58776 K-cycles.
58556 K-cycles.
Complete test (box pruning): found 11811 intersections in 58451 K-cycles.

The gains are summarized here:

Home PC

Version            Timings (K-Cycles)   Delta (K-Cycles)   Speedup   Overall X factor
(Version1)         (101662)             -                  -         -
Version2 - base    98822                0                  0%        1.0
Version3           93138                ~5600              ~5%       ~1.06
Version4           81834                ~11000             ~12%      ~1.20
Version5           78140                ~3600              ~4%       ~1.26
Version6a          60579                ~17000             ~22%      ~1.63

Office PC

Version            Timings (K-Cycles)   Delta (K-Cycles)   Speedup   Overall X factor
(Version1)         (96203)              -                  -         -
Version2 - base    92885                0                  0%        1.0
Version3           88352                ~4500              ~5%       ~1.05
Version4           77156                ~11000             ~12%      ~1.20
Version5           73778                ~3300              ~4%       ~1.25
Version6a          58451                ~15000             ~20%      ~1.58

The modifications to the code have been the smallest so far (really just replacing “->” with “.” here and there), and yet they gave the biggest speedup until now. In other words: our current best optimization did not come from modifying how we process the data (roughly, “the code”), or from changing the data itself, but from changing how we access the data. That’s a pretty powerful result.

What we learnt:

Choose data-oriented design over object-oriented design for performance.

The data access pattern is often more important than anything else.

We are not done yet though. There is a lot more to do w.r.t. optimizing our data access, as we will see next time. But this change was so small and yet so effective that it was worth isolating in a separate blog post to drive the point home.

GitHub code for part 6a

Box pruning revisited - part 5 - hardcoding

Tuesday, February 14th, 2017

Part 5 – hardcoding

Last time we saw how making the code less generic could also make it faster. This trend continues in this post.

There is another part of the box-pruning function that looks suspiciously over-generic. In the original code the sorting axes are defined by the user and sent as a parameter to the function:

In the initial code from 2002, I passed the following axes to the function:

Inside the function they are copied to local variables:

After that, the boxes are sorted along the main axis (Axis0), which essentially replicates the X-related part of the AABB::Intersect(const AABB& a) function. In other words, the sorting and scanning procedure emulates this line from AABB::Intersect():

And the remaining AABB::Intersect() tests for Y and Z are implemented using the remaining Axis1 and Axis2 variables:

This is all good and fine. Unfortunately it is also sub-optimal.

It was invisible to me when I first wrote the code, but I know now that fetching the axes from user-defined variables has a measurable impact on performance. I saw it happen multiple times in multiple algorithms. The cost is not always large, but it is always real: fetching the axes is going to be either an extra read from the stack, or one less register available to the optimizer to do its job - and there aren’t a lot of registers in the first place on x86. Either way it makes the addressing mode more complex, and if all these things happen in the middle of an inner loop, it quickly accumulates and becomes measurable.

So a much better idea is simply to hardcode the axes.

Well, a more efficient idea at least. Whether it is “better” or not is debatable.

Let’s see the effect on performance first, and debate later.

So the function sorted the boxes along X, and then did extra overlap tests on Y and Z within the most inner loop. The rationale for this default choice was that the up axis is usually either Y or Z, “never” X. You typically don’t want to sort along your up axis, because most games have a flat, essentially 2D world, even if they’re 3D games. If you project the bounds of such a world along the up axis, the projections will pretty much all overlap each-other, and this will greatly reduce the “pruning power” of the main sort. Basically the algorithm would then degenerate to the O(n^2) brute-force version.

Thus, the safest main sorting axis is X, because it is usually not the vertical axis. After that, and for similar reasons, it is usually slightly better to use the remaining non-vertical axis for “Axis1″ and keep the vertical axis last in “Axis2″ (the 2D overlap test can then early-exit more quickly).

One can also analyze the distribution of objects at runtime and select a different main axis each frame, depending on how objects are spatially organized at any given time. That’s what I did in my old Z-Collide library, eons ago.

Similarly, one could try to make the code more generic by removing the requirements that the input axes are the regular X, Y, Z coordinate axes. After all it is perfectly possible to project the boxes along an arbitrary 3D axis, so the main sorting axis could be anything, and the remaining other two axes could be derived from it in the same way a frame can be derived from just a view vector. Unfortunately making the code more generic in such a way, while “better” in theory (because the main sorting axis could be “optimal”), is a terrible idea in practice. You need to introduce dot-products to project the boxes instead of directly reading the bounds’ coordinates, it adds a lot of overhead, and the resulting code is ultimately slower than what you started with - even with a “better” main sorting axis.

No, as we discussed in part 4, what you need to do for performance is the opposite: make the code less generic.

And hardcode the axes.

Do this, and then that part goes away:

It’s replaced with simply:

Similarly, and especially, this also goes away:

It’s replaced with:

Which is implemented this way (note the hardcoding of axes):
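In sketch form (illustrative names; the generic variant stands for the old code path with user-defined axes), the difference boils down to this:

```cpp
struct Point { float x, y, z; };
struct AABB  { Point mMin, mMax; };

// Generic version: the axes are runtime values, so every access goes through
// an indexed addressing mode (the "eax+ecx*4+0Ch" pattern in the disassembly).
static inline bool Intersect2D_Generic(const AABB& a, const AABB& b,
                                       int axis1, int axis2)
{
    const float* aMin = &a.mMin.x;  const float* aMax = &a.mMax.x;
    const float* bMin = &b.mMin.x;  const float* bMax = &b.mMax.x;
    if(bMax[axis1] < aMin[axis1] || aMax[axis1] < bMin[axis1]
    || bMax[axis2] < aMin[axis2] || aMax[axis2] < bMin[axis2])
        return false;
    return true;
}

// Hardcoded version: ".y" and ".z" are compile-time offsets, so the compiler
// can bake them into the address calculation ("eax+0Ch").
static inline bool Intersect2D(const AABB& a, const AABB& b)
{
    if(b.mMax.y < a.mMin.y || a.mMax.y < b.mMin.y
    || b.mMax.z < a.mMin.z || a.mMax.z < b.mMin.z)
        return false;
    return true;
}
```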

So we completely eliminate the need for passing axes to the function, or reading them from local variables. By reading “.x”, “.y” and “.z” directly, we give the compiler the freedom to simply hardcode the offsets in the address calculation. Let’s look at the resulting disassembly.

It looked like this before the change:

It looks like this after the change:

Did you spot the difference?

“0Ch” is the offset to go from min to max within the AABB class, so you can easily guess that eax in the first movss is the AABB address.

“ecx*4” disappeared entirely. That was the part dealing with the user-defined axes.

So we got the expected results:

  • the addressing mode became simpler (“eax+0Ch” instead of “eax+ecx*4+0Ch”)
  • we got one less read from the stack / one less instruction (“mov ecx, dword ptr [esp+14h]” vanished)
  • register pressure decreased (new code doesn’t use ecx, while initial code did)

The performance increase is minor in this case, but measurable:

Home PC:

Complete test (brute force): found 11811 intersections in 781460 K-cycles.
78971 K-cycles.
78157 K-cycles.
78216 K-cycles.
78319 K-cycles.
78140 K-cycles.
80163 K-cycles.
78618 K-cycles.
78194 K-cycles.
78182 K-cycles.
78308 K-cycles.
78140 K-cycles.
78252 K-cycles.
78385 K-cycles.
78489 K-cycles.
78404 K-cycles.
78620 K-cycles.
Complete test (box pruning): found 11811 intersections in 78140 K-cycles.

Office PC:

Complete test (brute force): found 11811 intersections in 824010 K-cycles.
75750 K-cycles.
76740 K-cycles.
75640 K-cycles.
75539 K-cycles.
75800 K-cycles.
78633 K-cycles.
80828 K-cycles.
82691 K-cycles.
74491 K-cycles.
74675 K-cycles.
75398 K-cycles.
73778 K-cycles.
75074 K-cycles.
79776 K-cycles.
73916 K-cycles.
74263 K-cycles.
Complete test (box pruning): found 11811 intersections in 73778 K-cycles.

The gains are summarized here:

Home PC

Version            Timings (K-Cycles)   Delta (K-Cycles)   Speedup   Overall X factor
(Version1)         (101662)             -                  -         -
Version2 - base    98822                0                  0%        1.0
Version3           93138                ~5600              ~5%       ~1.06
Version4           81834                ~11000             ~12%      ~1.20
Version5           78140                ~3600              ~4%       ~1.26

Office PC

Version            Timings (K-Cycles)   Delta (K-Cycles)   Speedup   Overall X factor
(Version1)         (96203)              -                  -         -
Version2 - base    92885                0                  0%        1.0
Version3           88352                ~4500              ~5%       ~1.05
Version4           77156                ~11000             ~12%      ~1.20
Version5           73778                ~3300              ~4%       ~1.25

This doesn’t look like much but it will have profound effects on further optimizations. This change in itself does not make a big difference, but it opens the door for more drastic optimizations like SIMD. We will come back to this in another blog post.

What we learnt:

Hardcoding things can make the code run faster.

From small beginnings come great things. Small optimizations can be worth doing just because they open the door for further, greater optimizations.

For now, before we wrap up this post, there was a bit of a debate left to deal with.

Is it worth the risk of hitting a case where sorting along X makes the code degenerate to O(n^2)?

Yes! Because all broadphase algorithms have bad cases that degenerate to O(n^2) anyway. Just put all the bounds at the origin (each box overlapping all other boxes) and the fastest broadphase algorithm in the world in this case is the brute-force O(n^2) approach. Everything else will have the overhead of the fancy pruning algorithms, that will not be able to prune anything, plus the exact same number of AABB overlap tests on top of it. So the existence of potential degenerate cases is not something to be overly worried about: they exist for all broadphase algorithms anyway.

Is it worth losing the genericity for such minor gains?

Yes! Because it turns out we didn’t entirely lose the genericity. There are two traditional ways to get the performance gains from hardcoding the inputs while keeping the genericity of the non-hardcoded version.

The first way is to use templates. You could rewrite this function with templates, and instantiate a different function for different values of the input axes. This works, but I don’t really like the approach. Moving all the code to headers gives me less control over inlining. And generating multiple versions of the same function can quickly bloat the code and silently increase the size of the exe for no valid reason, which in turn can make things slower simply by decreasing locality between callers and callees. These things happen, and these things matter - that’s why you have optimization options in the linker, to let users define the function order within the final executable.

The second way then, is to realize that X, Y and Z are just offsets within the user-defined array. And nothing forces you to put values there in this traditional order.

That is, if you want to sort along Z instead of X, just switch the X and Z values in your input bounds. You don’t even need to duplicate the function: just change the input data. Granted: it is not always possible to modify the input data when the bounds are, say, stored within a game object. Flipping the bounds coordinates there would make the box pruning function work with the preferred sorting axes, but it would also have an effect on every other piece of code accessing the same bounds. Which is not ideal.

So what’s the ideal?

Well, we just gave a hint of things to come when we mentioned tweaking the input data. Optimizing the code is one thing, but optimizing the data is equally important, if not more so. Next time we will implement a data-oriented optimization that will not only make the function a lot faster, but also give us back our genericity for no extra cost.

GitHub code for part 5
