Box pruning revisited - part 10 - integer SIMD

February 14th, 2017

Part 10 – integer SIMD

Interlude:

Similar to what we did in version 7, and for the sake of completeness, I tried using integer comparisons in the SIMD version as well.

The changes are straightforward: encode floats as integers like we did in version 7, replace SIMD floating-point intrinsics with SIMD integer intrinsics.
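To give an idea of the flavor of the change, here is a minimal sketch (the structure, variable names and mask value are assumptions, not the actual code): the bounds are encoded once at setup time, and the floating-point intrinsics in the inner loop are swapped for their integer counterparts.

```cpp
#include <emmintrin.h>          // SSE2 integer intrinsics

typedef unsigned int udword;    // the library's 32-bit unsigned type

// Hypothetical integer counterpart of the SIMD-friendly AABB from part 9b,
// filled at setup time with encoded (integer) bounds.
struct SIMD_AABB_I
{
    udword mMinX, mMaxX;
    udword mMinY, mMinZ, mMaxY, mMaxZ;   // four contiguous encoded values
};

// Inner-loop Y/Z test, integer flavor. Note that _mm_cmpgt_epi32 compares
// *signed* 32-bit integers, so the encoding must produce values whose signed
// order matches the original float order.
static inline int OverlapYZ(const __m128i a, const SIMD_AABB_I& candidate, int expectedMask)
{
    const __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&candidate.mMinY));
    const __m128i d = _mm_cmpgt_epi32(a, b);           // was _mm_cmpgt_ps
    return _mm_movemask_epi8(d)==expectedMask;         // was _mm_movemask_ps
}
```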

There are no special traps here, and not much to report, because at the end of the day the results are slower:

Home PC:

Complete test (brute force): found 11811 intersections in 781780 K-cycles.
17454 K-cycles.
16681 K-cycles.
16669 K-cycles.
17145 K-cycles.
16703 K-cycles.
16683 K-cycles.
16668 K-cycles.
16667 K-cycles.
16862 K-cycles.
16707 K-cycles.
16670 K-cycles.
16689 K-cycles.
16668 K-cycles.
16949 K-cycles.
16710 K-cycles.
16667 K-cycles.
Complete test (box pruning): found 11715 intersections in 16667 K-cycles.

Office PC:

Complete test (brute force): found 11811 intersections in 810906 K-cycles.
16607 K-cycles.
15927 K-cycles.
15648 K-cycles.
15971 K-cycles.
15648 K-cycles.
15960 K-cycles.
15742 K-cycles.
15990 K-cycles.
15837 K-cycles.
15741 K-cycles.
15970 K-cycles.
15651 K-cycles.
16247 K-cycles.
15649 K-cycles.
15834 K-cycles.
15738 K-cycles.
Complete test (box pruning): found 11715 intersections in 15648 K-cycles.

The gains are summarized here:

Home PC

| Version | Timings (K-Cycles) | Delta (K-Cycles) | Speedup | Overall X factor |
|---|---|---|---|---|
| (Version1) | (101662) | | | |
| Version2 - base | 98822 | 0 | 0% | 1.0 |
| Version3 | 93138 | ~5600 | ~5% | ~1.06 |
| Version4 | 81834 | ~11000 | ~12% | ~1.20 |
| Version5 | 78140 | ~3600 | ~4% | ~1.26 |
| Version6a | 60579 | ~17000 | ~22% | ~1.63 |
| Version6b | 41605 | ~18000 | ~31% | ~2.37 |
| (Version7) | (40906) | - | - | - |
| (Version8) | (31383) | (~10000) | (~24%) | (~3.14) |
| Version9a | 34486 | ~7100 | ~17% | ~2.86 |
| Version9b - unsafe | 32477 | ~2000 | ~5% | ~3.04 |
| Version9b - safe | 32565 | ~1900 | ~5% | ~3.03 |
| Version9c - unsafe | 16223 | ~16000 | ~50% | ~6.09 |
| Version9c - safe | 14802 | ~17000 | ~54% | ~6.67 |
| Version10 | (16667) | - | - | - |

Office PC

| Version | Timings (K-Cycles) | Delta (K-Cycles) | Speedup | Overall X factor |
|---|---|---|---|---|
| (Version1) | (96203) | | | |
| Version2 - base | 92885 | 0 | 0% | 1.0 |
| Version3 | 88352 | ~4500 | ~5% | ~1.05 |
| Version4 | 77156 | ~11000 | ~12% | ~1.20 |
| Version5 | 73778 | ~3300 | ~4% | ~1.25 |
| Version6a | 58451 | ~15000 | ~20% | ~1.58 |
| Version6b | 45634 | ~12000 | ~21% | ~2.03 |
| (Version7) | (43987) | - | - | - |
| (Version8) | (29083) | (~16000) | (~36%) | (~3.19) |
| Version9a | 31864 | ~13000 | ~30% | ~2.91 |
| Version9b - unsafe | 15097 | ~16000 | ~52% | ~6.15 |
| Version9b - safe | 15116 | ~16000 | ~52% | ~6.14 |
| Version9c - unsafe | 12707 | ~2300 | ~15% | ~7.30 |
| Version9c - safe | 12562 | ~2500 | ~16% | ~7.39 |
| Version10 | (15648) | - | - | - |

What we learnt:

Don’t bother with integer comparisons anymore.

We are still using integer SIMD in PhysX for this, so it appears that the PhysX version is sub-optimal. Expect some performance gains in PhysX 3.5.

GitHub code for part 10

Box pruning revisited - part 9c - data alignment

February 14th, 2017

Part 9c – data alignment

Last time we were left with a bit of a mystery. Our optimizations produced consistent results so far, on both machines the code was benchmarked on.

However things changed when switching to SIMD.

We have a naive SIMD implementation (9a) producing these results:

Home PC: 34486 K-cycles.

Office PC: 31864 K-cycles.

And an optimized SIMD implementation (9b) producing these results (for the “safe” version, but timings are similar for the unsafe one):

Home PC: 32565 K-cycles.

Office PC: 15116 K-cycles.

In other words, the change from naive to optimized SIMD had almost no effect on one machine, but made things run twice as fast on the other machine.

By now you should know what to do next: check the disassembly. Recall that the naive version looked like this:

The optimized version (9b) however looks much better:

As expected the scalar loads (movss) and unpacking instructions (unpcklps) all vanished, replaced with a single “proper” SIMD load (movups). In both cases the last 4 instructions are similar, and implement the comparison / test / branch that didn’t change between the naive & optimized functions.

In other words, since the timings are similar, it looks like on the home PC the single movups load is as costly as the large block of 14 movss/unpcklps from the naive version.

I suppose that’s a lesson in itself: not all instructions have the same cost. The number of instructions is not a reliable way to evaluate performance.

At this point you could fire up a profiler like V-Tune which would probably reveal what’s going on (I didn’t try), or you could just observe, formulate a theory, and design an experiment to test the theory. You know, the scientific method.

Observe:

movups xmm0,xmmword ptr [eax+8]

There’s only one line to observe really so it doesn’t take a rocket scientist to spot the only potential issue here: that u is for unaligned, and it should ideally be an a, for aligned.

That is, we should have a movaps instead of a movups here. In intrinsics-speak, we’ve been using _mm_loadu_ps while we should ideally have used _mm_load_ps.

Could unaligned loads have such an impact? Well that’s our theory now. Let’s test it.

We used unaligned loads because we had no choice: the size of our AABB classes was 24 bytes (6 floats). To be able to use aligned loads, we’d need a multiple of 16 bytes.

So we could add some padding bytes to reach sizeof(AABB)==32, but that would consume more memory, which means also more cache misses.

Or we could just split our AABB in two separate classes, like this:
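Something along these lines (a sketch with assumed member layouts; the SIMD_AABB_YZ name reappears further down in this post):

```cpp
// X values, only needed for the scalar test along the sorted axis.
struct SIMD_AABB_X
{
    float mMinX;
    float mMaxX;
};

// Y/Z values: exactly 16 bytes, so an array of these can be 16-byte aligned
// and fetched with a single aligned SIMD load.
__declspec(align(16)) struct SIMD_AABB_YZ
{
    float mMinY;
    float mMinZ;
    float mMaxY;
    float mMaxZ;
};
```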

We would then have to allocate two internal arrays instead of one to store the bounds:

On one hand we would then have to compute two addresses within our inner loop instead of one, and load data from two different memory locations instead of one.

That’s likely to be slower. On the other hand we would then be able to use aligned loads for the YZ part, and that should be a win.

In some cases the quickest way to know is simply to try. The disassembly becomes:

No movaps!

The compiler was able to merge the load directly into the cmp instruction instead, which amounts to the same thing: it is still an aligned load.

And the results apparently confirm our theory:

Home PC:

Unsafe version:

Complete test (brute force): found 11811 intersections in 781500 K-cycles.
17426 K-cycles.
16272 K-cycles.
16223 K-cycles.
16239 K-cycles.
16224 K-cycles.
16226 K-cycles.
16289 K-cycles.
16248 K-cycles.
16448 K-cycles.
16240 K-cycles.
16224 K-cycles.
16398 K-cycles.
16251 K-cycles.
16485 K-cycles.
16239 K-cycles.
16225 K-cycles.
Complete test (box pruning): found 11715 intersections in 16223 K-cycles.

Safe version:

Complete test (brute force): found 11811 intersections in 781569 K-cycles.
15629 K-cycles.
14813 K-cycles.
14841 K-cycles.
14827 K-cycles.
14803 K-cycles.
14933 K-cycles.
14821 K-cycles.
14803 K-cycles.
15048 K-cycles.
14833 K-cycles.
14803 K-cycles.
14817 K-cycles.
14802 K-cycles.
14802 K-cycles.
14802 K-cycles.
14998 K-cycles.
Complete test (box pruning): found 11811 intersections in 14802 K-cycles.

Office PC:

Unsafe version:

Complete test (brute force): found 11811 intersections in 826531 K-cycles.
13634 K-cycles.
12709 K-cycles.
12707 K-cycles.
13063 K-cycles.
12858 K-cycles.
13824 K-cycles.
13357 K-cycles.
13502 K-cycles.
12863 K-cycles.
13041 K-cycles.
12711 K-cycles.
12932 K-cycles.
12711 K-cycles.
12709 K-cycles.
12928 K-cycles.
12851 K-cycles.
Complete test (box pruning): found 11715 intersections in 12707 K-cycles.

Safe version:

Complete test (brute force): found 11811 intersections in 813914 K-cycles.
13582 K-cycles.
12562 K-cycles.
12825 K-cycles.
13048 K-cycles.
12806 K-cycles.
12562 K-cycles.
12915 K-cycles.
12627 K-cycles.
13216 K-cycles.
12899 K-cycles.
13546 K-cycles.
13163 K-cycles.
12572 K-cycles.
12769 K-cycles.
12662 K-cycles.
12938 K-cycles.
Complete test (box pruning): found 11811 intersections in 12562 K-cycles.

The gains are summarized here:

Home PC

| Version | Timings (K-Cycles) | Delta (K-Cycles) | Speedup | Overall X factor |
|---|---|---|---|---|
| (Version1) | (101662) | | | |
| Version2 - base | 98822 | 0 | 0% | 1.0 |
| Version3 | 93138 | ~5600 | ~5% | ~1.06 |
| Version4 | 81834 | ~11000 | ~12% | ~1.20 |
| Version5 | 78140 | ~3600 | ~4% | ~1.26 |
| Version6a | 60579 | ~17000 | ~22% | ~1.63 |
| Version6b | 41605 | ~18000 | ~31% | ~2.37 |
| (Version7) | (40906) | - | - | - |
| (Version8) | (31383) | (~10000) | (~24%) | (~3.14) |
| Version9a | 34486 | ~7100 | ~17% | ~2.86 |
| Version9b - unsafe | 32477 | ~2000 | ~5% | ~3.04 |
| Version9b - safe | 32565 | ~1900 | ~5% | ~3.03 |
| Version9c - unsafe | 16223 | ~16000 | ~50% | ~6.09 |
| Version9c - safe | 14802 | ~17000 | ~54% | ~6.67 |

Office PC

| Version | Timings (K-Cycles) | Delta (K-Cycles) | Speedup | Overall X factor |
|---|---|---|---|---|
| (Version1) | (96203) | | | |
| Version2 - base | 92885 | 0 | 0% | 1.0 |
| Version3 | 88352 | ~4500 | ~5% | ~1.05 |
| Version4 | 77156 | ~11000 | ~12% | ~1.20 |
| Version5 | 73778 | ~3300 | ~4% | ~1.25 |
| Version6a | 58451 | ~15000 | ~20% | ~1.58 |
| Version6b | 45634 | ~12000 | ~21% | ~2.03 |
| (Version7) | (43987) | - | - | - |
| (Version8) | (29083) | (~16000) | (~36%) | (~3.19) |
| Version9a | 31864 | ~13000 | ~30% | ~2.91 |
| Version9b - unsafe | 15097 | ~16000 | ~52% | ~6.15 |
| Version9b - safe | 15116 | ~16000 | ~52% | ~6.14 |
| Version9c - unsafe | 12707 | ~2300 | ~15% | ~7.30 |
| Version9c - safe | 12562 | ~2500 | ~16% | ~7.39 |

So the code on the home PC became about twice as fast as before, which makes it roughly as fast as what we got on the office PC for the previous version.

On the office PC, we only saved about 2500 K-Cycles, which means the aligned loads are indeed faster but not that much faster.

In any case, the important point is that our SIMD version is finally always faster than our best scalar version so far (version 8). Pheeeew!

Mystery solved?

Maybe, maybe not.

One question that pops up now is: why is the “safe” version measurably faster than the “unsafe” version on the home PC (14802 vs 16223), even though the safe version does a little bit more work? What’s going on here?

I was ready to move on but that stuff caught my eye. I couldn’t explain it so I stayed a bit longer. Observed. And designed a new test to confirm the theory that the mystery had indeed been solved.

If the unaligned loads were indeed responsible for making the code 2X faster, then switching back to movups in version 9c (while keeping the separate classes) should also bring back the performance of version 9b – on the home PC in particular.

But. It. Did. Not.

On the home PC, changing the _mm_load_ps to _mm_loadu_ps (while keeping the separate classes, i.e. while keeping the data buffers 16-byte aligned) gave these results:

Safe version:

Complete test (box pruning): found 11811 intersections in 16513 K-cycles.

Unsafe version:

Complete test (box pruning): found 11715 intersections in 19180 K-cycles.

i.e. the cost of unaligned loads for the safe version was 16513 - 14802 = 1711 K-Cycles.

And the cost of unaligned loads for the unsafe version was 19180 - 16223 = 2957 K-Cycles.

That last number is pretty similar to the cost of unaligned loads on the office PC (as we mentioned, ~2500 K-Cycles). So that number is pretty consistent, and seems to indicate the real cost of movups vs movaps: roughly 2000 to 3000 K-Cycles, on both machines.

But then…

why did we go from ~32000 to ~16000 K-Cycles between versions 9b and 9c on the home PC? Where is that coming from, if not from movups?

The mystery was not solved, after all.

New questions, new theories, new tests. The only other change we did was the switch from “new” to “_aligned_malloc“. The SIMD_AABB_YZ has a force-inlined empty ctor, so this couldn’t be a case of calling a slow compiler-generated ctor a large number of times (these things happen). Just to test the theory that it had something to do with this change though, I switched from “new” to “_aligned_malloc” in version 9b (without splitting the classes).

And lo and behold, this change alone made version 9b go from ~32000 K-Cycles to ~18000 K-Cycles at home (for both safe & unsafe versions).

So that’s where the big performance drop was coming from!

Once the culprit was identified, figuring out the “why” was a walk in the park. In short: the performance of movups depends on the alignment of the address it loads from.

In version 9b, on the home PC:

  • If the bounds array is aligned on a 4-byte boundary, the function takes ~33000 K-Cycles.
  • If the bounds array is aligned on an 8-byte boundary, the function takes ~19000 K-Cycles.
  • If the bounds array is aligned on a 16-byte boundary, the function takes ~18000 K-Cycles.

So there is a huge hit for the 4-byte boundary case, and we were previously hitting it without knowing. Switching to _aligned_malloc, or using __declspec(align(N)) on the SIMD_AABB class, fixes the issue in version 9b.
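As a sketch of what the allocation change looks like (names are assumptions), it is essentially a one-line swap:

```cpp
#include <malloc.h>   // _aligned_malloc / _aligned_free (MSVC)

// Before: SIMD_AABB* boxes = new SIMD_AABB[nbBoxes];
SIMD_AABB* boxes = (SIMD_AABB*)_aligned_malloc(sizeof(SIMD_AABB)*nbBoxes, 16);
// ... fill and parse the array exactly as before ...
_aligned_free(boxes);   // must match the aligned allocation
```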

Let’s recap for the home PC:

| Instruction | Alignment | Version | Timings (K-Cycles) |
|---|---|---|---|
| movups | data 4-byte aligned | single class (version9b) | 33000 |
| movups | data 8-byte aligned | single class (version9b) | 19000 |
| movups | data 16-byte aligned | single class (version9b) | 18000 |
| movups | data 16-byte aligned | separate classes (version9c safe) | 16000 |
| movups | data 16-byte aligned | separate classes (version9c unsafe) | 19000 |
| movaps | data 16-byte aligned | separate classes (version9c safe) | 14000 |
| movaps | data 16-byte aligned | separate classes (version9c unsafe) | 16000 |

What this all means is that the initial theory was actually correct (the performance cost was indeed because of unaligned loads), but there was a subtlety in there: there’s the cost of the movups instruction itself compared to movaps (which is constant at ~2000 / 3000 K-Cycles) and the cost of the data-loading with movups (which is variable and sometimes considerably more expensive).

We also see that the 16-byte aligned version 9b can be faster than version 9c using movups (18000 vs 19000). That is probably because, as we mentioned, version 9c now needs to compute two addresses.

And then finally, with movaps, the “safe” version 9c is still faster than the “unsafe” version 9c.

Why?

I still don’t know.

I would have to try V-Tune at that point. For now, I will be happy to admit that I don’t have an answer, and I don’t see what test I could add to find one.

In any case the performance difference is not large, and since the fastest version is also the safest, there is little motivation to go to the bottom of it.

But the main reason for not bothering is that, spoiler: that difference will vanish in the next post anyway.

What we learnt:

SIMD is hard. It took 3 versions for it to become faster than our scalar code.

The cost of unaligned loads can vary greatly depending on data alignment.

Data alignment has an impact on performance even with unaligned loads.

GitHub code for part 9c

Box pruning revisited - part 9b - better SIMD

February 14th, 2017

Part 9b – better SIMD

Last time we wrote an initial SIMD version that ended up slower than the scalar code. We then identified the _mm_set_ps intrinsics as a potential source of issues.

So today we want to replace _mm_set_ps with “real” SIMD loads: _mm_load_ps (for aligned loads) or _mm_loadu_ps (for unaligned loads). These are the intrinsics that each map to a single assembly instruction (respectively movaps and movups). They load 4 contiguous floats into a SIMD register, so we will need to reorder the data. If we go back to the initial C++ code and rearrange it so that all elements of the same box are on the same side, we get:

i.e. the a’s are all on the left side, and the b’s are all on the right side.

This is going to be tricky since we have a mix of “greater than” and “lesser than” comparisons in the middle. Let’s ignore it for now and focus on the loads.

For a given box, we want to load the y’s and z’s (both min and max) with a single load. Unfortunately the AABB class so far looks like this:

That is, switching back to plain floats:

With this format we cannot load the y’s and z’s at the same time, since they’re interleaved with the x’s. So, we change the format, and introduce this new AABB class:
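Something like this (a sketch; the member names are assumptions, but they match the register layouts shown below):

```cpp
struct SIMD_AABB
{
    float mMinX;    // the x's first...
    float mMaxX;
    float mMinY;    // ...so that the four values we need
    float mMinZ;    // (minY, minZ, maxY, maxZ) are contiguous in memory
    float mMaxY;
    float mMaxZ;
};
```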

It’s the same data as before, we just put the x’s first. That way the 4 floats we are interested in are stored contiguously in memory, and we can load them to an SSE register with a single assembly instruction. Which gives us this in-register layout, for two boxes:

| __m128 box0 (a) | mMinY | mMinZ | mMaxY | mMaxZ |
| __m128 box1 (b) | mMinY | mMinZ | mMaxY | mMaxZ |

Go back to the C++ code and you’ll see that we need to compare minimums of (a) to maximums of (b). So we need to reorder the elements for one of the two boxes.

Since one of the boxes (say box0) is constant for the duration of the loop, it makes more sense to reorder that constant box (only once), instead of doing the reordering inside the loop. This is done with a single shuffle instruction (_mm_shuffle_ps), and we end up with:

| __m128 box0 (a) | mMaxY | mMaxZ | mMinY | mMinZ |
| __m128 box1 (b) | mMinY | mMinZ | mMaxY | mMaxZ |

Just to see where we stand, let’s do a SIMD comparison (_mm_cmpgt_ps) between (a) and (b). This computes:

a.mMaxY > b.mMinY

a.mMaxZ > b.mMinZ

a.mMinY > b.mMaxY

a.mMinZ > b.mMaxZ

The two last lines are actually directly what we need:

a.mMinY > b.mMaxY => that’s (1)

a.mMinZ > b.mMaxZ => that’s (3)

The two other lines compute almost what we want, but not quite.

a.mMaxY > b.mMinY => that’s the opposite of (2)

a.mMaxZ > b.mMinZ => that’s the opposite of (4)

Well the opposite is not bad, it also gives us the information we need, just by flipping the result. The flipping is virtual and free: we are not going to actually do it. It just means that we test the movemask result against a different expected value. That is, if the previous expected value was, say, 1111b in binary, we can just flip two of the bits for which we get the opposite results, and test against, say 1100b instead. It gives us the same information, and the proper overlap test results…

…except in the equality case.

If a.mMaxY==b.mMinY for example:

  • the C++ code tests “a.mMaxY < b.mMinY” and returns 0, which is the desired reference result.
  • the SIMD code tests “a.mMaxY > b.mMinY” and also returns 0, which gives us a 1 after flipping.

Replacing the “greater than” comparison with “greater or equal” wouldn’t work, because what we would really need is a SIMD comparison that performs the former on two of the components, and the latter on the two others. So, the way it stands, the SIMD code is equivalent to this:
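In scalar form, and with hypothetical names, that is roughly:

```cpp
// No-overlap test actually performed by this first SIMD version: note the mix
// of ">" and "<=" resulting from the flipped comparisons.
if(a.mMinY > b.mMaxY || a.mMaxY <= b.mMinY
|| a.mMinZ > b.mMaxZ || a.mMaxZ <= b.mMinZ)
    continue;   // boxes are separated on Y or Z
// otherwise: report the overlap
```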

Almost what we wanted!

And in fact, we could very well accept the differences and stop here.

After all, the equality case is often a bit fuzzy. Don’t forget that we are dealing with floats here. It is bad form to use exact equality when comparing floats. When using the box pruning function for the broadphase in a physics engine, the equality case is meaningless since the bounding boxes are constantly evolving, recomputed each frame, and there is always a bit of noise in the results due to the nature of floating point arithmetic. The results will depend on how the compiler reorganized the instructions, whether x87 or SIMD is used, it will depend on the internal FPU accuracy of x87 registers, i.e. the state of the FPU control word, and so on. Basically the case where the resulting bounding boxes are exactly touching is so ill-defined that it really doesn’t make any difference if the SIMD code uses “>” or “>=” on some of the components.

If we accept this limitation, we can run the test again and see the results:

Home PC:

Complete test (brute force): found 11811 intersections in 781912 K-cycles.
33554 K-cycles.
32514 K-cycles.
32803 K-cycles.
32498 K-cycles.
32486 K-cycles.
32487 K-cycles.
32499 K-cycles.
32488 K-cycles.
32669 K-cycles.
32672 K-cycles.
32855 K-cycles.
32537 K-cycles.
32504 K-cycles.
32477 K-cycles.
32674 K-cycles.
32505 K-cycles.
Complete test (box pruning): found 11725 intersections in 32477 K-cycles.

Office PC:

Complete test (brute force): found 11811 intersections in 808659 K-cycles.
15775 K-cycles.
15266 K-cycles.
15097 K-cycles.
15302 K-cycles.
15259 K-cycles.
15554 K-cycles.
15560 K-cycles.
16137 K-cycles.
15709 K-cycles.
15109 K-cycles.
15317 K-cycles.
15149 K-cycles.
15238 K-cycles.
15239 K-cycles.
15525 K-cycles.
15097 K-cycles.
Complete test (box pruning): found 11725 intersections in 15097 K-cycles.

The gains are summarized here:

Home PC

| Version | Timings (K-Cycles) | Delta (K-Cycles) | Speedup | Overall X factor |
|---|---|---|---|---|
| (Version1) | (101662) | | | |
| Version2 - base | 98822 | 0 | 0% | 1.0 |
| Version3 | 93138 | ~5600 | ~5% | ~1.06 |
| Version4 | 81834 | ~11000 | ~12% | ~1.20 |
| Version5 | 78140 | ~3600 | ~4% | ~1.26 |
| Version6a | 60579 | ~17000 | ~22% | ~1.63 |
| Version6b | 41605 | ~18000 | ~31% | ~2.37 |
| (Version7) | (40906) | - | - | - |
| (Version8) | (31383) | (~10000) | (~24%) | (~3.14) |
| Version9a | 34486 | ~7100 | ~17% | ~2.86 |
| Version9b | 32477 | ~2000 | ~5% | ~3.04 |

Office PC

| Version | Timings (K-Cycles) | Delta (K-Cycles) | Speedup | Overall X factor |
|---|---|---|---|---|
| (Version1) | (96203) | | | |
| Version2 - base | 92885 | 0 | 0% | 1.0 |
| Version3 | 88352 | ~4500 | ~5% | ~1.05 |
| Version4 | 77156 | ~11000 | ~12% | ~1.20 |
| Version5 | 73778 | ~3300 | ~4% | ~1.25 |
| Version6a | 58451 | ~15000 | ~20% | ~1.58 |
| Version6b | 45634 | ~12000 | ~21% | ~2.03 |
| (Version7) | (43987) | - | - | - |
| (Version8) | (29083) | (~16000) | (~36%) | (~3.19) |
| Version9a | 31864 | ~13000 | ~30% | ~2.91 |
| Version9b | 15097 | ~16000 | ~52% | ~6.15 |

WO-AH.

Ok, pause, that’s too much weirdness at the same time.

First, the reported number of intersections is wrong now (11725 vs 11811). That code is broken!

Second, the code is not much faster than the naive version 9a at home. Why?!

Third, the code is now 2X faster than the naive version on the office PC. What’s going on?!

Well, that’s SIMD for you. Things can become a little bonkers.

Let’s talk about the number of overlapping pairs first. It should be expected that the numbers are going to be different between the brute-force and optimized versions now. The brute-force version still uses this:

While the optimized version effectively uses this:

It doesn’t compute the same thing, as we previously discussed, so, obviously, the reported number of overlapping pairs can be different. It simply means that some of the boxes in the set are exactly touching.

Beyond this, there is however a much more subtle reason at play here. If you change the AABB::Intersect() function so that it handles the equality case like the SIMD code, you still don’t get the same number of pairs. That’s because the SIMD code has another issue we didn’t mention so far: it is not symmetric anymore. That is, Intersect(A, B) != Intersect(B, A) in the equality case. Imagine two boxes for which a.mMaxY == b.mMinY. Since the code tests “a.mMaxY <= b.mMinY”, we get a positive result here. Now swap a and b. We now have b.mMaxY == a.mMinY, but the code tests “a.mMinY > b.mMaxY”, which returns a negative result. What this means is that the results will depend on the order in which the boxes are tested by the functions.

For the brute-force function we always use the same order, and who is “a” or “b” never changes. But the optimized code sorts the boxes according to X, so who is “a” or “b” changes depending on the box positions. So we cannot easily compare the two versions anymore, which makes testing and validating the code a bit problematic. One way to do it is to explicitly avoid the equality case in the input set, in which case the returned number of intersections becomes the same again for the two functions. Another way to do it is to pre-sort the boxes along X, to ensure that the brute-force code will process the boxes in the same order as the SIMD code. If you do all that (change the AABB::Intersect() function, pre-sort the boxes), the validity tests always pass and the number of pairs is always the same for the two functions. The code is not broken!

However, you may find these difficulties a bit too much to swallow. Fair enough. Since the SIMD function is harder to test and validate than before, we labeled it “unsafe”. And before discussing the performance results, we’re going to see how to make it “safe”, i.e. how to make it compute exactly the same overlaps as the scalar code, in all cases.

Let’s go back to the initial scalar code:

Right, so the problem was that with a’s and b’s on the same side, we get a mix of “<” and “>” operators. To fix this, we could just multiply some of the components by -1. Which would give:
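A sketch of the idea with hypothetical names: negate both sides of the “lesser than” tests, and every comparison becomes a “greater than”, which fits a single _mm_cmpgt_ps.

```cpp
// Original no-overlap test (mixed operators):
//   a.mMinY > b.mMaxY || a.mMaxY < b.mMinY || a.mMinZ > b.mMaxZ || a.mMaxZ < b.mMinZ
// Multiply both sides of each "<" test by -1.0 and it flips to ">":
if(a.mMinY > b.mMaxY || -a.mMaxY > -b.mMinY
|| a.mMinZ > b.mMaxZ || -a.mMaxZ > -b.mMinZ)
    continue;   // no overlap
```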

And… that just works. It adds some overhead though, in particular one extra load and one extra multiply within the inner loop (since we need to fix both boxes).

But we can improve this. Since version 6b we are parsing internal bounds arrays in the main loop, rather than the user’s input array. So we are free to store the bounds in any form we want, and in particular we can pre-multiply mMin.y and mMin.z by -1.0. That way we don’t need to do the multiply on box (b) within the inner loop.

On the other hand we still need to flip the signs of mMax.y and mMax.z for box (a), and since we pre-multiplied the mins (which affects box (a) as well, since it comes from the same buffer), the net result is that we now need to multiply the whole box (a) by -1.0 at runtime. This makes the code simpler, and since box (a) is constant for the duration of the inner loop, the impact on performance is minimal. In the end, the fix to go from the unsafe version to the safe version is just two additional instructions (_mm_load1_ps + _mm_mul_ps) on the constant box. That’s all!
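A minimal sketch of those two extra instructions (variable names are assumptions):

```cpp
#include <xmmintrin.h>

static const float gMinusOne = -1.0f;

// Box (a) is constant for the whole inner loop, so negating it once is cheap.
// The internal buffers already store mMin.y / mMin.z pre-multiplied by -1.0,
// so after this multiply all four tests can be done with a single _mm_cmpgt_ps.
const __m128 minusOne = _mm_load1_ps(&gMinusOne);
const __m128 box0YZ   = _mm_mul_ps(_mm_loadu_ps(&box0.mMinY), minusOne);
```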

And the impact on performance is invisible:

Home PC:

Complete test (brute force): found 11811 intersections in 781499 K-cycles.
34077 K-cycles.
32779 K-cycles.
32584 K-cycles.
33237 K-cycles.
32656 K-cycles.
32618 K-cycles.
33197 K-cycles.
32662 K-cycles.
32757 K-cycles.
32582 K-cycles.
32574 K-cycles.
32579 K-cycles.
32592 K-cycles.
32565 K-cycles.
32867 K-cycles.
32591 K-cycles.
Complete test (box pruning): found 11811 intersections in 32565 K-cycles.

Office PC:

Complete test (brute force): found 11811 intersections in 822472 K-cycles.
16558 K-cycles.
16051 K-cycles.
15127 K-cycles.
15324 K-cycles.
15667 K-cycles.
15169 K-cycles.
15117 K-cycles.
15324 K-cycles.
15117 K-cycles.
15902 K-cycles.
16729 K-cycles.
15860 K-cycles.
15937 K-cycles.
15218 K-cycles.
15356 K-cycles.
15116 K-cycles.
Complete test (box pruning): found 11811 intersections in 15116 K-cycles.

The gains are summarized here:

Home PC

| Version | Timings (K-Cycles) | Delta (K-Cycles) | Speedup | Overall X factor |
|---|---|---|---|---|
| (Version1) | (101662) | | | |
| Version2 - base | 98822 | 0 | 0% | 1.0 |
| Version3 | 93138 | ~5600 | ~5% | ~1.06 |
| Version4 | 81834 | ~11000 | ~12% | ~1.20 |
| Version5 | 78140 | ~3600 | ~4% | ~1.26 |
| Version6a | 60579 | ~17000 | ~22% | ~1.63 |
| Version6b | 41605 | ~18000 | ~31% | ~2.37 |
| (Version7) | (40906) | - | - | - |
| (Version8) | (31383) | (~10000) | (~24%) | (~3.14) |
| Version9a | 34486 | ~7100 | ~17% | ~2.86 |
| Version9b - unsafe | 32477 | ~2000 | ~5% | ~3.04 |
| Version9b - safe | 32565 | ~1900 | ~5% | ~3.03 |

Office PC

| Version | Timings (K-Cycles) | Delta (K-Cycles) | Speedup | Overall X factor |
|---|---|---|---|---|
| (Version1) | (96203) | | | |
| Version2 - base | 92885 | 0 | 0% | 1.0 |
| Version3 | 88352 | ~4500 | ~5% | ~1.05 |
| Version4 | 77156 | ~11000 | ~12% | ~1.20 |
| Version5 | 73778 | ~3300 | ~4% | ~1.25 |
| Version6a | 58451 | ~15000 | ~20% | ~1.58 |
| Version6b | 45634 | ~12000 | ~21% | ~2.03 |
| (Version7) | (43987) | - | - | - |
| (Version8) | (29083) | (~16000) | (~36%) | (~3.19) |
| Version9a | 31864 | ~13000 | ~30% | ~2.91 |
| Version9b - unsafe | 15097 | ~16000 | ~52% | ~6.15 |
| Version9b - safe | 15116 | ~16000 | ~52% | ~6.14 |

This is basically the same speed as the unsafe version. Yay!

We are still going to keep the two unsafe and safe versions around for a while, in case further optimizations work better with one of them.

Note how all of this came together thanks to previous optimizations:

  • We were able to use a SIMD comparison because we dropped the user-defined axes in version 5.
  • We were able to create a SIMD-friendly AABB class because we now parse internal buffers instead of the user’s buffers, after version 6b.
  • For the same reason, we were able to go from the unsafe version to the safe version “for free”: we could store our boxes pre-flipped. This was made possible by version 6b as well.

That’s quite enough for one post so we will investigate the remaining mystery next time: why are the timings so different between the home PC and the office PC?

What we learnt:

The same SIMD code can have very different performance on different machines.

Therefore it is important to test the results on different machines. If you only tested this on the home PC, after writing two SIMD versions (9a / 9b) that end up slower than a scalar version (8), you might be tempted to give up. You need to see the results on the office PC to remain motivated.

There’s a ripple effect with optimizations: some of them may have value not because of their limited performance gains, but simply because they open the door for more optimizations further down the road.

GitHub code for part 9b

Box pruning revisited - part 9a - SIMD

February 14th, 2017

Part 9a - SIMD

For this post, we start again from version 6b, ignoring our experiments with integer comparisons and branches in versions 7 and 8. That is, we go back to this version:

The reason for doing so is simply that the above function is easier to work with: these 4 comparisons are a perfect fit for SIMD. So we will try to replace them with a single SIMD comparison, and thus, a single branch. We can skip the work we did in version 8 since SIMD will replicate this naturally.

After the modification we did in part 5, we now read y’s and z’s directly there, and the code is ripe for SIMDifying. It would have been much more difficult to see or consider doing with the initial implementation, where the axes came from a user-defined parameter. So this is the moment where we harvest the “great things coming from small beginnings”.

Doing so is not entirely trivial though. There is SIMD and SIMD. Converting something to SIMD does not guarantee performance gains. A naive conversion can in fact easily be slower than the scalar code you started from.

But let’s give it a try. We want to compute:

So we could just load these elements in two SSE registers, do a single SSE comparison, use movemask to get the results, and… done? The modifications are very simple, just replace this:

With this:
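A sketch of what the naive replacement looks like (variable names are assumptions): build the two registers with _mm_set_ps, compare once, and test the movemask.

```cpp
#include <xmmintrin.h>

// Each lane holds one "min > max" test between box0 and the candidate box,
// so the two boxes overlap on Y and Z iff no lane of the comparison is set.
const __m128 mins = _mm_set_ps(candidate.mMin.z, candidate.mMin.y, box0.mMin.z, box0.mMin.y);
const __m128 maxs = _mm_set_ps(box0.mMax.z, box0.mMax.y, candidate.mMax.z, candidate.mMax.y);
if(_mm_movemask_ps(_mm_cmpgt_ps(mins, maxs))==0)
{
    // overlap => report the pair
}
```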

It is straightforward and it works:

Home PC:

Complete test (brute force): found 11811 intersections in 781594 K-cycles.
38231 K-cycles.
34532 K-cycles.
34598 K-cycles.
34996 K-cycles.
34546 K-cycles.
34486 K-cycles.
34649 K-cycles.
34510 K-cycles.
34709 K-cycles.
34533 K-cycles.
34496 K-cycles.
34621 K-cycles.
34556 K-cycles.
34867 K-cycles.
35392 K-cycles.
34666 K-cycles.
Complete test (box pruning): found 11811 intersections in 34486 K-cycles.

Office PC:

Complete test (brute force): found 11811 intersections in 810655 K-cycles.
32847 K-cycles.
32021 K-cycles.
32046 K-cycles.
31932 K-cycles.
32283 K-cycles.
32139 K-cycles.
31864 K-cycles.
32325 K-cycles.
33420 K-cycles.
32087 K-cycles.
31969 K-cycles.
32018 K-cycles.
32879 K-cycles.
32588 K-cycles.
32214 K-cycles.
31875 K-cycles.
Complete test (box pruning): found 11811 intersections in 31864 K-cycles.

The gains are summarized here:

Home PC

| Version | Timings (K-Cycles) | Delta (K-Cycles) | Speedup | Overall X factor |
|---|---|---|---|---|
| (Version1) | (101662) | | | |
| Version2 - base | 98822 | 0 | 0% | 1.0 |
| Version3 | 93138 | ~5600 | ~5% | ~1.06 |
| Version4 | 81834 | ~11000 | ~12% | ~1.20 |
| Version5 | 78140 | ~3600 | ~4% | ~1.26 |
| Version6a | 60579 | ~17000 | ~22% | ~1.63 |
| Version6b | 41605 | ~18000 | ~31% | ~2.37 |
| (Version7) | (40906) | - | - | - |
| (Version8) | (31383) | (~10000) | (~24%) | (~3.14) |
| Version9a | 34486 | ~7100 | ~17% | ~2.86 |

Office PC

| Version | Timings (K-Cycles) | Delta (K-Cycles) | Speedup | Overall X factor |
|---|---|---|---|---|
| (Version1) | (96203) | | | |
| Version2 - base | 92885 | 0 | 0% | 1.0 |
| Version3 | 88352 | ~4500 | ~5% | ~1.05 |
| Version4 | 77156 | ~11000 | ~12% | ~1.20 |
| Version5 | 73778 | ~3300 | ~4% | ~1.25 |
| Version6a | 58451 | ~15000 | ~20% | ~1.58 |
| Version6b | 45634 | ~12000 | ~21% | ~2.03 |
| (Version7) | (43987) | - | - | - |
| (Version8) | (29083) | (~16000) | (~36%) | (~3.19) |
| Version9a | 31864 | ~13000 | ~30% | ~2.91 |

The gains for version 9a are relative to version 6b (we ignored versions 7 and 8). And we see something interesting here: our SIMD version “worked”, it did provide clear gains compared to version 6b, but… it remains slower than our branchless scalar version, version 8.

Oops.

This is the source of never-ending myths and confusion about SIMD on PC: if you started from version 6b, you are likely to conclude that adding SIMD is easy and provides easy gains. If you started from version 8, you are likely to conclude that SIMD is slower than “optimized” scalar code, and not worth the trouble. And if you started from version 8, “optimized” the code with SIMD, and concluded without benchmarking that Version9a was faster, then you are a bloody fool.

In any case, you’d all be wrong. As previously mentioned, there is SIMD and SIMD. Our initial attempt here is just a little bit rough and, indeed, suboptimal.

How do we see it?

As usual: always check the disassembly!

If you do, you will discover that the above SIMD code, which is what I would call the naive version, gave birth to this monstrosity:

Soooo… errrr… fair enough: that’s only one comparison (cmpltps) and one branch (test/jne) indeed. But that’s still quite a bit more than what we asked for.

In particular, one may wonder where the unpack instructions (unpcklps) are coming from - we certainly didn’t use _mm_unpacklo_ps in our code!

So what’s going on?

Well, what’s going on is that we fell into the intrinsics trap. There is usually a one-to-one mapping between intrinsics and assembly instructions, but not always. Some of them are “composite” intrinsics that can generate an arbitrary number of assembly instructions - whatever it takes to implement the desired action. This is the case for _mm_set_ps, and this is where all the movss and unpcklps are coming from. The two _mm_set_ps intrinsics generated 14 assembly instructions here. A lot of people assume that using intrinsics is “the same” as writing in assembly directly: it is, until it isn’t. Sometimes it hides a lot of important details behind syntactic sugar, and that can bite you hard if you are not aware of it.

So here’s our lesson: if your SIMD code uses _mm_set_ps, it is probably not optimal.

While we’re at it, there is another lesson to learn here: the compiler won’t do the SIMD job for you. We did not have to reorder the data here, we did not have to align structures on 16-byte boundaries, we did not have to do complicated things, the change was as straightforward as it gets.

And yet, even in this simple case, with the /SSE2 flag enabled, the compiler still didn’t generate the naive SIMD version on its own. So once again: people claiming that using the /SSE2 flag will automatically make the code “4x faster” (as I read online a few times) are tragically, comically wrong. For that kind of speedup, you need to work hard.

And work hard, we will.

Back to the drawing board.

What we learnt:

Writing something in SIMD does not guarantee better performance than a scalar version.

There’s no 1:1 mapping between intrinsics and assembly instructions when you use “composite” intrinsics.

Don’t use _mm_set_ps.

The compiler will not even write naive SIMD for you.

GitHub code for part 9a

Box pruning revisited - part 8 - branchless overlap test

February 14th, 2017

Part 8 – branchless overlap test

For this post, we start again from version 6b, ignoring our experiments with integer comparisons in version 7. We are going to stay focused on comparisons though, and looking at this function:

It contains four comparisons, and four branches. The box values are going to be arbitrary, so these branches are the worst you can get: branch prediction will essentially never work for them. If we check the disassembly, it looks like this:

Pretty much what you’d expect from the C++ code, and all the comparisons and branches are there indeed.

Comparisons and branches. We often talk about these two as if they would always go hand in hand and come together, but they are very different things and it’s important to distinguish between the two. In terms of performance, one of them is more expensive than the other.

Let me demonstrate. We can rewrite the overlap function this way:

Compare it to the previous function and you can easily see that the comparisons are still there in the C++ code. We still compare max floats to min floats, four times; none of that has changed. We just wrote the code slightly differently.

Compile it, run it, and… you get exactly the same timings as before. In fact, you get exactly the same disassembly. Thus, the branches are also all there. Well really, that makes sense.

Now, however, for something that perhaps does not make as much sense, remove three characters from this function and write it this way instead:
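With hypothetical names, the whole change is in how the four comparison results are combined: bitwise | instead of logical ||.

```cpp
// Branchless-friendly version: using | forces all four comparisons to be
// evaluated, which lets the compiler use setcc instead of four branches.
static inline bool Intersect2D(const AABB& a, const AABB& b)
{
    const bool separated = (b.mMax.y < a.mMin.y) | (a.mMax.y < b.mMin.y)
                         | (b.mMax.z < a.mMin.z) | (a.mMax.z < b.mMin.z);
    return !separated;
}
```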

Pretty much the same, right?

Well, not quite. The compiler begs to differ, and produces the following disassembly:

The comparisons are still here (the “comiss”) but the branches are mostly gone: there’s only one branch instead of four. That’s all thanks to the “seta” instructions, which can store the results of our comparisons directly in a register, in a branchless way. Technically the final branch is not even in the overlap function itself, it’s just part of the calling code that uses the returned value. The overlap function itself became branchless.

Using | instead of || told the compiler it was necessary to perform all comparisons to create the final bool, while the previous version had early-exits (and sometimes only performed one comparison instead of four). The new code is also larger, since it needs to concatenate the bools to create the final return value. But using more instructions does not always mean slower results:

Home PC

Complete test (brute force): found 11811 intersections in 778666 K-cycles.
32497 K-cycles.
31411 K-cycles.
31402 K-cycles.
31637 K-cycles.
31415 K-cycles.
31440 K-cycles.
31404 K-cycles.
31412 K-cycles.
31388 K-cycles.
31596 K-cycles.
31411 K-cycles.
31383 K-cycles.
31595 K-cycles.
31405 K-cycles.
31383 K-cycles.
31392 K-cycles.
Complete test (box pruning): found 11811 intersections in 31383 K-cycles.

Office PC:

Complete test (brute force): found 11811 intersections in 819311 K-cycles.
29897 K-cycles.
29563 K-cycles.
29620 K-cycles.
29430 K-cycles.
29641 K-cycles.
29352 K-cycles.
29363 K-cycles.
29305 K-cycles.
30214 K-cycles.
29538 K-cycles.
31417 K-cycles.
30416 K-cycles.
30112 K-cycles.
30443 K-cycles.
30105 K-cycles.
29083 K-cycles.
Complete test (box pruning): found 11811 intersections in 29083 K-cycles.

The gains are summarized here:

Home PC

| Version | Timings (K-Cycles) | Delta (K-Cycles) | Speedup | Overall X factor |
|---|---|---|---|---|
| (Version1) | (101662) | | | |
| Version2 - base | 98822 | 0 | 0% | 1.0 |
| Version3 | 93138 | ~5600 | ~5% | ~1.06 |
| Version4 | 81834 | ~11000 | ~12% | ~1.20 |
| Version5 | 78140 | ~3600 | ~4% | ~1.26 |
| Version6a | 60579 | ~17000 | ~22% | ~1.63 |
| Version6b | 41605 | ~18000 | ~31% | ~2.37 |
| (Version7) | (40906) | - | - | - |
| Version8 | 31383 | ~10000 | ~24% | ~3.14 |

Office PC

| Version | Timings (K-Cycles) | Delta (K-Cycles) | Speedup | Overall X factor |
|---|---|---|---|---|
| (Version1) | (96203) | | | |
| Version2 - base | 92885 | 0 | 0% | 1.0 |
| Version3 | 88352 | ~4500 | ~5% | ~1.05 |
| Version4 | 77156 | ~11000 | ~12% | ~1.20 |
| Version5 | 73778 | ~3300 | ~4% | ~1.25 |
| Version6a | 58451 | ~15000 | ~20% | ~1.58 |
| Version6b | 45634 | ~12000 | ~21% | ~2.03 |
| (Version7) | (43987) | - | - | - |
| Version8 | 29083 | ~16000 | ~36% | ~3.19 |

Well, we are now 3 times faster than when we started. That’s another milestone I suppose.

It is important to note that going branchless does not always pay off. Branches are expensive if mispredicted, but cheap if correctly predicted. Also, a branch that early-exits and skips an expensive part of the code can still be cheaper overall than running the expensive part of the code. It’s obvious, but multiple times I’ve seen people rewrite functions in a branchless way, and just assume that the result was better - without benchmarking. Unfortunately the results were often slower. There is absolutely no guarantee here. If you blindly follow a rule or a recipe, removing branches just for the sake of removing branches, you won’t go very far. This is a case where you need to test, profile, use your brain.

What we learnt:

Comparisons and branches are two different things.

Mispredicted branches are costly, rewriting in a branchless way can help here.

Don’t assume the results will be faster. Profile the difference.

The C++ versions can look very similar (only 3 extra characters) but have a very different disassembly, and very different performance. You need to look past the C++; what matters is the generated code.

Are we done?

No!

Next time we will finally attack the SIMD version.

GitHub code for part 8

Box pruning revisited - part 7 - integer comparisons

February 14th, 2017

Part 7 – integer comparisons

I spent a lot of time using PIX on the Xbox360. This is a profiler tool, which incidentally was just released on PC (I have not tried it yet, so I don’t know how it compares to the Xbox version).

On the Xbox it typically identified and reported 3 main problems: L2 cache misses, FCMPs, and LHS.

Well, granted: it reported a lot more than this, but these guys were the typical issues one had to pay attention to.

We just took care of L2 cache misses, or data cache misses, in the previous posts.

LHS (for Load-Hit-Store) were a nuisance specific to the Xbox360 that we will ignore for now. They do exist on PC, but their performance penalty is not as severe as on the previous-gen Xbox.

So that leaves the FCMPs, i.e. the float comparisons. Weird thing, the FCMPs. They used to be very costly on old Pentiums IIRC. And then I think for a while they were ok. And then with the Xbox360 suddenly they were bad again. These things come and go.

This gave birth to a whole bunch of optimizations where FCMPs got replaced with regular integer comparisons, which were faster. It made a huge difference on the 360, and I kept doing this kind of stuff as a habit. But to be fair, this may not be needed anymore on modern PCs. And that’s what we are going to investigate today.

We have to do it now, before starting the SIMD work, because it might also affect the SIMD code. There is floating point SIMD and integer SIMD, and with SSE2 the instruction sets are quite different. It is unclear if there would be a benefit using one or the other at this point, so we will investigate, and maybe try both.

But SIMD will bring some complications to the table, so it is easier to start slowly and first play with integer comparisons while the code is still scalar.

I already wrote about this issue in the already mentioned SAP document. Go read Appendix A this time, page 22. I’ll wait.

Back? Good.

As mentioned in the document, the changes are quite straightforward. We replace our AABB with its integer counterpart:

Then we encode its min & max values using the function from the SAP document:
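It is the classic IEEE-754 trick; a sketch of it (udword being the library’s 32-bit unsigned type):

```cpp
#include <string.h>   // memcpy

typedef unsigned int udword;

// Maps a float to an unsigned integer whose ordering matches the float ordering,
// so that unsigned integer comparisons give the same results as float comparisons.
static udword encodeFloat(float value)
{
    udword ir;
    memcpy(&ir, &value, sizeof(ir));   // reinterpret the IEEE-754 bits
    if(ir & 0x80000000)                // negative float: flip all bits
        ir = ~ir;
    else                               // positive float: set the sign bit
        ir |= 0x80000000;
    return ir;
}
```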

And then we just replace some “float” with “udword” in the main loop, float comparisons silently become integer comparisons, and that’s pretty much it really. Once you understand the encoding function, the modifications are trivial. Well, as long as we’re not using SIMD at least.

The results are as follows:

Home PC:

Complete test (brute force): found 11811 intersections in 782125 K-cycles.
41727 K-cycles.
41126 K-cycles.
40956 K-cycles.
41431 K-cycles.
41042 K-cycles.
41169 K-cycles.
40928 K-cycles.
40916 K-cycles.
40912 K-cycles.
40906 K-cycles.
41116 K-cycles.
40927 K-cycles.
40911 K-cycles.
40969 K-cycles.
41143 K-cycles.
41251 K-cycles.
Complete test (box pruning): found 11811 intersections in 40906 K-cycles.

Office PC:

Complete test (brute force): found 11811 intersections in 823453 K-cycles.
44674 K-cycles.
45127 K-cycles.
45929 K-cycles.
44652 K-cycles.
44192 K-cycles.
44502 K-cycles.
45403 K-cycles.
44062 K-cycles.
44859 K-cycles.
44060 K-cycles.
44679 K-cycles.
43987 K-cycles.
44537 K-cycles.
44417 K-cycles.
44842 K-cycles.
44395 K-cycles.
Complete test (box pruning): found 11811 intersections in 43987 K-cycles.

The gains are summarized here:

Home PC

| Version | Timings (K-Cycles) | Delta (K-Cycles) | Speedup | Overall X factor |
|---|---|---|---|---|
| (Version1) | (101662) | | | |
| Version2 - base | 98822 | 0 | 0% | 1.0 |
| Version3 | 93138 | ~5600 | ~5% | ~1.06 |
| Version4 | 81834 | ~11000 | ~12% | ~1.20 |
| Version5 | 78140 | ~3600 | ~4% | ~1.26 |
| Version6a | 60579 | ~17000 | ~22% | ~1.63 |
| Version6b | 41605 | ~18000 | ~31% | ~2.37 |
| Version7 | 40906 | - | - | - |

Office PC

| Version | Timings (K-Cycles) | Delta (K-Cycles) | Speedup | Overall X factor |
|---|---|---|---|---|
| (Version1) | (96203) | | | |
| Version2 - base | 92885 | 0 | 0% | 1.0 |
| Version3 | 88352 | ~4500 | ~5% | ~1.05 |
| Version4 | 77156 | ~11000 | ~12% | ~1.20 |
| Version5 | 73778 | ~3300 | ~4% | ~1.25 |
| Version6a | 58451 | ~15000 | ~20% | ~1.58 |
| Version6b | 45634 | ~12000 | ~21% | ~2.03 |
| Version7 | 43987 | - | - | - |

Sooooo….. That one is not very conclusive. It appears to be a tiny bit faster, but this is far from the speedups we used to get on the Xbox.

I would say that the jury is still out on this one, but either way, the speedups are too small to justify the trouble at this point. So we will ignore integer comparisons for now and perhaps revisit this later with SIMD.

The results are curious though, because this stuff used to matter a lot. What happened?

Again, you can design a small experiment to reveal the truth. The only difference between version 6b and version 7 is the switch from floating-point to integer-comparisons.

So what we can do is compile both without the SSE2 flag, i.e. with the explicit /arch:IA32 flag to ensure that x87 code is used.

Here are the results on the home PC:

| Configuration | Timings (K-Cycles) |
|---|---|
| Version 6b with /SSE2 flag | 41605 |
| Version 6b with x87 code | 52163 |
| Version 7 with /SSE2 flag | 40906 |
| Version 7 with x87 code | 40897 |

Ah-ah! So that was it, look at this!

With the x87 code, switching from float to integer comparisons has a huge impact.

With the SSE2 code, switching from float to integer comparisons has almost no impact.

Using integer comparisons with the x87 code gave the same speedup as switching to SSE2 code with float comparisons.

So that’s what happened: using integer comparisons gave big speedups in the past with x87 code (and even more so on the Xbox 360), but compiling with the /SSE2 flag gives the same speedups out-of-the-box, making this optimization obsolete.

And that’s our lesson of the day: don’t assume anything. Question everything. Constantly revisit your previous conclusions.

We will now ignore version 7 entirely: the minor gains are not worth the trouble.

What we learnt:

Replacing float comparisons with integer comparisons in scalar code might still provide some minor gains, but not as much as in the past. Using the SSE2 flag now gives essentially the same speedups out-of-the-box.

One great optimization one day on one platform might become worthless (or even detrimental) another day, on another platform.

You can learn something even from a “failed” optimization. Things that did not work are also worth reporting and talking about.

Ok! With this question out of the way, we’re ready to tackle the SIMD version.

See you next time.

GitHub code for part 7

Box pruning revisited - part 6b - less cache misses

February 14th, 2017

Part 6b – less cache misses

Last time we took care of our input data, and packed it in a nice flat linear array. This is necessary but not sufficient to guarantee high performance: it only helps if you actually read that input array sequentially. But if you read random boxes within the input array, you still get cache misses.

But do we still get cache misses? We got great speedups with our previous change so aren’t we done? How do we know if there is more to gain from “the same” optimization?

One way is simply to look at the code. With some experience you can relatively easily spot the potential problems and “see” if the code is sound or not.

Another way would be to use a profiler like V-Tune, and check how many cache misses you get in that function.

And sometimes you can simply design small experiments, small tests that artificially remove the potential problems. If the performance doesn’t change, then you know you’re good, nothing left to do. If the performance does change, then you know you have some more work ahead of you.

This probably sounds a bit abstract so let me show you what I mean. The code is like this now:

So we’re going to read from this badly-named “list” array. We look for “list” in the code. We first find this:

This reads the array sequentially, which is theoretically optimal, so no problem here. Granted: we don’t use all the data in the cache line we just fetched, but let’s ignore this. For now we are just checking whether we read the array sequentially or in random order.

Next we find this:

With:

And “Sorted” is the output of the radix sort, which sorted boxes along the X axis.

Well this one is not good. The radix implementation is for a “rank sorter”, i.e. it returns sorted indices of input bounds. But it does not sort the input array itself (contrary to what std::sort would do, for example). This means that we keep reading random boxes in the middle of the box-pruning loop. Oops.

And the same problem occurs for Index1 later on:

So this is what I mentioned above: you read the code, and you can easily spot theoretical problems.

And then you can sometimes design tests to artificially remove the problem and see if performance changes. In this case we’re lucky because it’s easy to do: we can just pre-sort the input array (the bounds) before calling the box-pruning function. That way the sorting inside the function will have no effect, the radix sort will return an identity permutation, the code will read the input array sequentially, and we will see how much time we waste in cache misses.

Quickly said, quickly done, the change is just like this in Main.cpp:

Change gPresortBounds from false to true, run the test again… and….

Home PC:

Complete test (brute force): found 11811 intersections in 355897 K-cycles.

Complete test (box pruning): found 11811 intersections in 46586 K-cycles.

Office PC:

Complete test (brute force): found 11811 intersections in 312868 K-cycles.

Complete test (box pruning): found 11811 intersections in 49315 K-cycles.

Wo-ah.

Now that’s interesting. The function became faster again. This means that we are still suffering from cache misses out there, and that there is more to do to address this issue.

It is also a great example of why data-oriented design matters: the code did not change at all. It is the exact same function as before, same number of instructions, same position in the executable. It computes the same thing, returns the same number of overlapping pairs.

But somehow it’s 15% / 20% faster, just because we reorganized the input data. Ignore this, focus only on the code, and that’s the amount of performance you leave on the table in this case.

On a side note, we saw this kind of thing in the past, in different contexts. For example triangle lists reorganized to take advantage of the vertex cache were faster to render - sometimes faster than strips. Even on older machines: making the code run faster by just changing the data is very reminiscent of what Leonard / Oxygene reported when working on his 16*16 “sprite record” demos on Atari ST. Several times in these experiments the Atari ST code didn’t change at all, but he was able to display more and more sprites by improving his “data builder” (a tool running on a PC). This had nothing to do with cache misses, but it was the same observation: you get the best results by optimizing both the code and the data. In his case, the data was not even optimized on the same machine/architecture.

So, anyway: we identified the problem. Now how do we solve it?

Well, the obvious way: we just create a secondary bounds array, sorted in the desired order, and we parse that sorted array instead of the original one. Basically we simply replicate what we did above to presort bounds, inside the box pruning function.

So, yes, we need more memory. And yes, the setup code becomes more complex (but it was not the bottleneck at all, remember? So it doesn’t matter).

On the other hand, now that the array itself is sorted, there is no need to work with the sorted indices anymore: they will always be the identity permutation. So the setup code becomes more complex but the inner loop’s code can actually be further simplified.

One complication is that we need to introduce a remapping between the user’s box indices and our now entirely sequential internal box indices. Fortunately we already have the remap table: it is the same as the sorted indices returned by the radix sort! So it all fits perfectly together and the only thing we need to do is remap the box indices after an overlap has been found, before writing them to the output buffer. In other words: we removed indirections in the frequent case (for each parsed box), and added indirections in the infrequent case (when an overlap is actually found).
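A sketch of that data flow, with hypothetical names (Sorted being the ranks returned by the radix sort):

```cpp
// Setup: copy the user's bounds into an internal array, sorted along X.
for(udword i=0; i<nb; i++)
    SortedBounds[i] = bounds[Sorted[i]];

// Inner loop: SortedBounds is parsed sequentially with plain indices, and the
// remap back to user box indices only happens when an overlap is actually found.
pairs.Add(Sorted[Index0], Sorted[Index1]);
```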

The results speak for themselves.

Home PC:

Complete test (brute force): found 11811 intersections in 781350 K-cycles.
45584 K-cycles.
44734 K-cycles.
41679 K-cycles.
41810 K-cycles.
41654 K-cycles.
41623 K-cycles.
41634 K-cycles.
41605 K-cycles.
41779 K-cycles.
41633 K-cycles.
42037 K-cycles.
41752 K-cycles.
41836 K-cycles.
41639 K-cycles.
41618 K-cycles.
41636 K-cycles.
Complete test (box pruning): found 11811 intersections in 41605 K-cycles.

Office PC:

Complete test (brute force): found 11811 intersections in 813615 K-cycles.
46985 K-cycles.
46282 K-cycles.
47889 K-cycles.
47757 K-cycles.
55044 K-cycles.
52660 K-cycles.
47923 K-cycles.
53199 K-cycles.
47410 K-cycles.
46192 K-cycles.
45860 K-cycles.
46140 K-cycles.
45634 K-cycles.
46274 K-cycles.
47077 K-cycles.
46284 K-cycles.
Complete test (box pruning): found 11811 intersections in 45634 K-cycles.

The gains are summarized here:

Home PC

| Version | Timings (K-Cycles) | Delta (K-Cycles) | Speedup | Overall X factor |
|---|---|---|---|---|
| (Version1) | (101662) | | | |
| Version2 - base | 98822 | 0 | 0% | 1.0 |
| Version3 | 93138 | ~5600 | ~5% | ~1.06 |
| Version4 | 81834 | ~11000 | ~12% | ~1.20 |
| Version5 | 78140 | ~3600 | ~4% | ~1.26 |
| Version6a | 60579 | ~17000 | ~22% | ~1.63 |
| Version6b | 41605 | ~18000 | ~31% | ~2.37 |

Office PC

| Version | Timings (K-Cycles) | Delta (K-Cycles) | Speedup | Overall X factor |
|---|---|---|---|---|
| (Version1) | (96203) | | | |
| Version2 - base | 92885 | 0 | 0% | 1.0 |
| Version3 | 88352 | ~4500 | ~5% | ~1.05 |
| Version4 | 77156 | ~11000 | ~12% | ~1.20 |
| Version5 | 73778 | ~3300 | ~4% | ~1.25 |
| Version6a | 58451 | ~15000 | ~20% | ~1.58 |
| Version6b | 45634 | ~12000 | ~21% | ~2.03 |

This is even faster than what we got during our previous experiment with pre-sorted bounds. Which means that not only we took care of the cache misses, but also the extra simplifications of the main loop provided additional gains. Bonus!

And with that, we became 2X faster than the version we started from. That’s a milestone.

What we learnt:

The same as in the previous post, but it’s so important it’s worth repeating: cache misses are more costly than anything else. If you can only optimize one thing, that’s what you should focus on.

Before we wrap this one up, let’s come back to the loss of genericity we mentioned before. At the end of part 5 we said that we could get back our generic version for no additional cost. That is, we could put back the user-defined axes and don’t pay a performance price for it.

It should be clear why this is the case. We are now parsing internal buffers in the main loop, instead of the user-provided source array. So we could very easily flip the box coordinates while filling up these internal buffers, in the setup code. The main inner loop would remain untouched, so it would give back the user-defined axes virtually for free.

So that’s one loose end taken care of. But there is another big one left, also mentioned in part 5: SIMD.

We will slowly move towards it in next post.

GitHub code for part 6b

Box pruning revisited - part 6a - data-oriented design

February 14th, 2017

Part 6a – data-oriented design

There have been a lot of talks and discussions in the past about “data-oriented programming” and how it leads to faster code. Just Google the term along with “Mike Acton” and you will find plenty of it. I don’t need to repeat everything here: Mike is right. Listen to Mike.

In practical terms: we need to think about cache misses now. What we do with the data is one thing, but how we access the data is another equally important thing.

As far as our box pruning function is concerned, the data is pretty simple: we read an array of AABB pointers in input, and we write an array of pairs in output.

Reading from and writing to linear arrays is the most cache-friendly thing one can do, so we’re not starting from a bad place (there are no linked lists or complicated structures here). However we are reading an array of AABB pointers, not an array of AABBs. And that extra indirection is probably costly.

I remember using this design for practical reasons: I wanted the function to be user-friendly. At the time I was using a classical “object-oriented” design for my game objects, and I had something like this:
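In a simplified sketch (not the actual class):

```cpp
class GameObject
{
    public:
    AABB    mBounds;    // world-space AABB embedded in the object
    // ... rendering, physics and gameplay members ...
};
```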

The AABB of each object was embedded within the game object class, vanilla C++ design. And I didn’t want the library to force users to first gather and copy the bounds into a separate bounds array just to be able to call the function. So instead, the function accepts an array of AABB pointers, so that it can read the AABBs directly from the game objects. The AABBs can neatly stay as members of the objects; there is no need to duplicate them in a separate array and use more memory, no need to break up the neat objects into disjoint classes. It all looked quite nice.

And I suppose it was. Nice.

But fast, no, that, it was not.

Reading the bounds directly from the game objects is an extraordinarily bad idea: it makes you read some scattered data from some random location for each box, it’s the perfect recipe for cache misses and basically the worst you can do.

You can solve it with prefetch calls, to some extent. But prefetching is a fragile solution that tends to break when code gets refactored, and making it work in a cross-platform way is tedious (the different platforms may not even have the same cache line size, so you can easily imagine the complications).

A better idea is to switch to “data-oriented design” instead of “object-oriented design”. In our case it simply means that the bounds are extracted from the game objects and gathered in an array of bounds, typically called “bounds manager”. Something along these lines:
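A minimal sketch of such a bounds manager, again with made-up names:

    // Data-oriented layout: all the AABBs live together in one linear array,
    // which is exactly what the pruning loop wants to read. Game objects keep
    // an index into this array instead of embedding the AABB itself.
    class BoundsManager
    {
        public:
        udword  mNbBounds;
        AABB*   mBounds;    // contiguous array of bounds
    };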

Vanilla data-oriented design, that one.

There are pros & cons to both approaches, which are beyond the scope of this blog post. All I am concerned about here is the performance of the box pruning function, and in this respect the data-oriented design wins big time.

The modifications to the code are minimal. The function signature simply becomes:
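Paraphrasing the prototypes, the change essentially amounts to dropping one level of indirection:

    // Before: an array of pointers to bounds (one indirection per box)
    bool CompleteBoxPruning(udword nb, const AABB** list, Container& pairs);

    // After: one flat, linear array of bounds
    bool CompleteBoxPruning(udword nb, const AABB* list, Container& pairs);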

And the corresponding indirections are removed from the code. That’s it. That’s the only difference. We don’t “optimize” the code otherwise. But the effect on performance is still dramatic:

Home PC:

Complete test (brute force): found 11811 intersections in 782034 K-cycles.
64260 K-cycles.
60884 K-cycles.
60645 K-cycles.
60586 K-cycles.
60814 K-cycles.
62650 K-cycles.
60591 K-cycles.
60697 K-cycles.
60833 K-cycles.
60579 K-cycles.
60601 K-cycles.
60905 K-cycles.
60596 K-cycles.
60836 K-cycles.
60673 K-cycles.
60585 K-cycles.
Complete test (box pruning): found 11811 intersections in 60579 K-cycles.

Office PC:

Complete test (brute force): found 11811 intersections in 810333 K-cycles.
59420 K-cycles.
58451 K-cycles.
59005 K-cycles.
65448 K-cycles.
64930 K-cycles.
62007 K-cycles.
59005 K-cycles.
58847 K-cycles.
58904 K-cycles.
58462 K-cycles.
58843 K-cycles.
58655 K-cycles.
58583 K-cycles.
59056 K-cycles.
58776 K-cycles.
58556 K-cycles.
Complete test (box pruning): found 11811 intersections in 58451 K-cycles.

The gains are summarized here:

Home PC

Version            Timings (K-Cycles)   Delta (K-Cycles)   Speedup   Overall X factor
(Version1)         (101662)             -                  -         -
Version2 - base    98822                0                  0%        1.0
Version3           93138                ~5600               ~5%       ~1.06
Version4           81834                ~11000              ~12%      ~1.20
Version5           78140                ~3600               ~4%       ~1.26
Version6a          60579                ~17000              ~22%      ~1.63

Office PC

Version            Timings (K-Cycles)   Delta (K-Cycles)   Speedup   Overall X factor
(Version1)         (96203)              -                  -         -
Version2 - base    92885                0                  0%        1.0
Version3           88352                ~4500               ~5%       ~1.05
Version4           77156                ~11000              ~12%      ~1.20
Version5           73778                ~3300               ~4%       ~1.25
Version6a          58451                ~15000              ~20%      ~1.58

The modifications to the code have been the smallest so far (really just replacing “->” with “.” here and there), and yet this gave the biggest speedup until now. In other words: our current best optimization did not come from modifying how we process the data (==roughly, “the code”), or from changing the data itself, but instead it came from changing how we access the data. That’s a pretty powerful result.

What we learnt:

Choose data-oriented design over object-oriented design for performance.

The data access pattern is often more important than anything else.

We are not done yet though. There is a lot more to do w.r.t. optimizing our data access, as we will see next time. But this change was so small and yet so effective that it was worth isolating in a separate blog post to drive the point home.

GitHub code for part 6a

Box pruning revisited - part 5 - hardcoding

February 14th, 2017

Part 5 – hardcoding

Last time we saw how making the code less generic could also make it faster. This trend continues in this post.

There is another part of the box-pruning function that looks suspiciously over-generic. In the original code the sorting axes are defined by the user and sent as a parameter to the function:
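Paraphrased, the prototype was of this form, with the sorting axes passed in by the caller:

    // user-defined sorting axes: Axis0 is the main (sorting) axis,
    // Axis1/Axis2 are used for the remaining overlap tests
    bool CompleteBoxPruning(udword nb, const AABB** list, Container& pairs, const Axes& axes);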

In the initial code from 2002, I passed the following axes to the function:

Inside the function they are copied to local variables:
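That copy is just a handful of scalar reads, along these lines (member names assumed):

    // copy the user-defined axes to local variables
    const udword Axis0 = axes.Axis0;
    const udword Axis1 = axes.Axis1;
    const udword Axis2 = axes.Axis2;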

After that, the boxes are sorted along the main axis (Axis0), which essentially replicates the X-related part of the AABB::Intersect(const AABB& a) function. In other words, the sorting and scanning procedure emulates this line from AABB::Intersect():
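For reference, that X-related line is the usual single-axis rejection test, something like:

    // X part of AABB::Intersect() - this is what the sort along Axis0 emulates
    if(mMax.x < a.mMin.x || a.mMax.x < mMin.x) return false;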

And the remaining AABB::Intersect() tests for Y and Z are implemented using the remaining Axis1 and Axis2 variables:
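Paraphrased, those remaining tests index the bounds with the variable axes instead of hardcoded offsets, along these lines (accessor names assumed):

    // Y/Z parts of the overlap test, driven by the user-defined Axis1/Axis2
    if(box0.GetMax(Axis1) >= box1.GetMin(Axis1) && box1.GetMax(Axis1) >= box0.GetMin(Axis1)
    && box0.GetMax(Axis2) >= box1.GetMin(Axis2) && box1.GetMax(Axis2) >= box0.GetMin(Axis2))
    {
        // 2D overlap => report the pair
    }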

This is all good and fine. Unfortunately it is also sub-optimal.

It was invisible to me when I first wrote the code, but I know now that fetching the axes from user-defined variables has a measurable impact on performance. I saw it happen multiple times in multiple algorithms. The cost is not always large, but it is always real: fetching the axes is going to be either an extra read from the stack, or one less register available to the optimizer to do its job - and there aren’t a lot of registers in the first place on x86. Either way it makes the addressing mode more complex, and if all these things happen in the middle of an inner loop, it quickly accumulates and becomes measurable.

So a much better idea is simply to hardcode the axes.

Well, a more efficient idea at least. Whether it is “better” or not is debatable.

Let’s see the effect on performance first, and debate later.

So the function sorted the boxes along X, and then did extra overlap tests on Y and Z within the innermost loop. The rationale for this default choice was that the up axis is usually either Y or Z, “never” X. You typically don’t want to sort along your up axis, because most games have a flat, essentially 2D world, even if they’re 3D games. If you project the bounds of such a world along the up axis, the projections will pretty much all overlap each other, and this will greatly reduce the “pruning power” of the main sort. Basically the algorithm would then degenerate to the O(n^2) brute-force version.

Thus, the safest main sorting axis is X, because it is usually not the vertical axis. After that, and for similar reasons, it is usually slightly better to use the remaining non-vertical axis for “Axis1” and keep the vertical axis last in “Axis2” (the 2D overlap test can then early-exit more quickly).

One can also analyze the distribution of objects at runtime and select a different main axis each frame, depending on how objects are spatially organized at any given time. That’s what I did in my old Z-Collide library, eons ago.

Similarly, one could try to make the code more generic by removing the requirement that the input axes are the regular X, Y, Z coordinate axes. After all it is perfectly possible to project the boxes along an arbitrary 3D axis, so the main sorting axis could be anything, and the remaining two axes could be derived from it in the same way a frame can be derived from just a view vector. Unfortunately making the code more generic in such a way, while “better” in theory (because the main sorting axis could be “optimal”), is a terrible idea in practice. You need to introduce dot-products to project the boxes instead of directly reading the bounds’ coordinates, it adds a lot of overhead, and the resulting code is ultimately slower than what you started with - even with a “better” main sorting axis.

No, as we discussed in part 4, what you need to do for performance is the opposite: make the code less generic.

And hardcode the axes.

Do this, and then that part goes away:

It’s replaced with simply:

Similarly, and especially, this also goes away:

It’s replaced with:

Which is implemented this way (note the hardcoding of axes):
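In spirit it is a plain 2D overlap test with the two remaining axes hardcoded, something like:

    // Hardcoded 2D overlap test on the two non-sorting axes (Y and Z)
    static inline bool Intersects2D(const AABB& a, const AABB& b)
    {
        if(b.mMax.y < a.mMin.y || a.mMax.y < b.mMin.y
        || b.mMax.z < a.mMin.z || a.mMax.z < b.mMin.z)
            return false;
        return true;
    }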

So we completely eliminate the need for passing axes to the function, or reading them from local variables. By reading “.x”, “.y” and “.z” directly, we give the compiler the freedom to simply hardcode the offsets in the address calculation. Let’s look at the resulting disassembly.

It looked like this before the change:

It looks like this after the change:

Did you spot the difference?

“0Ch” is the offset to go from min to max within the AABB class, so you can easily guess that eax in the first movss is the AABB address.

“ecx*4” disappeared entirely. That was the part dealing with the user-defined axes.

So we got the expected results:

  • the addressing mode became simpler (“eax+0Ch” instead of “eax+ecx*4+0Ch”)
  • we got one less read from the stack / one less instruction (“mov ecx, dword ptr [esp+14h]” vanished)
  • register pressure decreased (new code doesn’t use ecx, while initial code did)

The performance increase is minor in this case, but measurable:

Home PC:

Complete test (brute force): found 11811 intersections in 781460 K-cycles.
78971 K-cycles.
78157 K-cycles.
78216 K-cycles.
78319 K-cycles.
78140 K-cycles.
80163 K-cycles.
78618 K-cycles.
78194 K-cycles.
78182 K-cycles.
78308 K-cycles.
78140 K-cycles.
78252 K-cycles.
78385 K-cycles.
78489 K-cycles.
78404 K-cycles.
78620 K-cycles.
Complete test (box pruning): found 11811 intersections in 78140 K-cycles.

Office PC:

Complete test (brute force): found 11811 intersections in 824010 K-cycles.
75750 K-cycles.
76740 K-cycles.
75640 K-cycles.
75539 K-cycles.
75800 K-cycles.
78633 K-cycles.
80828 K-cycles.
82691 K-cycles.
74491 K-cycles.
74675 K-cycles.
75398 K-cycles.
73778 K-cycles.
75074 K-cycles.
79776 K-cycles.
73916 K-cycles.
74263 K-cycles.
Complete test (box pruning): found 11811 intersections in 73778 K-cycles.

The gains are summarized here:

Home PC

Version            Timings (K-Cycles)   Delta (K-Cycles)   Speedup   Overall X factor
(Version1)         (101662)             -                  -         -
Version2 - base    98822                0                  0%        1.0
Version3           93138                ~5600               ~5%       ~1.06
Version4           81834                ~11000              ~12%      ~1.20
Version5           78140                ~3600               ~4%       ~1.26

Office PC

Version            Timings (K-Cycles)   Delta (K-Cycles)   Speedup   Overall X factor
(Version1)         (96203)              -                  -         -
Version2 - base    92885                0                  0%        1.0
Version3           88352                ~4500               ~5%       ~1.05
Version4           77156                ~11000              ~12%      ~1.20
Version5           73778                ~3300               ~4%       ~1.25

This doesn’t look like much but it will have profound effects on further optimizations. This change in itself does not make a big difference, but it opens the door for more drastic optimizations like SIMD. We will come back to this in another blog post.

What we learnt:

Hardcoding things can make the code run faster.

From small beginnings come great things. Small optimizations can be worth doing just because they open the door for further, greater optimizations.

Before we wrap up this post, there is a bit of a debate left to deal with.

Is it worth the risk of hitting a case where sorting along X makes the code degenerate to O(n^2)?

Yes! Because all broadphase algorithms have bad cases that degenerate to O(n^2) anyway. Just put all the bounds at the origin (each box overlapping all other boxes) and the fastest broadphase algorithm in the world for that case is the brute-force O(n^2) approach. Everything else will have the overhead of the fancy pruning algorithms, which will not be able to prune anything, plus the exact same number of AABB overlap tests on top of it. So the existence of potential degenerate cases is not something to be overly worried about: they exist for all broadphase algorithms anyway.

Is it worth losing the genericity for such minor gains?

Yes! Because it turns out we didn’t entirely lose the genericity. There are two traditional ways to get the performance gains from hardcoding the inputs while keeping the genericity of the non-hardcoded version.

The first way is to use templates. You could rewrite this function with templates, and instantiate a different function for different values of the input axes. This works, but I don’t really like the approach. Moving all the code to headers gives me less control over inlining. And generating multiple versions of the same function can quickly bloat the code and silently increase the size of the exe for no valid reason, which in turn can make things slower simply by decreasing locality between callers and called. These things happen, and these things matter - that’s why you have optimization options in the linker, to let users define the function order within the final executable.

The second way, then, is to realize that X, Y and Z are just offsets within the user-defined array. And nothing forces you to put the values there in this traditional order.

That is, if you want to sort along Z instead of X, just switch the X and Z values in your input bounds. You don’t even need to duplicate the function: just change the input data. Granted: it is not always possible to modify the input data when the bounds are, say, stored within a game object. Flipping the bounds coordinates there would make the box pruning function work with the preferred sorting axes, but it would also have an effect on every other piece of code accessing the same bounds. Which is not ideal.

So what’s the ideal?

Well, we just gave a hint of things to come when we mentioned tweaking the input data. Optimizing the code is one thing, but optimizing the data is equally important, if not more. Next time we will implement a data-oriented optimization that will not only make the function a lot faster, but also give us back our genericity for no extra cost.

GitHub code for part 5

Box pruning revisited - part 4 - sentinels

February 14th, 2017

Part 4 – sentinels

So we started optimizing the code, first modifying the two inner ‘while’ loops. That was a random place to start with, just the first thing that came to mind. But it was a good one, because as it turns out we’re not done with these two lines yet. Beyond the compiler-related issues, we can also optimize the algorithm itself here.

The two loops now look like this:
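Their exact form is in the code, but the shared pattern is what matters here - sketched below, with one loop comparing against MinLimit and the other against MaxLimit:

    // First comparison: scaffolding, only there to avoid reading past the end.
    // Second comparison: the test the algorithm actually needs.
    while(RunningAddress<LastSorted && PosList[Index = *RunningAddress++] < Limit);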

In both cases, the first comparisons (“RunningAddress<LastSorted”) are only here to ensure we don’t read past the end of the buffers. But they have no actual value for the algorithm itself. It’s scaffolding to support the actual “meat” of the algorithm. You would never mention that part in pseudocode to explain what the algorithm does. In other words: we could omit the first comparisons and, provided it didn’t make the code crash, we’d still get the correct results out of the function. This wouldn’t be the same for the second comparisons: without them the algorithm would collapse entirely.

Identifying these “useless” bits is a classical part of optimization for me: I remember using this strategy a lot on the Atari ST to identify which instructions I could target first. That’s half of the battle. That gives you a goal: how do I get rid of them?

Once that goal is clearly expressed and identified, solutions come to mind more naturally and easily. In our case here, a classical solution is to use “sentinel” values, stored within PosList, to ensure that the second comparisons exit the loop when we reach the end of the buffers. This is a fairly standard strategy and I already explained it in the SAP document. So if you are not familiar with it, please take a moment to read about it there (it’s in Appendix B, page 24).

The implementation for the box-pruning function is straightforward. We allocate one more entry for the sentinel:
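In code, that is simply one extra element in the allocation (sketch):

    // one extra slot reserved for the sentinel value
    float* PosList = new float[nb+1];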

Then we fill up that buffer as usual for the first ‘nb’ entries. But we write one extra value (the sentinel) at the end of the buffer:
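Something along these lines, with FLT_MAX used as the sentinel (accessor name assumed):

    // fill PosList as usual for the first nb entries...
    for(udword i=0; i<nb; i++)
        PosList[i] = list[i]->GetMin(Axis0);
    // ...then write the sentinel after the last valid entry
    PosList[nb] = FLT_MAX;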

Afterwards it’s as easy as it gets, just remove the first comparisons in the inner loops:
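That is, the pattern sketched above simply loses its guard:

    // the sentinel guarantees the loop terminates, so the explicit
    // "RunningAddress<LastSorted" bounds check is no longer needed
    while(PosList[Index = *RunningAddress++] < Limit);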

The first comparisons can be removed because we guarantee that we will eventually fetch the sentinel value from PosList, and that value will always be greater than MinLimit and MaxLimit.

Thus this will be enough to exit the loop.

Now, as far as the disassembly is concerned, the changes are a bit hard to follow. The code got re-arranged (check out the position of “call operator delete” in versions 3 and 4 for example) and in its default form the disassembly is a bit obscure. Nonetheless, using the NOPS macro around the while instruction reveals that our change had the desired effect: one of the two comparisons clearly vanished.

More importantly, the benefits were not just theoretical. We also see the performance increase in practice:

Office PC:

Complete test (brute force): found 11811 intersections in 815456 K-cycles.
78703 K-cycles.
77667 K-cycles.
78142 K-cycles.
86118 K-cycles.
77456 K-cycles.
77916 K-cycles.
77156 K-cycles.
78100 K-cycles.
81364 K-cycles.
77259 K-cycles.
86848 K-cycles.
79394 K-cycles.
81877 K-cycles.
80809 K-cycles.
84473 K-cycles.
81347 K-cycles.
Complete test (box pruning): found 11811 intersections in 77156 K-cycles.

Home PC:

Complete test (brute force): found 11811 intersections in 781900 K-cycles.
82811 K-cycles.
81905 K-cycles.
82147 K-cycles.
81923 K-cycles.
82156 K-cycles.
81878 K-cycles.
82289 K-cycles.
82125 K-cycles.
82474 K-cycles.
82050 K-cycles.
81945 K-cycles.
82257 K-cycles.
81902 K-cycles.
81834 K-cycles.
82587 K-cycles.
82360 K-cycles.
Complete test (box pruning): found 11811 intersections in 81834 K-cycles.

The gains are summarized here:

Home PC

Version            Timings (K-Cycles)   Delta (K-Cycles)   Speedup   Overall X factor
(Version1)         (101662)             -                  -         -
Version2 - base    98822                0                  0%        1.0
Version3           93138                ~5600               ~5%       ~1.06
Version4           81834                ~11000              ~12%      ~1.20

Office PC

Version            Timings (K-Cycles)   Delta (K-Cycles)   Speedup   Overall X factor
(Version1)         (96203)              -                  -         -
Version2 - base    92885                0                  0%        1.0
Version3           88352                ~4500               ~5%       ~1.05
Version4           77156                ~11000              ~12%      ~1.20

Note that this optimization will only work if the input bounding boxes do not already contain our sentinel value - otherwise there would be no way to distinguish between the sentinel value and a regular value.

This is the first instance of a rather common trend while optimizing code: the loss of genericity. Sometimes to make the code run faster, you need to accept some compromises or limitations compared to a fully “generic” implementation.

In this case, the limitation enforced by the optimization is that input bounding boxes cannot contain FLT_MAX. This is not a big problem in practice: usually only planes have infinite bounding boxes, and you can always make them use FLT_MAX/2 instead of FLT_MAX if needed. So this limitation is easy to accept. Sometimes such limitations are a lot more debatable.

What we learnt:

Identify “useless” parts of the code that don’t contribute to the results, then eliminate them.

A less generic function can often run faster.

And that’s already it for this part, short and sweet.

GitHub code for part 4

 
