SIMD box-triangle overlap function

A long time ago (around 2001), I helped Tomas Möller optimize a triangle-vs-box, SAT-based overlap test. At that time I was working on the Opcode library, and such a function was needed to achieve my goals.

When working on Opcode 2.0, I wrote a SIMD version of that same function. Since then I have found several other SIMD implementations of that code, and while benchmarking them I found that most of them were surprisingly slow - sometimes slower than the original code! I cannot really explain why, but it seems to indicate that writing the SIMD version is not as straightforward as one might imagine. And thus, I am now releasing my implementation, in the hope that somebody will find it useful.

There are two versions of the original code, labelled “old code” and “new, faster code” online. Despite the claims, on my Windows-based machine the “faster code” is significantly slower. It seems reasonable that doing the “class III” tests last would be best, but I suppose this depends on both the architecture and the test scenario. In any case, both versions (more or less verbatim) are included in the benchmark.

The original code contains a lot of early exits, and thus the time it takes depends a lot on which codepath is taken. The best case is when you reach the first early exit. The worst case is when you reach the end of the function, i.e. the triangle and the box overlap. I believe one of the biggest mistakes made in the slow SIMD implementations was to get rid of branches and early exits at all costs (“Hey look! Zero branches! It must be super fast!”). Of course that is a very naive view; things are way more subtle and complex than that. The fastest code is the code that is never executed, and thus you should embrace early exits, not avoid them. In fact, my implementation contains one more early exit than the original code, for when the box fully contains the triangle (this was an important case for me in some applications).

I have 4 different test scenarios:
- best case
- worst case
- no hit
- mixed

“Best case” is the aforementioned new early exit, for when the box fully contains the triangle. This is basically a point-in-AABB test, it is the first early exit in the code, and thus the “best case” is when we reach that codepath.
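In scalar form that new exit boils down to a point-in-AABB test on v0 (a sketch, assuming the vertices have already been translated into the box's local frame; the names here are mine, not Opcode's): if v0 lies inside the box, the triangle certainly overlaps it, and in particular the test always fires when the box fully contains the triangle.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

// True when |p| <= extents on all three axes, i.e. p lies inside an
// axis-aligned box centered at the origin with the given half-extents.
static bool pointInBox(const Vec3& p, const Vec3& extents)
{
    return std::fabs(p.x) <= extents.x
        && std::fabs(p.y) <= extents.y
        && std::fabs(p.z) <= extents.z;
}
```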

“Worst case” is when the box touches the triangle, but the triangle is not fully inside. In that case we reach the end of the overlap function without taking any early exits. This should be the slowest codepath.

“No hit” is when the triangle does not touch the box, and we reach one of the “regular” early exits (i.e. the ones that also existed in the non-SIMD code). Time depends on what codepath is reached.

“Mixed” is just a mix of the above cases.

Here are typical results on my test PC:

Version1 (Möller’s “old code”):

Time (default): 5072
Time (default): 3773
Time (default): 953
Time (default): 5088
Time (optimized): 246
Time (optimized): 1898
Time (optimized): 329
Time (optimized): 1544
Hits0/Hits1: 16384/16384
Hits0/Hits1: 16384/16384
Hits0/Hits1: 0/0
Hits0/Hits1: 12420/12420
Best case:  Speedup: 20.617886
Worst case: Speedup: 1.987882
No hit:     Speedup: 2.896657
Mixed:      Speedup: 3.295337

Version2 (Möller’s “new, faster code”):

Time (default): 6209
Time (default): 4714
Time (default): 1304
Time (default): 5199
Time (optimized): 237
Time (optimized): 1822
Time (optimized): 316
Time (optimized): 1540
Hits0/Hits1: 16384/16384
Hits0/Hits1: 16384/16384
Hits0/Hits1: 0/0
Hits0/Hits1: 12420/12420
Best case:  Speedup: 26.198313
Worst case: Speedup: 2.587267
No hit:     Speedup: 4.126582
Mixed:      Speedup: 3.375974

So what we see here:

- the “new, faster code” is in fact slower. Thus, ignore “Version2” from now on.

- the non-SIMD and SIMD versions all find the same number of hits (as they should). I could have better tests for this but I’ve been using the SIMD version for a while now and it should be quite solid.

- the new, additional early exit gives me a 20X speedup. Not bad for a SIMD version whose theoretical maximum speedup is 4X. In the SIMD code, this is taken care of by this snippet:

if(1)
{
	// Clear the sign bits to get |v0| componentwise.
	const unsigned int MaskI = 0x7fffffff;
	const __m128 cV = _mm_and_ps(v0V, _mm_load1_ps((const float*)&MaskI));
	// |v0| - extents is negative exactly on the axes where v0 lies strictly
	// inside the box; movemask collects the three sign bits.
	const unsigned int test = _mm_movemask_ps(_mm_sub_ps(cV, extentsV));
	// v0 inside on x, y and z => the triangle overlaps the box. This fires
	// in particular whenever the box fully contains the triangle.
	if((test&7)==7)
		return 1;
}

That is well worth it if you ask me. Helped enormously in some past projects of mine.
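For reference, here is the same sign-bit trick in a self-contained form (a sketch; the scalar wrapper and its name are mine, not Opcode's interface):

```cpp
#include <xmmintrin.h>

// Returns 1 when the point (x, y, z) lies strictly inside an
// origin-centered box with half-extents (ex, ey, ez).
static int pointInBoxSSE(float x, float y, float z,
                         float ex, float ey, float ez)
{
    const unsigned int MaskI = 0x7fffffff;
    const __m128 pV = _mm_setr_ps(x, y, z, 0.0f);
    const __m128 extentsV = _mm_setr_ps(ex, ey, ez, 1.0f); // w lane masked out below
    const __m128 cV = _mm_and_ps(pV, _mm_load1_ps((const float*)&MaskI)); // |p|
    // Sign bit set on each axis where |p| - extents < 0, i.e. inside.
    const unsigned int test = _mm_movemask_ps(_mm_sub_ps(cV, extentsV));
    return (test & 7) == 7;
}
```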

- Speedups otherwise vary between 2X and 3X, which is what you usually expect from a good SIMD implementation.

And that’s about it. This is straightforward, but apparently it is easy to write a bad SIMD version that ends up slower than the original. I suppose some people just write the code, assume it’s good, and do not run benchmarks? I don’t know.

Bitcoin tip jar is here if you end up using that code. Project files are here.

6 Responses to “SIMD box-triangle overlap function”



  1. bob Says:

    unroll it. the latencies are what's costing you

  2. Bram Says:

    Yeah, it’s hard to beat a compiler.
    You make some valid points.

    Still, let me address two issues:

    1) Your SIMD version is still AoS (Array of Structures.)
    You use the 4-way SIMD to do x/y/z/w in parallel.
    This most likely will not get you maximum performance.
    What if you use 8-way (AVX) or 16-way (AVX512) SIMD?
    x,y,z,w and … ?

    Faster would be to use SoA (Structures of Arrays.)
    You store, let’s say, 1024 vertices as: first the 1024 x-coords, then the 1024 y-coords, then the 1024 z-coords. You can typically skip w.

    You can then compute 4 (or 8, or 16) triangles against a single box.
    Or alternatively, compute 1 triangle against 4/8/16 different boxes.

    Passing 4 triangles to a SoA function would look like:

    simd_intersect_4_trias( __m128 v0x, __m128 v0y, __m128 v0z, __m128 v1x, __m128 v1y, __m128 v1z, __m128 v2x, __m128 v2y, __m128 v2z )
    {
    }

    And yes, early out is then not possible, because four triangles are processed in one go.
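    For illustration, the SoA idea above might look like this for just the v0-inside-box part, testing the first vertex of four triangles against one box in a single pass (a sketch with invented names, not Opcode's interface):

```cpp
#include <xmmintrin.h>

// SoA: the x, y and z coordinates of four triangles' first vertices arrive
// in separate registers. Returns a 4-bit mask, bit i set when triangle i's
// v0 lies strictly inside an origin-centered box with half-extents (ex, ey, ez).
static int v0InsideBox4(__m128 v0x, __m128 v0y, __m128 v0z,
                        float ex, float ey, float ez)
{
    const __m128 signMask = _mm_set1_ps(-0.0f);      // only the sign bit set
    const __m128 ax = _mm_andnot_ps(signMask, v0x);  // |x| for all four lanes
    const __m128 ay = _mm_andnot_ps(signMask, v0y);
    const __m128 az = _mm_andnot_ps(signMask, v0z);
    const __m128 inX = _mm_cmplt_ps(ax, _mm_set1_ps(ex));
    const __m128 inY = _mm_cmplt_ps(ay, _mm_set1_ps(ey));
    const __m128 inZ = _mm_cmplt_ps(az, _mm_set1_ps(ez));
    return _mm_movemask_ps(_mm_and_ps(inX, _mm_and_ps(inY, inZ)));
}
```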

    2) Sometimes SIMD code gets a nasty stall when 128b/256b instruction sets are mixed.

    From Agner.org:

    “If a code sequence contains both 128-bit and 256-bit instructions then it is important to use the VEX-prefix version of the 128-bit instructions. Mixing 256 bit instructions with legacy 128-bit instructions without VEX prefix will cause the processor to switch the register file between different states which will cost many clock cycles on Intel processors. “

  3. admin Says:

    “unroll it”: I don’t understand what you mean here, since there are no loops to unroll (neither in the original code nor in the SIMD code). No idea.

  4. admin Says:

    Bram: I never claimed this version would give me “maximum performance”… The goal was simply to provide a drop-in replacement for the original (i.e. keeping the same interface), that would use SIMD and actually be faster than not using it. That’s all. Of course if you reorganize all the client code and its data, you have more options for extra perf.

    I’m afraid I don’t understand the point about 128b/256b instructions, since I am not using 256b instructions here. My home PC does not even support AVX. Maybe I am missing something.

  5. admin Says:

    Oh dear this blog sucks for comments. Screws up the formatting all the time.

  6. Dirk Says:

    I made the same observation for my ray-triangle SAT. I made a branch-free version and it was super slow compared to the usual clipping version. But when I added an early exit after testing the face directions I got a huge speed-up.

    One explanation I heard is that when there is a branch the CPU executes both paths and can then skip over the branch without penalty. If you make it branch-free, your code gets executed no matter what. Seems to be true for the latest Intel CPUs. On older Xbox and PS3 hardware things can be totally different. Oh, well…
