This writeup assumes some knowledge of SIMD programming, assembly and AABBs.

Introduction

Of the several ways an Axis Aligned Bounding Box can be represented in 2D, I think the most common is to store the minimum and maximum coordinates. With this approach, most basic operations (overlap tests, union/intersection, etc.) have simple implementations and offer decent performance.

Some time ago, I came up with a modification to this AABB representation that has the potential for better performance. I haven’t seen this technique mentioned or used anywhere, but if you have seen it elsewhere I would appreciate it if you could let me know.

So, let’s say we have a 2D AABB struct with two basic operations: TestOverlap (overlap check for two AABBs) and Union. These are the two operations that I wanted to optimize for a broad-phase collision detection algorithm. An overlap test against a point and a corresponding Intersect function can be optimized in a similar way.

struct AABB {
  float minx, miny;
  float maxx, maxy;
};

bool TestOverlap(AABB a, AABB b) {
  return a.minx <= b.maxx
      && a.miny <= b.maxy
      && a.maxx >= b.minx
      && a.maxy >= b.miny;
}

AABB Union(AABB a, AABB b) {
  return {
    min(a.minx, b.minx),
    min(a.miny, b.miny),
    max(a.maxx, b.maxx),
    max(a.maxy, b.maxy)
  };
}

AABB Union(AABB a, float x, float y) { ... }
AABB Intersect(AABB a, AABB b) { ... }

Basic idea

Since the optimizations are SIMD-oriented, a first observation is that the four float coordinates fit nicely in a 128-bit SIMD vector (or similarly a 256-bit vector if doubles are used). While the data fits nicely, the code has an awkward asymmetry: TestOverlap has a mixture of >= and <= and Union has min and max. Otherwise, it would be straightforward to use a packed vector min/max or a vector comparison.
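To make this concrete, here is a minimal sketch of how the struct maps onto a 128-bit vector (the LoadAABB helper name is mine, not part of any API):

```cpp
#include <immintrin.h>
#include <cstring>

struct AABB {
  float minx, miny;
  float maxx, maxy;
};

// The four floats occupy 16 contiguous bytes, so the whole struct maps
// directly onto one SSE register; memcpy sidesteps strict-aliasing issues.
__m128 LoadAABB(const AABB& box) {
  __m128 v;
  std::memcpy(&v, &box, sizeof v);
  return v;
}
```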

So following this observation, my idea was to store the max coordinates of the AABB negated instead. Negating only one of the two corner coordinates makes the implementations symmetric and SIMD-friendly. If we substitute maxx/maxy with -maxx/-maxy in the above operations, we get:

bool TestOverlap(AABB a, AABB b) {
  return a.minx <= -b.maxx
      && a.miny <= -b.maxy
      && a.maxx <= -b.minx  // -a.maxx >= b.minx
      && a.maxy <= -b.miny; // -a.maxy >= b.miny
}

AABB Union(AABB a, AABB b) {
  return {
    min(a.minx, b.minx),
    min(a.miny, b.miny),
    min(a.maxx, b.maxx), // -max(-a.maxx, -b.maxx)
    min(a.maxy, b.maxy)  // -max(-a.maxy, -b.maxy)
  };
}

This looks better! Union could now be a single packed minimum operation, and while TestOverlap needs multiple instructions, it has more optimization potential than before. It’s also great that this implementation detail can be abstracted away if we disallow direct member access:

struct AABB {
  AABB(float x1, float y1, float x2, float y2) : minx(x1), miny(y1), maxx(-x2), maxy(-y2) {}
  float getMinX() { return minx; }
  float getMaxX() { return -maxx; }
  ...
};
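For illustration, here is a fuller sketch of such a wrapper (the accessor and contains names are mine). Note that the negated storage also makes a point test symmetric: (x, y) lies inside exactly when every stored field is <= the corresponding component of (x, y, -x, -y).

```cpp
struct AABB {
  // The constructor negates the max corner on the way in.
  AABB(float x1, float y1, float x2, float y2)
      : minx(x1), miny(y1), maxx(-x2), maxy(-y2) {}
  float getMinX() const { return minx; }
  float getMinY() const { return miny; }
  float getMaxX() const { return -maxx; }
  float getMaxY() const { return -maxy; }
  // Point containment: all four comparisons now point the same way.
  bool contains(float x, float y) const {
    return minx <= x && miny <= y && maxx <= -x && maxy <= -y;
  }
 private:
  float minx, miny;
  float maxx, maxy;  // stored negated
};
```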

Compilers and analysis of the generated code

Since we haven’t used any SIMD intrinsics, we’re basically leaving the rest up to the compiler…​ which may or may not be a great idea. We need to go further and check what assembly is produced by the latest (and greatest) compilers, at the time of writing GCC 14 and clang 18. The source code and generated assembly can be seen below or in this godbolt link.

Full source code
#include <cmath>

struct AABB {
  float minx, miny;
  float maxx, maxy;
};

// Normal implementation

bool TestOverlap_1(AABB a, AABB b) {
  return a.minx <= b.maxx
      && a.miny <= b.maxy
      && a.maxx >= b.minx
      && a.maxy >= b.miny;
}

AABB Union_1(AABB a, AABB b) {
  return {
    std::min(a.minx, b.minx),
    std::min(a.miny, b.miny),
    std::max(a.maxx, b.maxx),
    std::max(a.maxy, b.maxy)
  };
}

// Optimized representation implementation.

bool TestOverlap_2(AABB a, AABB b) {
  return a.minx <= -b.maxx
      && a.miny <= -b.maxy
      && a.maxx <= -b.minx  // -a.maxx >= b.minx
      && a.maxy <= -b.miny; // -a.maxy >= b.miny
}

bool TestOverlap_2_2(AABB a, AABB b) {
  return a.minx <= -b.maxx
       & a.miny <= -b.maxy
       & a.maxx <= -b.minx  // -a.maxx >= b.minx
       & a.maxy <= -b.miny; // -a.maxy >= b.miny
}

AABB Union_2(AABB a, AABB b) {
  return {
    std::min(a.minx, b.minx),
    std::min(a.miny, b.miny),
    std::min(a.maxx, b.maxx), // -max(-a.maxx, -b.maxx)
    std::min(a.maxy, b.maxy)  // -max(-a.maxy, -b.maxy)
  };
}
Compiled with gcc 14 at -O3
TestOverlap_1(AABB, AABB):
        movq    %xmm1, %rax
        movq    %xmm2, %rdx
        movq    %xmm0, %rsi
        movq    %rax, %rdi
        movq    %xmm3, %rax
        movq    %rsi, %rcx
        xchgq   %rdx, %rax
        movd    %edx, %xmm1
        movq    %rax, %rsi
        movq    %rdx, %rax
        comiss  %xmm0, %xmm1
        jb      .L10
        shrq    $32, %rcx
        shrq    $32, %rax
        movd    %eax, %xmm0
        movd    %ecx, %xmm4
        comiss  %xmm4, %xmm0
        jb      .L10
        movd    %esi, %xmm0
        movd    %edi, %xmm1
        comiss  %xmm0, %xmm1
        setnb   %al
        ret
.L10:
        xorl    %eax, %eax
        ret
Union_1(AABB, AABB):
        movq    %xmm1, %rax
        movdqa  %xmm0, %xmm8
        movdqa  %xmm2, %xmm5
        movq    %rax, %rsi
        movq    %xmm2, %rax
        minss   %xmm8, %xmm5
        movq    %xmm0, %rdi
        movd    %esi, %xmm7
        movq    %rax, %rdx
        shrq    $32, %rsi
        movq    %xmm3, %r9
        movq    %rdi, %rcx
        shrq    $32, %rdx
        movdqa  %xmm3, %xmm6
        movd    %edx, %xmm1
        shrq    $32, %rcx
        movd    %esi, %xmm3
        maxss   %xmm7, %xmm6
        movq    %r9, %xmm4
        movd    %ecx, %xmm2
        shufps  $85, %xmm4, %xmm4
        movdqa  %xmm4, %xmm0
        movaps  %xmm5, %xmm4
        movaps  %xmm1, %xmm5
        minss   %xmm2, %xmm5
        maxss   %xmm3, %xmm0
        movaps  %xmm6, %xmm1
        unpcklps        %xmm5, %xmm4
        unpcklps        %xmm0, %xmm1
        movaps  %xmm4, %xmm3
        movlhps %xmm1, %xmm3
        movaps  %xmm3, -24(%rsp)
        movq    -16(%rsp), %xmm1
        movq    -24(%rsp), %xmm0
        ret
TestOverlap_2(AABB, AABB):
        movq    %xmm1, %rax
        movq    %xmm2, %rdx
        movq    %xmm0, %rsi
        movq    %rax, %rdi
        movq    %xmm3, %rax
        movdqa  %xmm0, %xmm2
        movq    %rsi, %rcx
        xchgq   %rdx, %rax
        movss   .LC0(%rip), %xmm1
        movd    %edx, %xmm0
        movq    %rax, %rsi
        movq    %rdx, %rax
        xorps   %xmm1, %xmm0
        comiss  %xmm2, %xmm0
        jb      .L35
        shrq    $32, %rax
        shrq    $32, %rcx
        movd    %eax, %xmm0
        movd    %ecx, %xmm4
        xorps   %xmm1, %xmm0
        comiss  %xmm4, %xmm0
        jb      .L35
        movd    %esi, %xmm2
        movd    %edi, %xmm0
        xorps   %xmm1, %xmm2
        comiss  %xmm0, %xmm2
        jb      .L35
        shrq    $32, %rsi
        shrq    $32, %rdi
        movd    %esi, %xmm5
        movd    %edi, %xmm6
        xorps   %xmm1, %xmm5
        comiss  %xmm6, %xmm5
        setnb   %al
        ret
.L35:
        xorl    %eax, %eax
        ret
TestOverlap_2_2(AABB, AABB):
        movq    %xmm1, %rax
        movq    %xmm0, %rsi
        movq    %xmm3, %r9
        movss   .LC0(%rip), %xmm1
        movq    %rsi, %rdi
        movq    %rax, %rsi
        movq    %r9, %rcx
        movq    %xmm2, %rax
        movdqa  %xmm0, %xmm2
        movdqa  %xmm3, %xmm0
        xorps   %xmm1, %xmm0
        movq    %rax, %rdx
        comiss  %xmm2, %xmm0
        movd    %esi, %xmm2
        setnb   %al
        shrq    $32, %rcx
        shrq    $32, %rdi
        movd    %ecx, %xmm0
        xorps   %xmm1, %xmm0
        movd    %edi, %xmm5
        comiss  %xmm5, %xmm0
        movd    %edx, %xmm0
        xorps   %xmm1, %xmm0
        setnb   %cl
        andl    %ecx, %eax
        comiss  %xmm2, %xmm0
        setnb   %cl
        shrq    $32, %rdx
        shrq    $32, %rsi
        movd    %edx, %xmm4
        andl    %ecx, %eax
        xorps   %xmm1, %xmm4
        movd    %esi, %xmm6
        comiss  %xmm6, %xmm4
        setnb   %dl
        andl    %edx, %eax
        ret
Union_2(AABB, AABB):
        punpcklqdq      %xmm3, %xmm2
        punpcklqdq      %xmm1, %xmm0
        movaps  %xmm2, %xmm4
        minps   %xmm0, %xmm4
        movaps  %xmm4, -24(%rsp)
        movq    -16(%rsp), %xmm1
        movq    -24(%rsp), %xmm0
        ret
.LC0:
        .long   -2147483648
        .long   0
        .long   0
        .long   0
Compiled with clang 18 at -O3
TestOverlap_1(AABB, AABB):
        ucomiss xmm3, xmm0
        setae   cl
        cmpleps xmm0, xmm3
        pextrw  edx, xmm0, 2
        ucomiss xmm1, xmm2
        setae   al
        and     al, cl
        and     al, dl
        ret
Union_1(AABB, AABB):
        movaps  xmm4, xmm2
        movlhps xmm4, xmm1
        movaps  xmm5, xmm0
        movlhps xmm5, xmm3
        cmpltps xmm4, xmm5
        movlhps xmm2, xmm3
        movlhps xmm0, xmm1
        andps   xmm2, xmm4
        andnps  xmm4, xmm0
        orps    xmm4, xmm2
        pshufd  xmm1, xmm4, 238
        movaps  xmm0, xmm4
        ret
.LCPI2_0:
        .long   0x80000000
        .long   0x80000000
        .long   0x80000000
        .long   0x80000000
TestOverlap_2(AABB, AABB):
        movaps  xmm4, xmmword ptr [rip + .LCPI2_0]
        xorps   xmm4, xmm3
        ucomiss xmm4, xmm0
        jae     .LBB2_2
        xor     eax, eax
        ret
.LBB2_2:
        shufps  xmm0, xmm0, 85
        shufps  xmm3, xmm3, 85
        movaps  xmm4, xmmword ptr [rip + .LCPI2_0]
        xorps   xmm3, xmm4
        xor     eax, eax
        ucomiss xmm3, xmm0
        jb      .LBB2_5
        xorps   xmm4, xmm2
        ucomiss xmm4, xmm1
        jb      .LBB2_5
        shufps  xmm2, xmm2, 85
        xorps   xmm2, xmmword ptr [rip + .LCPI2_0]
        shufps  xmm1, xmm1, 85
        ucomiss xmm2, xmm1
        setae   al
.LBB2_5:
        ret
.LCPI3_0:
        .long   0x80000000
        .long   0x80000000
        .long   0x80000000
        .long   0x80000000
TestOverlap_2_2(AABB, AABB):
        shufps  xmm3, xmm2, 65
        xorps   xmm3, xmmword ptr [rip + .LCPI3_0]
        shufps  xmm0, xmm1, 65
        cmpnleps        xmm0, xmm3
        movmskps        eax, xmm0
        test    eax, eax
        sete    al
        ret
Union_2(AABB, AABB):
        movlhps xmm2, xmm3
        movlhps xmm0, xmm1
        minps   xmm2, xmm0
        movaps  xmm1, xmm2
        unpckhpd        xmm1, xmm2
        movaps  xmm0, xmm2
        ret

I won’t analyze this very deeply, but there are some interesting things going on:

  • GCC produces almost twice the code and emits more jumps (as a GCC developer myself, sigh…​)

  • Clang produces branchless and very concise (and quite beautiful) code for TestOverlap_1. One nice trick is that it uses a packed comparison for half the comparisons to reduce the number of operations needed.

  • Note that some instructions in these functions only exist due to the ABI and the compiler should produce better code once the functions are inlined. For example, all instructions except for minps in Union_2 are unnecessary.

  • There is no vector negation instruction, so -x is implemented with two instructions that also involve a memory read (movaps xmm4, xmmword ptr [rip + .LCPI2_0] + xorps xmm4, xmm3). This is unfortunate; one alternative is to use 0.0f - x to get xorps xmm5, xmm5 and subss xmm5, xmm3 but this shouldn’t be very important.

  • TestOverlap_2 does very badly for both compilers. I have included TestOverlap_2_2, which is the same as TestOverlap_2 but uses bitwise & instead of logical &&, and for clang this does a much better job: branchless, short, and we can spot the expected cmpnleps. I find it funny and sad that such a difference can be observed due to using & instead of && in code without side effects…​ but anyway.

My takeaway is: clang tries and does decently, but code generation is still sensitive to small things, while GCC is quite far from optimal (note to self: open a GCC ticket with these testcases). In any case, since we’re dealing with potentially very hot functions and the compilers aren’t consistent enough, the real answer is that we have to use SIMD intrinsics (or an equivalent SIMD library of your choice).

Explicit SIMD with intrinsics

My current SIMD implementations for these, at the time of writing, are:

bool TestOverlap(__m128 a, __m128 b) {
  // Negate B
  b = -b; // or 0.0f - b
  // Shuffle AABB to align a.max with -b.min and a.min with -b.max
  a = _mm_shuffle_ps(a, a, _MM_SHUFFLE(1, 0, 3, 2));
  // Overlap iff a <= b
  __m128 cmp = (a <= b);
  return _mm_test_all_ones(_mm_castps_si128(cmp));
}

__m128 Union(__m128 a, __m128 b) {
  return _mm_min_ps(a, b);
}

The code generation comparison of these SIMD functions with the rest can be seen in this godbolt link. The union version is shorter only because of the ABI, so there’s no real difference, and the handwritten TestOverlap_3 is only one instruction shorter than the previously best function TestOverlap_2_2. But more importantly, we have gained consistent and predictable code generation. GCC’s TestOverlap_3 is unfortunately still a bit worse, even with the intrinsics, and results in two more instructions than clang’s.
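One caveat: the -b and a <= b forms above rely on GCC/clang vector extensions, and _mm_test_all_ones requires SSE4.1. A roughly equivalent sketch using only baseline SSE intrinsics (my own variant, not the code from the godbolt link) would be:

```cpp
#include <immintrin.h>

// Both AABBs are stored as (minx, miny, -maxx, -maxy) in the vector lanes.
bool TestOverlap_portable(__m128 a, __m128 b) {
  // There is no vector negate instruction: flip the sign bit of every
  // lane of b by XORing with 0x80000000.
  b = _mm_xor_ps(b, _mm_set1_ps(-0.0f));
  // Swap the min and max halves of a so each lane of a lines up with
  // the lane of -b it must be compared against.
  a = _mm_shuffle_ps(a, a, _MM_SHUFFLE(1, 0, 3, 2));
  // Overlap iff a <= -b holds in all four lanes.
  return _mm_movemask_ps(_mm_cmple_ps(a, b)) == 0xF;
}
```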

Performance?

At this point, a good question would be "Are the SIMD versions any faster?". I don’t want to create microbenchmarks for these, and I believe they would be misleading about their real-world performance implications. Instead, I will just comment on the generated code:

  • Union_2/Union_3 consist of measurably fewer instructions, so they could be faster in principle.

  • TestOverlap_3 is marginally better (two fewer instructions) than the best TestOverlap_1. Although this could be important depending on the use case, it’s not as good as I’d like.

  • The intrinsics versions are much better than the worst options, so depending on your compiler and flags, they could be a measurable improvement.

A final optimization/observation

As I mentioned, these implementations were part of an effort to optimize a broad-phase algorithm. The hottest part of the algorithm looked like this:

void CollisionDetection(AABB* aabbs, AABB ref) {
  ...
  for (int i = start; i < end; i++) {
    if (TestOverlap(aabbs[i], ref)) {
      // do something
    }
  }
  ...
}

where the 'do something' part is just a few instructions. This pattern is common in many broad-phase algorithms, which is why I’ll explain one last but important optimization. If you’ll excuse some more assembly, consider a function like:

void CollisionDetection(__m128 a, __m128 b) {
  if (TestOverlap_3(a, b)) {
    do_something();
  }
}

The generated code is

CollisionDetection(float __vector(4), float __vector(4)):
  xorps   xmm1, xmmword ptr [rip + .LCPI7_0]
  shufps  xmm0, xmm0, 78
  cmpleps xmm0, xmm1
  pcmpeqd xmm1, xmm1
  ptest   xmm0, xmm1
  jb      do_something()
  ret

If we discount the jb and ret, the actual overlap test is 5 instructions, which is quite good. If we were to use TestOverlap_1 instead, things would get worse. But if you recall the actual loop that we want to optimize, an important insight is that one of the two AABBs is loop-invariant: we’re testing overlap between a specific AABB and a collection of AABBs. Of the 5 instructions in the hot loop, one is the negation and another is the shuffle. Because we can shuffle either of the two vectors and get the same result, we can 'preprocess' the ref AABB outside of the loop, after which the overlap test consists of just three instructions (cmpleps + pcmpeqd + ptest). If we swap a and b in the _mm_shuffle_ps call in TestOverlap_3 and use it within a loop where one of the two AABBs is loop-invariant, the compiler will (probably) move the negation and shuffle outside of the loop.
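A sketch of this hoisted form (the helper names are mine): PrepareRef does the negation and shuffle once, and the per-iteration test is reduced to a packed compare plus a mask test.

```cpp
#include <immintrin.h>

// Run once, outside the loop: negate all lanes of the loop-invariant
// AABB and pre-swap its min/max halves.
__m128 PrepareRef(__m128 ref) {
  ref = _mm_xor_ps(ref, _mm_set1_ps(-0.0f));  // flip sign bits
  return _mm_shuffle_ps(ref, ref, _MM_SHUFFLE(1, 0, 3, 2));
}

// Run per iteration: overlap iff the compare succeeds in all four lanes.
bool TestOverlapPrepared(__m128 aabb, __m128 prepared_ref) {
  return _mm_movemask_ps(_mm_cmple_ps(aabb, prepared_ref)) == 0xF;
}
```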

We have a single-instruction union and an overlap test in three instructions (for a particular type of loop). I doubt we can do better for these, so let me say: mission complete!

Closing

Although I believe I provided enough details, you should take care if you implement this in your code. Remember to check the generated assembly because things can go wrong for several reasons and compilers can be fragile. Also while the same ideas could be used for 3D AABBs, they don’t apply as cleanly (6 coordinates) and I haven’t thought much about that case.

Feel free to use these ideas and code; I would appreciate a mention of my work if it helped you. I would also like to know if anyone finds these techniques useful in practice.

I should note that I’ve been meaning to write this up for a long time and didn’t. I only now found a good excuse, which is Erin Catto’s amazing Box2D v3 (i.e. https://github.com/erincatto/box2c) and all the impressive optimizations he is doing. I hope these optimizations find their way into Box2D (and other physics engines).