This writeup assumes some knowledge of SIMD programming, assembly and AABBs.
Introduction
Of the multiple ways an Axis-Aligned Bounding Box can be represented in 2D, I think the most common is to store the minimum and maximum coordinates. With this approach, most basic operations (overlap tests, union/intersection, etc.) have simple implementations and offer decent performance.
Some time ago, I came up with a modification to this AABB representation that has the potential for better performance. I haven’t seen this technique mentioned or used anywhere, but if you have seen it elsewhere, I would appreciate it if you could let me know.
So, let’s say we have a 2D AABB struct with two basic operations: TestOverlap (overlap check for two AABBs) and Union. These are the two basic operations that I wanted to optimize for a broad-phase collision detection algorithm. TestOverlap with a point and a corresponding Intersect function can be optimized in a similar way.
struct AABB {
    float minx, miny;
    float maxx, maxy;
};

bool TestOverlap(AABB a, AABB b) {
    return a.minx <= b.maxx
        && a.miny <= b.maxy
        && a.maxx >= b.minx
        && a.maxy >= b.miny;
}

AABB Union(AABB a, AABB b) {
    return {
        min(a.minx, b.minx),
        min(a.miny, b.miny),
        max(a.maxx, b.maxx),
        max(a.maxy, b.maxy)
    };
}

AABB Union(AABB a, float x, float y) { ... }
AABB Intersect(AABB a, AABB b) { ... }
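For reference, the elided functions might look like this (my sketch; their bodies are omitted above, and Intersect assumes the boxes actually overlap):

AABB Union(AABB a, float x, float y) {
    return {
        min(a.minx, x),
        min(a.miny, y),
        max(a.maxx, x),
        max(a.maxy, y)
    };
}

AABB Intersect(AABB a, AABB b) {
    // Valid when the boxes overlap; otherwise min/max cross over
    // and the result is an "inverted" (empty) box.
    return {
        max(a.minx, b.minx),
        max(a.miny, b.miny),
        min(a.maxx, b.maxx),
        min(a.maxy, b.maxy)
    };
}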
Basic idea
Since the optimizations are SIMD-oriented, a first observation is that the four float coordinates fit nicely in a 128-bit SIMD vector (or similarly a 256-bit vector if doubles are used). While the data fits nicely, the code has an awkward asymmetry: TestOverlap has a mixture of >= and <=, and Union has both min and max. Otherwise, it would be straightforward to use a packed vector min/max or a vector comparison.
So following this observation, my idea was to store the max coordinates of the AABB negated instead. Negating only one of the two corners makes the implementations symmetric and SIMD-friendly. If we substitute maxx/maxy with -maxx/-maxy in the above operations we get:
bool TestOverlap(AABB a, AABB b) {
    return a.minx <= -b.maxx
        && a.miny <= -b.maxy
        && a.maxx <= -b.minx  // -a.maxx >= b.minx
        && a.maxy <= -b.miny; // -a.maxy >= b.miny
}

AABB Union(AABB a, AABB b) {
    return {
        min(a.minx, b.minx),
        min(a.miny, b.miny),
        min(a.maxx, b.maxx), // -max(-a.maxx, -b.maxx)
        min(a.maxy, b.maxy)  // -max(-a.maxy, -b.maxy)
    };
}
This looks better! Union could now be a single packed minimum operation, and while TestOverlap needs multiple instructions, it has more optimization potential than before. It’s also great that this implementation detail can be abstracted away if we disallow direct member access:
struct AABB {
    AABB(float x1, float y1, float x2, float y2)
        : minx(x1), miny(y1), maxx(-x2), maxy(-y2) {}
    float getMinX() const { return minx; }
    float getMaxX() const { return -maxx; }
    ...
};
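As an aside, the point test mentioned in the introduction benefits in the same way; here is a minimal scalar sketch under the negated representation (my addition, following the same substitution):

bool TestOverlap(AABB a, float x, float y) {
    return a.minx <= x
        && a.miny <= y
        && a.maxx <= -x  // -a.maxx >= x
        && a.maxy <= -y; // -a.maxy >= y
}

All four comparisons now point the same way, so a single packed comparison of the AABB against (x, y, -x, -y) covers the whole test.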
Compilers and analysis of the generated code
Since we haven’t used any SIMD intrinsics, we’re basically leaving the rest up to the compiler… which may or may not be a great idea. We need to go further and check what assembly is produced by the latest (and greatest) compilers, at the time of writing GCC 14 and clang 18. The source code and generated assembly can be seen below or in this godbolt link.
Full source code
#include <cmath>

struct AABB {
    float minx, miny;
    float maxx, maxy;
};

// Normal implementation
bool TestOverlap_1(AABB a, AABB b) {
    return a.minx <= b.maxx
        && a.miny <= b.maxy
        && a.maxx >= b.minx
        && a.maxy >= b.miny;
}

AABB Union_1(AABB a, AABB b) {
    return {
        std::min(a.minx, b.minx),
        std::min(a.miny, b.miny),
        std::max(a.maxx, b.maxx),
        std::max(a.maxy, b.maxy)
    };
}

// Optimized representation implementation.
bool TestOverlap_2(AABB a, AABB b) {
    return a.minx <= -b.maxx
        && a.miny <= -b.maxy
        && a.maxx <= -b.minx  // -a.maxx >= b.minx
        && a.maxy <= -b.miny; // -a.maxy >= b.miny
}

bool TestOverlap_2_2(AABB a, AABB b) {
    return a.minx <= -b.maxx
         & a.miny <= -b.maxy
         & a.maxx <= -b.minx  // -a.maxx >= b.minx
         & a.maxy <= -b.miny; // -a.maxy >= b.miny
}

AABB Union_2(AABB a, AABB b) {
    return {
        std::min(a.minx, b.minx),
        std::min(a.miny, b.miny),
        std::min(a.maxx, b.maxx), // -max(-a.maxx, -b.maxx)
        std::min(a.maxy, b.maxy)  // -max(-a.maxy, -b.maxy)
    };
}
Compiled with gcc 14 at -O3
TestOverlap_1(AABB, AABB):
movq %xmm1, %rax
movq %xmm2, %rdx
movq %xmm0, %rsi
movq %rax, %rdi
movq %xmm3, %rax
movq %rsi, %rcx
xchgq %rdx, %rax
movd %edx, %xmm1
movq %rax, %rsi
movq %rdx, %rax
comiss %xmm0, %xmm1
jb .L10
shrq $32, %rcx
shrq $32, %rax
movd %eax, %xmm0
movd %ecx, %xmm4
comiss %xmm4, %xmm0
jb .L10
movd %esi, %xmm0
movd %edi, %xmm1
comiss %xmm0, %xmm1
setnb %al
ret
.L10:
xorl %eax, %eax
ret
Union_1(AABB, AABB):
movq %xmm1, %rax
movdqa %xmm0, %xmm8
movdqa %xmm2, %xmm5
movq %rax, %rsi
movq %xmm2, %rax
minss %xmm8, %xmm5
movq %xmm0, %rdi
movd %esi, %xmm7
movq %rax, %rdx
shrq $32, %rsi
movq %xmm3, %r9
movq %rdi, %rcx
shrq $32, %rdx
movdqa %xmm3, %xmm6
movd %edx, %xmm1
shrq $32, %rcx
movd %esi, %xmm3
maxss %xmm7, %xmm6
movq %r9, %xmm4
movd %ecx, %xmm2
shufps $85, %xmm4, %xmm4
movdqa %xmm4, %xmm0
movaps %xmm5, %xmm4
movaps %xmm1, %xmm5
minss %xmm2, %xmm5
maxss %xmm3, %xmm0
movaps %xmm6, %xmm1
unpcklps %xmm5, %xmm4
unpcklps %xmm0, %xmm1
movaps %xmm4, %xmm3
movlhps %xmm1, %xmm3
movaps %xmm3, -24(%rsp)
movq -16(%rsp), %xmm1
movq -24(%rsp), %xmm0
ret
TestOverlap_2(AABB, AABB):
movq %xmm1, %rax
movq %xmm2, %rdx
movq %xmm0, %rsi
movq %rax, %rdi
movq %xmm3, %rax
movdqa %xmm0, %xmm2
movq %rsi, %rcx
xchgq %rdx, %rax
movss .LC0(%rip), %xmm1
movd %edx, %xmm0
movq %rax, %rsi
movq %rdx, %rax
xorps %xmm1, %xmm0
comiss %xmm2, %xmm0
jb .L35
shrq $32, %rax
shrq $32, %rcx
movd %eax, %xmm0
movd %ecx, %xmm4
xorps %xmm1, %xmm0
comiss %xmm4, %xmm0
jb .L35
movd %esi, %xmm2
movd %edi, %xmm0
xorps %xmm1, %xmm2
comiss %xmm0, %xmm2
jb .L35
shrq $32, %rsi
shrq $32, %rdi
movd %esi, %xmm5
movd %edi, %xmm6
xorps %xmm1, %xmm5
comiss %xmm6, %xmm5
setnb %al
ret
.L35:
xorl %eax, %eax
ret
TestOverlap_2_2(AABB, AABB):
movq %xmm1, %rax
movq %xmm0, %rsi
movq %xmm3, %r9
movss .LC0(%rip), %xmm1
movq %rsi, %rdi
movq %rax, %rsi
movq %r9, %rcx
movq %xmm2, %rax
movdqa %xmm0, %xmm2
movdqa %xmm3, %xmm0
xorps %xmm1, %xmm0
movq %rax, %rdx
comiss %xmm2, %xmm0
movd %esi, %xmm2
setnb %al
shrq $32, %rcx
shrq $32, %rdi
movd %ecx, %xmm0
xorps %xmm1, %xmm0
movd %edi, %xmm5
comiss %xmm5, %xmm0
movd %edx, %xmm0
xorps %xmm1, %xmm0
setnb %cl
andl %ecx, %eax
comiss %xmm2, %xmm0
setnb %cl
shrq $32, %rdx
shrq $32, %rsi
movd %edx, %xmm4
andl %ecx, %eax
xorps %xmm1, %xmm4
movd %esi, %xmm6
comiss %xmm6, %xmm4
setnb %dl
andl %edx, %eax
ret
Union_2(AABB, AABB):
punpcklqdq %xmm3, %xmm2
punpcklqdq %xmm1, %xmm0
movaps %xmm2, %xmm4
minps %xmm0, %xmm4
movaps %xmm4, -24(%rsp)
movq -16(%rsp), %xmm1
movq -24(%rsp), %xmm0
ret
.LC0:
.long -2147483648
.long 0
.long 0
.long 0
Compiled with clang 18 at -O3
TestOverlap_1(AABB, AABB):
ucomiss xmm3, xmm0
setae cl
cmpleps xmm0, xmm3
pextrw edx, xmm0, 2
ucomiss xmm1, xmm2
setae al
and al, cl
and al, dl
ret
Union_1(AABB, AABB):
movaps xmm4, xmm2
movlhps xmm4, xmm1
movaps xmm5, xmm0
movlhps xmm5, xmm3
cmpltps xmm4, xmm5
movlhps xmm2, xmm3
movlhps xmm0, xmm1
andps xmm2, xmm4
andnps xmm4, xmm0
orps xmm4, xmm2
pshufd xmm1, xmm4, 238
movaps xmm0, xmm4
ret
.LCPI2_0:
.long 0x80000000
.long 0x80000000
.long 0x80000000
.long 0x80000000
TestOverlap_2(AABB, AABB):
movaps xmm4, xmmword ptr [rip + .LCPI2_0]
xorps xmm4, xmm3
ucomiss xmm4, xmm0
jae .LBB2_2
xor eax, eax
ret
.LBB2_2:
shufps xmm0, xmm0, 85
shufps xmm3, xmm3, 85
movaps xmm4, xmmword ptr [rip + .LCPI2_0]
xorps xmm3, xmm4
xor eax, eax
ucomiss xmm3, xmm0
jb .LBB2_5
xorps xmm4, xmm2
ucomiss xmm4, xmm1
jb .LBB2_5
shufps xmm2, xmm2, 85
xorps xmm2, xmmword ptr [rip + .LCPI2_0]
shufps xmm1, xmm1, 85
ucomiss xmm2, xmm1
setae al
.LBB2_5:
ret
.LCPI3_0:
.long 0x80000000
.long 0x80000000
.long 0x80000000
.long 0x80000000
TestOverlap_2_2(AABB, AABB):
shufps xmm3, xmm2, 65
xorps xmm3, xmmword ptr [rip + .LCPI3_0]
shufps xmm0, xmm1, 65
cmpnleps xmm0, xmm3
movmskps eax, xmm0
test eax, eax
sete al
ret
Union_2(AABB, AABB):
movlhps xmm2, xmm3
movlhps xmm0, xmm1
minps xmm2, xmm0
movaps xmm1, xmm2
unpckhpd xmm1, xmm2
movaps xmm0, xmm2
ret
I won’t analyze this very deeply, but there are some interesting things going on:
- GCC produces almost twice the code and emits more jumps (as a GCC developer myself, sigh…).
- Clang produces branchless and very concise (and quite beautiful) code for TestOverlap_1. One nice trick is that it uses a packed comparison for half of the comparisons to reduce the number of operations needed.
- Note that some instructions in these functions only exist due to the ABI, and the compiler should produce better code once the functions are inlined. For example, all instructions except for minps in Union_2 are unnecessary.
- There is no vector negation instruction, so -x is implemented with two instructions that also involve a memory read (movaps xmm4, xmmword ptr [rip + .LCPI2_0] + xorps xmm4, xmm3). This is unfortunate; one alternative is to use 0.0f - x to get xorps xmm5, xmm5 and subss xmm5, xmm3, but this shouldn’t be very important.
- TestOverlap_2 does very badly on both compilers. I have included TestOverlap_2_2, which is the same as TestOverlap_2 but uses bitwise & instead of logical &&, and for clang this does a much better job: branchless, short, and we can spot the expected cmpnleps. I find it funny and sad that such a difference can be observed due to using & instead of && in code without side effects… but anyway.
My takeaway: clang tries and does decently, but code generation is still sensitive to small things, while GCC is quite far from optimal (note to self: open a GCC ticket with these testcases). In any case, since we’re dealing with potentially very hot functions and the compilers aren’t consistent enough, the real answer is that we have to use SIMD intrinsics (or an equivalent SIMD library of your choice).
Explicit SIMD with intrinsics
My current SIMD implementations for these, at the time of writing, are:
#include <immintrin.h> // _mm_test_all_ones requires SSE4.1

bool TestOverlap(__m128 a, __m128 b) {
    // Negate b
    b = -b; // or 0.0f - b
    // Shuffle a to align a.max with -b.min and a.min with -b.max
    a = _mm_shuffle_ps(a, a, _MM_SHUFFLE(1, 0, 3, 2));
    // Overlap iff a <= b in all four lanes
    __m128 cmp = (a <= b);
    return _mm_test_all_ones(_mm_castps_si128(cmp));
}

__m128 Union(__m128 a, __m128 b) {
    return _mm_min_ps(a, b);
}
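To feed the struct from earlier into these functions, one option is a plain unaligned load; a small sketch (LoadAABB is a hypothetical helper, and it assumes the struct already stores maxx/maxy negated):

__m128 LoadAABB(const AABB* box) {
    // minx, miny, maxx, maxy are laid out contiguously,
    // so a single unaligned load suffices.
    return _mm_loadu_ps(&box->minx);
}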
The code generation comparison of these SIMD functions with the rest can be seen in this godbolt link. The Union version is shorter only because of the ABI, so there’s no real difference, and the handwritten TestOverlap_3 is only one instruction shorter than the previously best function, TestOverlap_2_2. But more importantly, we have won consistent and predictable code generation. GCC’s TestOverlap_3 is unfortunately still a bit worse, even with the intrinsics, and results in 2 more instructions than clang’s.
Performance?
At this point, a good question would be: are the SIMD versions actually faster? I don’t want to create microbenchmarks for these, and I believe they would be misleading about real-world performance. Instead, I will just comment on the generated code:
- Union_2/Union_3 consist of measurably fewer instructions, so they could be faster in principle.
- TestOverlap_3 is marginally better (2 fewer instructions) than the best TestOverlap_1. Although this could be important depending on the use case, it’s not as good as I’d like.
- The intrinsics versions are much better than the worst options, so depending on your compiler and flags, they could be a measurable improvement.
A final optimization/observation
As I mentioned, these implementations were part of an effort to optimize a broad-phase algorithm. The hottest part of the algorithm looked like this:
void CollisionDetection(AABB* aabbs, AABB ref) {
    ...
    for (int i = start; i < end; i++) {
        if (TestOverlap(aabbs[i], ref)) {
            // do something
        }
    }
    ...
}
where the 'do something' part is just a few instructions. This pattern is common in many broad-phase algorithms, which is why I’ll explain one last but important optimization. If you’ll excuse some more assembly, consider a function like:
void CollisionDetection(__m128 a, __m128 b) {
    if (TestOverlap_3(a, b)) {
        do_something();
    }
}
The generated code is:
CollisionDetection(float __vector(4), float __vector(4)):
xorps xmm1, xmmword ptr [rip + .LCPI7_0]
shufps xmm0, xmm0, 78
cmpleps xmm0, xmm1
pcmpeqd xmm1, xmm1
ptest xmm0, xmm1
jb do_something()
ret
If we discount the jb and ret, the actual overlap test is 5 instructions, which is quite good. If we were to use TestOverlap_1 instead, things would get worse. But if you recall the actual loop that we want to optimize, an important insight is that one of the two AABBs is loop-invariant; we’re testing overlap between a specific AABB and a collection of AABBs. Of the 5 instructions in the hot loop, one is the negation and another is the shuffle. Because we can shuffle either of the two vectors and get the same result, we can 'preprocess' the ref AABB outside of the loop, and then the overlap test consists of just three instructions (cmpleps + pcmpeqd + ptest)! If we swap a and b in the _mm_shuffle_ps in TestOverlap_3 and use it within a loop where one of the two AABBs is loop-invariant, the compiler will (probably) move the negation and shuffle outside of the loop.
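A minimal sketch of what that hoisting looks like (my assumptions: SSE4.1, the clang/GCC vector extensions used above, and hypothetical start/end bounds and do_something()):

void CollisionDetection(const __m128* aabbs, __m128 ref, int start, int end) {
    // Hoisted out of the loop: shuffle and negate the loop-invariant AABB,
    // turning (minx, miny, -maxx, -maxy) into (maxx, maxy, -minx, -miny).
    ref = _mm_shuffle_ps(ref, ref, _MM_SHUFFLE(1, 0, 3, 2));
    ref = -ref; // or _mm_sub_ps(_mm_setzero_ps(), ref)
    for (int i = start; i < end; i++) {
        // Three instructions per test: cmpleps + pcmpeqd + ptest.
        __m128 cmp = (aabbs[i] <= ref);
        if (_mm_test_all_ones(_mm_castps_si128(cmp))) {
            do_something();
        }
    }
}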
We have a single-instruction union and an overlap test in three instructions (for a particular type of loop). I doubt we can do better for these, so let me say: mission complete!
Closing
Although I believe I provided enough details, you should take care if you implement this in your own code. Remember to check the generated assembly, because things can go wrong for several reasons and compilers can be fragile. Also, while the same ideas could be used for 3D AABBs, they don’t apply as cleanly (six coordinates), and I haven’t thought much about that case.
Feel free to use these ideas and code; I would appreciate a mention of my work if it helped you. I would also like to know if anyone finds these techniques useful in practice.
I should note that I’ve been meaning to write this up for a long time and didn’t. I only now found a good excuse, which is Erin Catto’s amazing Box2D v3 (i.e. https://github.com/erincatto/box2c) and all the amazing optimizations he is doing. I hope these optimizations find their way into Box2D (and other physics engines).