Here I look at the ability of GCC to vectorize through LTO. Vectorization often happens after inlining certain functions, for example matrix index methods. But if those methods happen to live in a separate translation unit, can we still achieve vectorization after enabling LTO? The “first-principles” answer is “obviously yes, it should,” but one should never just take this for granted. I test it here.
To test this I build two identical matrix classes: Matrix<T> and LtoMatrix<T>. The difference between them is that Matrix<T> is a standard inlined template class, while LtoMatrix<T> manually instantiates LtoMatrix<float> and provides a separate ltomatrix.cpp file for separate compilation. Both matrix classes implement operator()(size_t,size_t), and this entry point is usually where compilers infer vectorization: for example, if the inner loop of a matrix-matrix multiplication runs over the fastest dimension of the matrix and the compiler has inlined operator()(size_t,size_t), it will correctly deduce that it can issue packed vector instructions. But in the case of LtoMatrix this method is in a separate translation unit and can’t be directly inlined. With LTO enabled, however, it can be. But does that mean it will?
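For concreteness, here is a minimal sketch of what the inlined class and the multiplication kernel might look like. This is illustrative only: the member names, the row-major layout, and the matmul driver are my assumptions, not the exact code from the repository.

#include <cstddef>
#include <vector>

// Fully inlined template: operator() is visible in every translation unit
// and can be inlined at every call site.
template <typename T>
class Matrix {
public:
  Matrix(std::size_t rows, std::size_t cols)
      : rows_(rows), cols_(cols), data_(rows * cols) {}
  // Row-major indexing: the second index is the fastest dimension.
  T& operator()(std::size_t i, std::size_t j) { return data_[i * cols_ + j]; }
  const T& operator()(std::size_t i, std::size_t j) const { return data_[i * cols_ + j]; }
private:
  std::size_t rows_, cols_;
  std::vector<T> data_;
};

// Naive matrix-matrix multiply. The inner loop walks the fastest dimension
// of B and C, so once operator() is inlined the compiler can turn it into
// packed vector instructions.
template <typename Mat>
void matmul(const Mat& A, const Mat& B, Mat& C, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t k = 0; k < n; ++k)
      for (std::size_t j = 0; j < n; ++j)
        C(i, j) += A(i, k) * B(k, j);
}

LtoMatrix<T> exposes the same interface, but its operator() body lives in ltomatrix.cpp, so without LTO the call cannot be inlined into the loop above.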
I wrote a simple experiment here: https://github.com/ReidAtcheson/lto_vectorization/tree/main. It includes a spack.yaml in case you want to use the same compiler versions I tested with.
First I show the raw timings using Google Benchmark:
reidatcheson@pop-os:~/clones/lto_vectorization$ ./run.sh
-- Configuring done (0.0s)
-- Generating done (0.0s)
-- Build files have been written to: /home/reidatcheson/clones/lto_vectorization/build
[ 25%] Building CXX object CMakeFiles/ltomatrix.dir/ltomatrix.cpp.o
[ 50%] Linking CXX static library libltomatrix.a
[ 50%] Built target ltomatrix
[ 75%] Building CXX object CMakeFiles/main.dir/main.cpp.o
[100%] Linking CXX executable main
[100%] Built target main
2025-06-01T14:14:28-04:00
Running ./main
Run on (16 X 5000 MHz CPU s)
CPU Caches:
L1 Data 48 KiB (x8)
L1 Instruction 32 KiB (x8)
L2 Unified 1280 KiB (x8)
L3 Unified 18432 KiB (x1)
Load Average: 0.94, 0.71, 0.66
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
BM_MatMul<Matrix<float>, 32>            964 ns          963 ns       718122 FLOPS=68.0343G/s INSTRUCTIONS=9.96k
BM_MatMul<Matrix<float>, 64>           7920 ns         7919 ns        86617 FLOPS=66.2042G/s INSTRUCTIONS=64.887k
BM_MatMul<Matrix<float>, 128>         97584 ns        97561 ns         7354 FLOPS=42.9916G/s INSTRUCTIONS=495.382k
BM_MatMul<Matrix<float>, 256>        908046 ns       907628 ns          797 FLOPS=36.9694G/s INSTRUCTIONS=7.73568M
BM_MatMul<LtoMatrix<float>, 32>        1026 ns         1025 ns       675680 FLOPS=63.9093G/s INSTRUCTIONS=9.82477k
BM_MatMul<LtoMatrix<float>, 64>        7842 ns         7841 ns        88333 FLOPS=66.8612G/s INSTRUCTIONS=64.887k
BM_MatMul<LtoMatrix<float>, 128>      94233 ns        94220 ns         7549 FLOPS=44.5162G/s INSTRUCTIONS=490.042k
BM_MatMul<LtoMatrix<float>, 256>     929686 ns       929582 ns          755 FLOPS=36.0963G/s INSTRUCTIONS=7.70291M
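For reference, the BM_MatMul benchmark driving these numbers might look roughly like the sketch below. This assumes Google Benchmark; the exact setup, the counter flags, and the INSTRUCTIONS perf counter in the repository may differ, and matrix.hpp is a hypothetical header name for the classes sketched earlier.

#include <benchmark/benchmark.h>
#include <cstddef>
#include "matrix.hpp"  // Matrix<T>, LtoMatrix<T>, and matmul() from the sketch above

template <typename Mat, std::size_t N>
static void BM_MatMul(benchmark::State& state) {
  Mat A(N, N), B(N, N), C(N, N);
  for (auto _ : state) {
    matmul(A, B, C, N);             // kernel under test
    benchmark::DoNotOptimize(C);    // keep the result from being optimized away
  }
  // A naive N x N multiply performs 2*N^3 floating-point operations.
  state.counters["FLOPS"] = benchmark::Counter(
      2.0 * N * N * N, benchmark::Counter::kIsIterationInvariantRate);
}

BENCHMARK_TEMPLATE(BM_MatMul, Matrix<float>, 32);
BENCHMARK_TEMPLATE(BM_MatMul, LtoMatrix<float>, 32);
// ...and likewise for 64, 128, and 256.

BENCHMARK_MAIN();

As a sanity check on the FLOPS counter, for N=32 that is 2·32³ = 65,536 floating-point operations in roughly 964 ns, or about 68 GFLOP/s, which lines up with the reported value.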
The timings between Matrix<float> and LtoMatrix<float> seem pretty comparable. Both the LTO and the manually inlined versions also produced similar-quality vectorization, both using packed FMA operations for the bulk of the computational kernel:
00000000004023c0 <void BM_MatMul<Matrix<float>, 32ul>(benchmark::State&)>:
(...google benchmark stuff)
(...)
4025ed: call 4010f0 <operator new(unsigned long)@plt>
4025f2: mov edx,0x1000
4025f7: xor esi,esi
4025f9: mov rdi,rax
4025fc: call 4010a0 <memset@plt>
402601: mov rdi,r14
402604: xor esi,esi
402606: mov r8,rax
402609: mov rdx,rax
40260c: nop DWORD PTR [rax+0x0]
402610: vmovups ymm3,YMMWORD PTR [rdx]
402614: vmovups ymm2,YMMWORD PTR [rdx+0x20]
402619: mov rax,rbx
40261c: mov rcx,rdi
40261f: vmovups ymm1,YMMWORD PTR [rdx+0x40]
402624: vmovups ymm0,YMMWORD PTR [rdx+0x60]
402629: nop DWORD PTR [rax+0x0]
402630: vbroadcastss ymm5,DWORD PTR [rcx]
402635: vfmadd231ps ymm3,ymm5,YMMWORD PTR [rax]
40263a: add rax,0x100
402640: add rcx,0x8
402644: vfmadd231ps ymm2,ymm5,YMMWORD PTR [rax-0xe0]
40264d: vbroadcastss ymm4,DWORD PTR [rcx-0x4]
402653: vfmadd231ps ymm1,ymm5,YMMWORD PTR [rax-0xc0]
40265c: vfmadd231ps ymm3,ymm4,YMMWORD PTR [rax-0x80]
402662: vfmadd231ps ymm0,ymm5,YMMWORD PTR [rax-0xa0]
40266b: vfmadd231ps ymm2,ymm4,YMMWORD PTR [rax-0x60]
402671: vfmadd231ps ymm1,ymm4,YMMWORD PTR [rax-0x40]
402677: vfmadd231ps ymm0,ymm4,YMMWORD PTR [rax-0x20]
40267d: cmp rax,r15
402680: vmovups YMMWORD PTR [rdx],ymm3
402684: vmovups YMMWORD PTR [rdx+0x20],ymm2
402689: vmovups YMMWORD PTR [rdx+0x40],ymm1
40268e: vmovups YMMWORD PTR [rdx+0x60],ymm0
(loop stuff..)
00000000004033c0 <void BM_MatMul<LtoMatrix<float>, 32ul>(benchmark::State&)>:
(...google benchmark stuff)
(...)
4035ed: call 4010f0 <operator new(unsigned long)@plt>
4035f2: mov edx,0x1000
4035f7: xor esi,esi
4035f9: mov rdi,rax
4035fc: call 4010a0 <memset@plt>
403601: mov rdi,r14
403604: xor esi,esi
403606: mov r8,rax
403609: mov rdx,rax
40360c: nop DWORD PTR [rax+0x0]
403610: vmovups ymm3,YMMWORD PTR [rdx]
403614: vmovups ymm2,YMMWORD PTR [rdx+0x20]
403619: mov rax,rbx
40361c: mov rcx,rdi
40361f: vmovups ymm1,YMMWORD PTR [rdx+0x40]
403624: vmovups ymm0,YMMWORD PTR [rdx+0x60]
403629: nop DWORD PTR [rax+0x0]
403630: vbroadcastss ymm5,DWORD PTR [rcx]
403635: vfmadd231ps ymm3,ymm5,YMMWORD PTR [rax]
40363a: add rax,0x100
403640: add rcx,0x8
403644: vfmadd231ps ymm2,ymm5,YMMWORD PTR [rax-0xe0]
40364d: vbroadcastss ymm4,DWORD PTR [rcx-0x4]
403653: vfmadd231ps ymm1,ymm5,YMMWORD PTR [rax-0xc0]
40365c: vfmadd231ps ymm3,ymm4,YMMWORD PTR [rax-0x80]
403662: vfmadd231ps ymm0,ymm5,YMMWORD PTR [rax-0xa0]
40366b: vfmadd231ps ymm2,ymm4,YMMWORD PTR [rax-0x60]
403671: vfmadd231ps ymm1,ymm4,YMMWORD PTR [rax-0x40]
403677: vfmadd231ps ymm0,ymm4,YMMWORD PTR [rax-0x20]
40367d: cmp rax,r15
403680: vmovups YMMWORD PTR [rdx],ymm3
403684: vmovups YMMWORD PTR [rdx+0x20],ymm2
403689: vmovups YMMWORD PTR [rdx+0x40],ymm1
40368e: vmovups YMMWORD PTR [rdx+0x60],ymm0
(loop stuff..)
LTO successfully inlined the indexer method and followed up with vectorization.
A lot of C++ code is aggressively templated in the hope that the resulting inlining will help optimizations. I have found that for a lot of numeric code it is actually sufficient to use LTO instead.
This is useful for a few reasons. Explicitly instantiating classes for the supported types (e.g. float32/64 and complex64/128) can help compile times for “regular” builds: the implementation only gets parsed once, and the instantiations can be compiled in parallel (with a separate instantiation file for each supported type, for example).
Furthermore it allows for better typechecking: rather than deferring typechecking to the point where the class is used, you explicitly say up front which types you want to support and get a much earlier signal of type problems.
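A minimal sketch of that pattern, split across a header and its implementation file, might look like the following. The repository only instantiates LtoMatrix<float>; the double instantiation, member names, and file layout here are illustrative.

// ltomatrix.hpp -- interface only, no member function bodies.
#include <cstddef>
#include <vector>

template <typename T>
class LtoMatrix {
public:
  LtoMatrix(std::size_t rows, std::size_t cols);
  T& operator()(std::size_t i, std::size_t j);
private:
  std::size_t rows_, cols_;
  std::vector<T> data_;
};

// Suppress implicit instantiation in user code; the definitions are
// compiled exactly once, in ltomatrix.cpp.
extern template class LtoMatrix<float>;
extern template class LtoMatrix<double>;

// ltomatrix.cpp -- member definitions plus the explicit instantiations.
// Any type error in these bodies is diagnosed when this file is compiled,
// not at the point of first use in some downstream translation unit.
#include "ltomatrix.hpp"

template <typename T>
LtoMatrix<T>::LtoMatrix(std::size_t rows, std::size_t cols)
    : rows_(rows), cols_(cols), data_(rows * cols) {}

template <typename T>
T& LtoMatrix<T>::operator()(std::size_t i, std::size_t j) {
  return data_[i * cols_ + j];
}

template class LtoMatrix<float>;   // one line per supported type
template class LtoMatrix<double>;

With plain separate compilation, calls to operator() here cross a translation-unit boundary and cannot be inlined; with -flto passed to both the compile and link steps, the link-time optimizer can still inline them, which is exactly what the disassembly above shows.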