Vectorization through Link-Time-Optimization (LTO)

June 01, 2025

Here I look at the ability of gcc to vectorize through LTO. Vectorization often happens after certain functions are inlined, for example matrix index methods. But if these methods live in a separate translation unit, can we still get vectorization by enabling LTO? The “first-principles” answer is “obviously yes it should,” but one should never just take this for granted. I test it here.

To test this I built two otherwise identical matrix classes: Matrix<T> and LtoMatrix<T>. Matrix<T> is a standard inlined template class, while LtoMatrix<T> is manually instantiated as LtoMatrix<float> in a separate ltomatrix.cpp file that is compiled on its own. Both classes implement operator()(size_t,size_t), and this entry point is usually where compilers infer vectorization: if the inner loop of a matrix-matrix multiplication runs over the fastest dimension of the matrix and operator()(size_t,size_t) has been inlined, the compiler can deduce that it may issue packed vector instructions. In the case of LtoMatrix, however, this method lives in a separate translation unit and cannot be inlined directly. With LTO enabled it can be, but does that mean it will be?
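For concreteness, here is a minimal sketch of the two classes under test. The class names and the operator()(size_t,size_t) indexer come from the experiment itself; the member names, row-major layout, and exact file split are assumptions of mine, and the real code is in the repository linked below.

// matrix.h (sketch): a header-only template, so operator() is visible
// and inlinable at every call site.
#include <cstddef>
#include <vector>

template <typename T>
class Matrix {
public:
  Matrix(std::size_t rows, std::size_t cols)
      : m_rows(rows), m_cols(cols), m_data(rows * cols, T{}) {}
  // Row-major indexer: j is the fastest-varying dimension.
  T& operator()(std::size_t i, std::size_t j) { return m_data[i * m_cols + j]; }
  const T& operator()(std::size_t i, std::size_t j) const { return m_data[i * m_cols + j]; }
private:
  std::size_t m_rows, m_cols;
  std::vector<T> m_data;
};

// ltomatrix.h (sketch): the same interface, but the indexer is only declared here.
template <typename T>
class LtoMatrix {
public:
  LtoMatrix(std::size_t rows, std::size_t cols);
  T& operator()(std::size_t i, std::size_t j);
  const T& operator()(std::size_t i, std::size_t j) const;
private:
  std::size_t m_rows, m_cols;
  std::vector<T> m_data;
};

// ltomatrix.cpp (sketch): the definitions live in their own translation unit,
// with a manual instantiation for float.
template <typename T>
LtoMatrix<T>::LtoMatrix(std::size_t rows, std::size_t cols)
    : m_rows(rows), m_cols(cols), m_data(rows * cols, T{}) {}

template <typename T>
T& LtoMatrix<T>::operator()(std::size_t i, std::size_t j) { return m_data[i * m_cols + j]; }

template <typename T>
const T& LtoMatrix<T>::operator()(std::size_t i, std::size_t j) const { return m_data[i * m_cols + j]; }

template class LtoMatrix<float>;  // explicit instantiation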

The experiment

I wrote a simple experiment here: https://github.com/ReidAtcheson/lto_vectorization/tree/main. It includes a spack.yaml in case you want to use the same compiler versions I tested with.
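The important part of the build is that LTO is enabled on both the static library and the executable. The repository's CMakeLists.txt is the source of truth; as a rough sketch (target names taken from the build output below, the exact flags and Google Benchmark setup are assumptions of mine), it amounts to something like:

cmake_minimum_required(VERSION 3.16)
project(lto_vectorization CXX)

find_package(benchmark REQUIRED)

# LtoMatrix<float> lives in its own translation unit / static library.
add_library(ltomatrix STATIC ltomatrix.cpp)

add_executable(main main.cpp)
target_link_libraries(main PRIVATE ltomatrix benchmark::benchmark)

# Optimization and vectorization flags, plus link-time optimization
# (with gcc this adds -flto to both the compile and link lines).
foreach(tgt ltomatrix main)
  target_compile_options(${tgt} PRIVATE -O3 -march=native)
  set_property(TARGET ${tgt} PROPERTY INTERPROCEDURAL_OPTIMIZATION TRUE)
endforeach()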

Benchmark results

First I show the raw timings using Google Benchmark:

reidatcheson@pop-os:~/clones/lto_vectorization$ ./run.sh
-- Configuring done (0.0s)
-- Generating done (0.0s)
-- Build files have been written to: /home/reidatcheson/clones/lto_vectorization/build
[ 25%] Building CXX object CMakeFiles/ltomatrix.dir/ltomatrix.cpp.o
[ 50%] Linking CXX static library libltomatrix.a
[ 50%] Built target ltomatrix
[ 75%] Building CXX object CMakeFiles/main.dir/main.cpp.o
[100%] Linking CXX executable main
[100%] Built target main
2025-06-01T14:14:28-04:00
Running ./main
Run on (16 X 5000 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 1280 KiB (x8)
  L3 Unified 18432 KiB (x1)
Load Average: 0.94, 0.71, 0.66
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------
BM_MatMul<Matrix<float>, 32>            964 ns          963 ns       718122 FLOPS=68.0343G/s INSTRUCTIONS=9.96k
BM_MatMul<Matrix<float>, 64>           7920 ns         7919 ns        86617 FLOPS=66.2042G/s INSTRUCTIONS=64.887k
BM_MatMul<Matrix<float>, 128>         97584 ns        97561 ns         7354 FLOPS=42.9916G/s INSTRUCTIONS=495.382k
BM_MatMul<Matrix<float>, 256>        908046 ns       907628 ns          797 FLOPS=36.9694G/s INSTRUCTIONS=7.73568M
BM_MatMul<LtoMatrix<float>, 32>        1026 ns         1025 ns       675680 FLOPS=63.9093G/s INSTRUCTIONS=9.82477k
BM_MatMul<LtoMatrix<float>, 64>        7842 ns         7841 ns        88333 FLOPS=66.8612G/s INSTRUCTIONS=64.887k
BM_MatMul<LtoMatrix<float>, 128>      94233 ns        94220 ns         7549 FLOPS=44.5162G/s INSTRUCTIONS=490.042k
BM_MatMul<LtoMatrix<float>, 256>     929686 ns       929582 ns          755 FLOPS=36.0963G/s INSTRUCTIONS=7.70291M

The timings between Matrix<float> and LtoMatrix<float> are pretty comparable.
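For reference, the kernel behind BM_MatMul is a plain triple-loop matrix-matrix multiply driven by Google Benchmark. A minimal sketch of what that benchmark might look like follows; the names BM_MatMul, Matrix, and LtoMatrix come from the output above, while the loop ordering and the omission of the FLOPS/INSTRUCTIONS counters are simplifications of mine (the real code is in the repository).

#include <cstddef>
#include <benchmark/benchmark.h>
// #include "matrix.h" / "ltomatrix.h"  (hypothetical header names, see the sketch above)

template <typename MatT, std::size_t N>
static void BM_MatMul(benchmark::State& state) {
  MatT a(N, N), b(N, N), c(N, N);
  for (auto _ : state) {
    for (std::size_t i = 0; i < N; ++i)
      for (std::size_t k = 0; k < N; ++k)
        for (std::size_t j = 0; j < N; ++j)   // inner loop over the fastest dimension
          c(i, j) += a(i, k) * b(k, j);       // vectorizable once operator() is inlined
    benchmark::DoNotOptimize(c(0, 0));
  }
  // The real benchmark also reports FLOPS and retired-instruction counters.
}

BENCHMARK_TEMPLATE(BM_MatMul, Matrix<float>, 32);
BENCHMARK_TEMPLATE(BM_MatMul, LtoMatrix<float>, 32);
BENCHMARK_MAIN();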

Inspecting assembly

Both the LTO and the manually inlined versions produced similar-quality vectorization, both using packed FMA operations for the bulk of the computational kernel.

Matrix

00000000004023c0 <void BM_MatMul<Matrix<float>, 32ul>(benchmark::State&)>:
(...google benchmark stuff)
(...)
  4025ed: call   4010f0 <operator new(unsigned long)@plt>
  4025f2: mov    edx,0x1000
  4025f7: xor    esi,esi
  4025f9: mov    rdi,rax
  4025fc: call   4010a0 <memset@plt>
  402601: mov    rdi,r14
  402604: xor    esi,esi
  402606: mov    r8,rax
  402609: mov    rdx,rax
  40260c: nop    DWORD PTR [rax+0x0]
  402610: vmovups ymm3,YMMWORD PTR [rdx]
  402614: vmovups ymm2,YMMWORD PTR [rdx+0x20]
  402619: mov    rax,rbx
  40261c: mov    rcx,rdi
  40261f: vmovups ymm1,YMMWORD PTR [rdx+0x40]
  402624: vmovups ymm0,YMMWORD PTR [rdx+0x60]
  402629: nop    DWORD PTR [rax+0x0]
  402630: vbroadcastss ymm5,DWORD PTR [rcx]
  402635: vfmadd231ps ymm3,ymm5,YMMWORD PTR [rax]
  40263a: add    rax,0x100
  402640: add    rcx,0x8
  402644: vfmadd231ps ymm2,ymm5,YMMWORD PTR [rax-0xe0]
  40264d: vbroadcastss ymm4,DWORD PTR [rcx-0x4]
  402653: vfmadd231ps ymm1,ymm5,YMMWORD PTR [rax-0xc0]
  40265c: vfmadd231ps ymm3,ymm4,YMMWORD PTR [rax-0x80]
  402662: vfmadd231ps ymm0,ymm5,YMMWORD PTR [rax-0xa0]
  40266b: vfmadd231ps ymm2,ymm4,YMMWORD PTR [rax-0x60]
  402671: vfmadd231ps ymm1,ymm4,YMMWORD PTR [rax-0x40]
  402677: vfmadd231ps ymm0,ymm4,YMMWORD PTR [rax-0x20]
  40267d: cmp    rax,r15
  402680: vmovups YMMWORD PTR [rdx],ymm3
  402684: vmovups YMMWORD PTR [rdx+0x20],ymm2
  402689: vmovups YMMWORD PTR [rdx+0x40],ymm1
  40268e: vmovups YMMWORD PTR [rdx+0x60],ymm0
(loop stuff..)

LtoMatrix

00000000004033c0 <void BM_MatMul<LtoMatrix<float>, 32ul>(benchmark::State&)>:
(...google benchmark stuff)
(...)
  4035ed: call   4010f0 <operator new(unsigned long)@plt>
  4035f2: mov    edx,0x1000
  4035f7: xor    esi,esi
  4035f9: mov    rdi,rax
  4035fc: call   4010a0 <memset@plt>
  403601: mov    rdi,r14
  403604: xor    esi,esi
  403606: mov    r8,rax
  403609: mov    rdx,rax
  40360c: nop    DWORD PTR [rax+0x0]
  403610: vmovups ymm3,YMMWORD PTR [rdx]
  403614: vmovups ymm2,YMMWORD PTR [rdx+0x20]
  403619: mov    rax,rbx
  40361c: mov    rcx,rdi
  40361f: vmovups ymm1,YMMWORD PTR [rdx+0x40]
  403624: vmovups ymm0,YMMWORD PTR [rdx+0x60]
  403629: nop    DWORD PTR [rax+0x0]
  403630: vbroadcastss ymm5,DWORD PTR [rcx]
  403635: vfmadd231ps ymm3,ymm5,YMMWORD PTR [rax]
  40363a: add    rax,0x100
  403640: add    rcx,0x8
  403644: vfmadd231ps ymm2,ymm5,YMMWORD PTR [rax-0xe0]
  40364d: vbroadcastss ymm4,DWORD PTR [rcx-0x4]
  403653: vfmadd231ps ymm1,ymm5,YMMWORD PTR [rax-0xc0]
  40365c: vfmadd231ps ymm3,ymm4,YMMWORD PTR [rax-0x80]
  403662: vfmadd231ps ymm0,ymm5,YMMWORD PTR [rax-0xa0]
  40366b: vfmadd231ps ymm2,ymm4,YMMWORD PTR [rax-0x60]
  403671: vfmadd231ps ymm1,ymm4,YMMWORD PTR [rax-0x40]
  403677: vfmadd231ps ymm0,ymm4,YMMWORD PTR [rax-0x20]
  40367d: cmp    rax,r15
  403680: vmovups YMMWORD PTR [rdx],ymm3
  403684: vmovups YMMWORD PTR [rdx+0x20],ymm2
  403689: vmovups YMMWORD PTR [rdx+0x40],ymm1
  40368e: vmovups YMMWORD PTR [rdx+0x60],ymm0
(loop stuff..)

Conclusions

LTO successfully inlined the indexer method and followed up with vectorization.

Why do this?

A lot of C++ code aggressively templates everything in the hope that the resulting inlining will help optimizations. I have found that for a lot of numeric code, LTO is actually sufficient.

This is useful for a few reasons. Explicitly instantiating classes for the supported types (e.g. float32/64 and complex64/128) can improve compile times for “regular” builds: the implementation only gets parsed once, and the instantiations can be compiled in parallel (with a separate instantiation file for each supported type, for example).

Furthermore, it allows for better typechecking. Rather than deferring typechecking to the point where the class is used, you explicitly say up front which types you want to support, and you get a much earlier signal of type problems.
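As a rough sketch of that pattern (the file names ltomatrix_impl.h and ltomatrix_float.cpp are hypothetical, and only the float instantiation actually appears in the experiment):

// ltomatrix.h (sketch): declare which instantiations exist elsewhere, so no
// other translation unit has to re-instantiate the implementation.
extern template class LtoMatrix<float>;
extern template class LtoMatrix<double>;

// ltomatrix_float.cpp (sketch): one small instantiation file per supported type.
// These compile in parallel, and each explicit instantiation fully type-checks
// every member of the class for that type here, rather than at first use.
#include "ltomatrix_impl.h"   // hypothetical header holding the template definitions
template class LtoMatrix<float>;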