Learning loop unrolling
February 12, 2024
In the excellent Freakonomics Radio podcast series about physicist Richard Feynman (stop reading now and go listen to it!), I heard an interesting tidbit from Feynman himself:
Knowing the name of something doesn’t mean you understand it. (video)
After reading a recent blog post from Modular about loop unrolling, I realized that I know what loop unrolling is, but I don’t really understand it. This post is my journey to a deeper understanding.
The setup
I decided to start this deep dive with the matrix multiplication documentation for Mojo, to see how loop unrolling impacts the naive “matmul” algorithm.
I’m doing all of this on an M2 macOS machine, but the results should apply equally to x64 processors.
I'm using a few excellent tools along the way: Ghidra for disassembly, Hyperfine for benchmarking, and Samply for profiling.
I'll be fiddling with this Mojo code, which depends on the Matrix type implemented in the documentation linked above, using elements of type DType.float32. You can find all the code I used for this exploration on GitHub.
alias N = 1024

fn matmul_naive(C: Matrix, A: Matrix, B: Matrix):
    for m in range(N):
        for k in range(N):
            for n in range(N):
                C[m, n] += A[m, k] * B[k, n]

def main():
    var C = Matrix(N, N)
    var A = Matrix.rand(N, N)
    var B = Matrix.rand(N, N)
    matmul_naive(C, A, B)
    A.data.free()
    B.data.free()
    C.data.free()
The wrong path
Sometimes blog posts make it seem like the author knew what they were doing the whole time. For me at least, this is usually not the case. I’ll start by showing how I went down the wrong path.
First, I compiled the code with mojo build, then used Ghidra to look at the assembly code. Most of the work here is done in this loop (which is the innermost of the three loops):
LAB_100000c48
ldr s1,[x15], #0x4
ldr s2,[x16]
fmadd s1,s0,s1,s2
str s1,[x16], #0x4
sub x17,x17,#0x1
cmp x17,#0x1
b.hi LAB_100000c48
The third instruction, fmadd, is the key one here: it does one multiply and one add, all in a single instruction, on 32-bit floating-point values. After the math, the loop counter in the x17 register is decremented and the code jumps back to the top of the loop for the next iteration.
So I wondered what would happen if I added the @unroll directive to the outermost loop.
fn matmul_naive(C: Matrix, A: Matrix, B: Matrix):
    @unroll(4)
    for m in range(N):
        for k in range(N):
            for n in range(N):
                C[m, n] += A[m, k] * B[k, n]
The Mojo compiler dutifully gives me four loops, with three additional copies of the assembly code loop shown above plus some setup and teardown code in between (I’ll avoid showing them here to save space). So this is loop unrolling, cool!
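Conceptually, unrolling the outer loop by four is roughly equivalent to writing something like this by hand. This is just a sketch (the function name is mine): it assumes N is divisible by 4 and omits the setup and cleanup code the compiler adds for the general case.

fn matmul_outer_unrolled_sketch(C: Matrix, A: Matrix, B: Matrix):
    # Hypothetical hand-unrolled equivalent of @unroll(4) on the m loop:
    # the loop body is duplicated four times and m advances by 4.
    for m in range(0, N, 4):
        for k in range(N):
            for n in range(N):
                C[m, n] += A[m, k] * B[k, n]
        for k in range(N):
            for n in range(N):
                C[m + 1, n] += A[m + 1, k] * B[k, n]
        for k in range(N):
            for n in range(N):
                C[m + 2, n] += A[m + 2, k] * B[k, n]
        for k in range(N):
            for n in range(N):
                C[m + 3, n] += A[m + 3, k] * B[k, n]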
The code must be faster now, right? Let’s see what Hyperfine tells us. I benchmarked the implementation without unrolling against unrolling the loop 4, 8, 16, and 32 times.
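For reference, the numbers below came from a Hyperfine invocation along these lines, with one binary built per unroll factor:

hyperfine ./matmul ./matmul_unrolled4 ./matmul_unrolled8 ./matmul_unrolled16 ./matmul_unrolled32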
Benchmark 1: ./matmul
Time (mean ± σ): 361.3 ms ± 3.1 ms [User: 342.3 ms, System: 2.0 ms]
Range (min … max): 357.3 ms … 368.2 ms 10 runs
Benchmark 2: ./matmul_unrolled4
Time (mean ± σ): 387.7 ms ± 1.7 ms [User: 369.3 ms, System: 1.9 ms]
Range (min … max): 384.9 ms … 389.9 ms 10 runs
Benchmark 3: ./matmul_unrolled8
Time (mean ± σ): 389.7 ms ± 5.2 ms [User: 373.5 ms, System: 1.7 ms]
Range (min … max): 376.6 ms … 394.3 ms 10 runs
Benchmark 4: ./matmul_unrolled16
Time (mean ± σ): 394.7 ms ± 1.5 ms [User: 376.3 ms, System: 1.8 ms]
Range (min … max): 392.4 ms … 396.7 ms 10 runs
Benchmark 5: ./matmul_unrolled32
Time (mean ± σ): 422.0 ms ± 2.4 ms [User: 403.4 ms, System: 1.9 ms]
Range (min … max): 416.3 ms … 425.6 ms 10 runs
Summary
./matmul ran
1.07 ± 0.01 times faster than ./matmul_unrolled4
1.08 ± 0.02 times faster than ./matmul_unrolled8
1.09 ± 0.01 times faster than ./matmul_unrolled16
1.17 ± 0.01 times faster than ./matmul_unrolled32
Huh? The code was faster before I started fiddling with loop unrolling. In every case, I made the code execute more slowly! I was very confused. Had I read to the end of the Mojo matrix multiply documentation, I would have noticed that @unroll makes more sense on the inner loop than on the outer loop. Instead, I went on a hunt to better understand loop unrolling in other languages.
Thankfully, I stumbled across the ARM C++ compiler documentation for #pragma unroll (which is how C++ compilers spell @unroll). Check out this example:
void matrix_multiply(float ** __restrict dest, float ** __restrict src1,
                     float ** __restrict src2, unsigned int n) {
    unsigned int i, j, k;
    for (i = 0; i < n; i++) {
        for (k = 0; k < n; k++) {
            float sum = 0.0f;
            /* #pragma unroll */
            for (j = 0; j < n; j++)
                sum += src1[i][j] * src2[j][k];
            dest[i][k] = sum;
        }
    }
}
It looks very similar to the naive matrix multiply implementation in Mojo. You can fiddle with this example on Compiler Explorer to see the impact of loop unrolling in C++. It got me thinking: maybe I should try to unroll the innermost loop in Mojo as well.
Getting it right
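This time the directive goes on the innermost loop instead. Here is a minimal sketch of the change, mirroring the code above (the exact spelling of the unroll hint may have changed in newer Mojo releases):

fn matmul_unrolled(C: Matrix, A: Matrix, B: Matrix):
    for m in range(N):
        for k in range(N):
            # The unroll hint now sits on the n loop, not the m loop.
            @unroll(4)
            for n in range(N):
                C[m, n] += A[m, k] * B[k, n]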
As the Mojo documentation for matrix multiplication suggests, forcing the inner loop to unroll does improve performance. Let’s get straight to the results from Hyperfine:
Benchmark 1: ./matmul
Time (mean ± σ): 362.0 ms ± 4.1 ms [User: 341.3 ms, System: 1.8 ms]
Range (min … max): 356.0 ms … 371.9 ms 10 runs
Benchmark 2: ./matmul_unrolled4
Time (mean ± σ): 128.2 ms ± 3.9 ms [User: 108.6 ms, System: 1.6 ms]
Range (min … max): 124.4 ms … 143.4 ms 21 runs
Benchmark 3: ./matmul_unrolled8
Time (mean ± σ): 116.2 ms ± 4.1 ms [User: 98.3 ms, System: 1.6 ms]
Range (min … max): 102.8 ms … 119.4 ms 24 runs
Benchmark 4: ./matmul_unrolled16
Time (mean ± σ): 114.3 ms ± 3.9 ms [User: 95.3 ms, System: 1.6 ms]
Range (min … max): 104.1 ms … 125.7 ms 22 runs
Benchmark 5: ./matmul_unrolled32
Time (mean ± σ): 114.3 ms ± 2.9 ms [User: 95.3 ms, System: 1.6 ms]
Range (min … max): 109.3 ms … 121.7 ms 23 runs
Summary
./matmul_unrolled32 ran
1.00 ± 0.04 times faster than ./matmul_unrolled16
1.02 ± 0.04 times faster than ./matmul_unrolled8
1.12 ± 0.04 times faster than ./matmul_unrolled4
3.17 ± 0.09 times faster than ./matmul
Now this is cool! The unrolled versions all ran about three times faster than the naive implementation. But this is not yet a complete understanding: why is the code so much faster with the inner loop unrolled? Let's have a look at the assembly code for that unrolled inner loop.
LAB_100001724
ldr q1,[x15], #0x10
ldr q2,[x16]
fmla v2.4S,v0.4S,v1.4S
str q2,[x16], #0x10
sub x17,x17,#0x4
cmp x17,#0x4
b.hi LAB_100001724
Can you spot the difference? It took me a while, so let’s look at the normal code and the unrolled code side by side.
Normal             | Unrolled
================== | ======================
ldr s1,[x15], #0x4 | ldr q1,[x15], #0x10
ldr s2,[x16] | ldr q2,[x16]
fmadd s1,s0,s1,s2 | fmla v2.4S,v0.4S,v1.4S
str s1,[x16], #0x4 | str q2,[x16], #0x10
sub x17,x17,#0x1 | sub x17,x17,#0x4
cmp x17,#0x1 | cmp x17,#0x4
b.hi LAB_100000c48 | b.hi LAB_100001724
Check out the multiply and add operation: it is spelled differently here, fmla. This is an ARM instruction that does a floating-point multiply and add, but on a vector of four values rather than a single scalar. Also, notice the subtraction on register x17: it now steps the loop counter by 4 instead of 1. Each time through the unrolled loop we get four operations, not just one.
We can confirm this by running the Samply profiler:
samply record ./matmul
samply record ./matmul_unrolled4
While Samply does not report exactly four times fewer samples at the loop decrement, it is pretty close. The unrolled inner loop is doing significantly fewer iterations.
Compilers ❤️ Processors
What is really going on here?
By telling the Mojo compiler to unroll that inner loop, we're giving it a better chance to use the available hardware. ARMv8-A processors (like the ones in my Mac) have 32 SIMD registers, each 128 bits wide. The "S" prefix in the normal assembly code means "Word", a 32-bit value. The "Q" prefix in the unrolled assembly code means "Quadword", a 128-bit value.
In this example our data type is 32-bit floating point, so the normal version is not using all of the available register space: it only uses 32 bits (one fourth) of registers s0, s1, and s2 on each iteration. With the unrolled loop, though, the compiler is able to spot the opportunity to "vectorize" the code and use the full width of each 128-bit register. It packs four 32-bit floating-point values side by side and then uses the fmla instruction to compute the multiply and add on all four at once. The ARM documentation explains the interesting .4S syntax on each of the arguments to fmla:
Vn.4S - 4 lanes, each containing a 32-bit element
So the compiler is able to use the registers available on this processor much more efficiently.
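In scalar terms, unrolling the inner loop by four hands the compiler a body like the following sketch. This is a hypothetical hand-expansion (the function name and signature are mine, and it assumes N is a multiple of 4); it is these four adjacent updates, all sharing the same A[m, k] factor, that the compiler can fuse into a single fmla.

fn inner_loop_unrolled_sketch(C: Matrix, A: Matrix, B: Matrix, m: Int, k: Int):
    # Hypothetical hand-expansion of @unroll(4) on the n loop. The compiler
    # can load B[k, n..n+3] and C[m, n..n+3] into one 128-bit register each
    # and combine them with a single fmla.
    for n in range(0, N, 4):
        C[m, n + 0] += A[m, k] * B[k, n + 0]
        C[m, n + 1] += A[m, k] * B[k, n + 1]
        C[m, n + 2] += A[m, k] * B[k, n + 2]
        C[m, n + 3] += A[m, k] * B[k, n + 3]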
Why do we need to manually unroll the inner loop for the compiler to see this optimization opportunity? I'm not entirely sure. I suspect the compiler has a limited budget for time spent on loop analysis; after all, we want our code to compile quickly as well.
Where to next?
This example is certainly not the best possible matrix multiplication algorithm; the Modular documentation linked above goes into much more detail about techniques to improve its performance. But it does give us a fun little insight into loop unrolling, and specifically into how it is implemented in Mojo. I wonder what this example looks like in other languages.
Appendix
Notice that in the benchmarking data for the unrolled inner loop, we get slightly better performance as we unroll that inner loop more and more. Stepping by 8, 16, and eventually 32 elements per iteration each yields a small marginal gain - but why?
The assembly for the @unroll(32) case provides a hint:
LAB_1000016dc
ldp q1,q2,[x17, #-0x40]
ldp q3,q4,[x16, #-0x40]
fmla v3.4S,v0.4S,v1.4S
fmla v4.4S,v0.4S,v2.4S
stp q3,q4,[x16, #-0x40]
ldp q1,q2,[x17, #-0x20]
ldp q3,q4,[x16, #-0x20]
fmla v3.4S,v0.4S,v1.4S
fmla v4.4S,v0.4S,v2.4S
stp q3,q4,[x16, #-0x20]
ldp q1,q2,[x17]
ldp q3,q4,[x16]
fmla v3.4S,v0.4S,v1.4S
fmla v4.4S,v0.4S,v2.4S
stp q3,q4,[x16]
ldp q1,q2,[x17, #0x20]
ldp q3,q4,[x16, #0x20]
fmla v3.4S,v0.4S,v1.4S
fmla v4.4S,v0.4S,v2.4S
stp q3,q4,[x16, #0x20]
sub x0,x0,#0x20
add x17,x17,#0x80
add x16,x16,#0x80
cmp x0,#0x20
b.hi LAB_1000016dc
There are eight fmla instructions, arranged in four pairs, where the two instructions in each pair use different result registers. I'm not sure exactly how the Apple M2 implementation works, but I suspect it can take advantage of some hardware-level parallelism here and execute both of the fmla instructions in a pair at the same time. It sounds like I have more to explore!
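Before I go, here is a scalar sketch of why those pairs can overlap. It is a hypothetical hand-expansion of the smaller @unroll(8) case (the function name and signature are mine, and it assumes N is a multiple of 8): the first four updates and the last four touch different elements of C, so the two fmla instructions they presumably become, by analogy with the @unroll(32) listing above, have no data dependency between them.

fn unroll_by_eight_sketch(C: Matrix, A: Matrix, B: Matrix, m: Int, k: Int):
    for n in range(0, N, 8):
        # These four updates can become one fmla...
        C[m, n + 0] += A[m, k] * B[k, n + 0]
        C[m, n + 1] += A[m, k] * B[k, n + 1]
        C[m, n + 2] += A[m, k] * B[k, n + 2]
        C[m, n + 3] += A[m, k] * B[k, n + 3]
        # ...and these four another, writing different registers, so the
        # processor is free to overlap the two.
        C[m, n + 4] += A[m, k] * B[k, n + 4]
        C[m, n + 5] += A[m, k] * B[k, n + 5]
        C[m, n + 6] += A[m, k] * B[k, n + 6]
        C[m, n + 7] += A[m, k] * B[k, n + 7]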