Minimum implementations

April 4, 2017

I really enjoy reading the Bit Twiddling Hacks page. It has lots of great bit-sized (pun intended) algorithm implementations optimized for performance (or at least number of operations). Combine this with Compiler Explorer, and you’ve got hours of assembly language fun. For example, I thought it might be interesting to understand how different implementations of the min algorithm generate assembly code.

Let’s look at three different implementations, all from this section of the page. You can find the code generated by GCC 6.3 here.

No branches, please

Naturally, you would expect the min algorithm to require at least one branch, since it is essentially an “if” check. But with some fun bitwise operations, this branch can be avoided.

int no_branch_min(int x, int y) {
  return y ^ ((x ^ y) & -(x < y));
}

GCC produces this assembly (with an optimized compile), for x64:

1
2
3
4
5
6
7
8
9
no_branch_min(int, int):
  xor     eax, eax
  cmp     edi, esi
  setl    al
  xor     edi, esi
  neg     eax
  and     eax, edi
  xor     eax, esi
  ret

Let’s step through this code line by line to understand how the algorithm works. I’ll assume the arguments are the values 2 and 3.

  • Line 1: It looks like the arguments are passed in the edi and esi registers.

eax:? edi:0x00000002 esi:0x00000003 EFLAGS - SF:? OF:?

  • Line 2: Clear the eax register (it will be used later).

eax:0x00000000 edi:0x00000002 esi:0x00000003 EFLAGS - SF:? OF:?

  • Line 3: The cmp instruction subtracts the two arguments, and sets status flags, but doesn’t change any other register values. So 2 - 3 = -1; the sign flag is set and the overflow flag is not.

eax:0x00000000 edi:0x00000002 esi:0x00000003 EFLAGS - SF:1 OF:0

  • Line 4: The setl instruction sets its destination operand, in this case, the al register (the lower 16 bytes of eax) to a value of 1 if SF != OF. That is the case.

eax:0x00000001 edi:0x00000002 esi:0x00000003

  • Line 5: The xor instruction performs an exclusive or, and puts the result in the first register.

eax:0x00000001 edi:0x00000001 esi:0x00000003

  • Line 6: The neg instruction takes the two’s complement of its argument.

eax:0xffffffff edi:0x00000001 esi:0x00000003

  • Line 7: The logical and instruction next stores its result in eax. Again, it is zero.

eax:0x00000001 edi:0x00000001 esi:0x00000003

  • Line 8: The xor instruction of 1 and 3 results in 2, which is stored in eax and returned.

eax:0x00000002 edi:0x00000001 esi:0x00000003

So as advertised, this algorithm does not take any branches, and indeed returns the minimum (at least of 2 and 3). How does it compare to the minimum algorithm in the standard library though?

Using std::min

I had to do a double-take after looking at the code generated for std::min. Here it is, in all of its tiny glory:

1
2
3
4
cmp     esi, edi
mov     eax, edi
cmovle  eax, esi
ret

That’s right, GCC compiled this minimum implementation into a real minimum implementation, just four instructions! Let’s understand what they are doing, again with the inputs 2 and 3.

  • Line 1: The arguments are passed in the edi and esi registers. Notice that this cmp instruction is different from the first case, here it does 3 - 2 = 1, so no status flags are set.

eax:? edi:0x00000002 esi:0x00000003 EFLAGS - SF:0 OF:0

  • Line 2: Via mov, the value of 2 is copied from edi to eax. This really just sets up eax to be the return value of the function, it doesn’t have an impact on the algorithm.

eax:0x00000002 edi:0x00000002 esi:0x00000003 EFLAGS - SF:0 OF:0

  • Line 3: Finally, the meat of the algorithm is in cmovle. This instruction does a conditional move, replacing eax with esi only if the previous cmp found that the left operand was less than or equal to the right operand. In other words, if the zero flag is set or the sign flag is not equal to the overflow flag. Neither condition is true in this case, so the value of 2 remains in eax and is returned.

eax:0x00000002 edi:0x00000002 esi:0x00000003 EFLAGS - SF:0 OF:0

Even in this case, there are no branches! The compiler has replaced the branch by a conditional instruction.

Let’s be naive

Suppose instead that I roll my own implementation of min, using a very clear branch:

int naive_min(int x, int y) {
  if (x < y)
    return x;
   return y;
}

This code must have a branching instruction, right? Wrong!

1
2
3
4
cmp     edi, esi
mov     eax, esi
cmovle  eax, edi
ret

The compiler has provided the same implementation as std::min. It inferred what I wanted to do and wrote a better implementation than I could write myself.

Time to double check

Look at again at the entire assembly code. In the pane for compiler #2, I’ve used GCC 6.3 as well, but this time without any optimizations. Although the three different implementations generate more assembly code than the optimized compile, notice that they are still different from on another. The no_branch_min code does indeed have no branches, while standard_min and naive_min both have one conditional jump and one unconditional jump instruction.

So it is true that the no_branch_min implementation guarantees no branches, but a good optimizing compiler will generate better code for the more human-readable std::min and even the naive minimum implementation.

Let’s face it, optimizing C++ compilers are simply amazing.


Content © Josh Peterson

Site design by Sirupsen