A zero cost abstraction?
November 21, 2018
Recently Joachim (CTO at Unity) has been talking about “performance by default”, the mantra that software should be as fast as possible from the outset. This is driving the pretty cool stuff many at Unity are doing around things like ECS, the C# job system, and Burst (find lots more about that here).
One question Joachim has asked internally of Unity developers is (I’m paraphrasing here): “What is the absolute lower bound of time this code could use?” This strikes me as a really useful way to think about performance. The question changes from “How fast is this?” to “How fast could this be?”. If the answers to those two questions are not the same, the next question is “Do we really need the additional overhead?”
Another way to think about this is to consider the zero-cost abstraction, a concept much discussed in the C++ and Rust communities. Programmers are always building abstractions, and those abstractions often lead to the difference between “how fast it is” and “how fast it could be”. We want to provide useful abstractions that don’t hurt performance.
Reading some bytes
I was thinking about all of this recently while writing some code to read bytes from a binary file. The first bit of code that rolled off my fingers was:
This feels like the “canonical” way to read bytes into a buffer. The API has no abstraction - the function gets exactly what it needs: a pointer to some memory location and the number of bytes to read into that memory location.
I ran clang-tidy on this code, and it was not happy:
binary_reader.cpp:20:5: error: do not use pointer arithmetic
[cppcoreguidelines-pro-bounds-pointer-arithmetic,-warnings-as-errors]
buffer[i] = ReadByte();
At first this error was a bit confusing, but after staring at the code a bit, I
think I determined why clang-tidy doesn’t like it: pointer arithmetic, while the
fastest way to address a buffer, is prone to errors. Specifically, the user of this
function can pass any value for size
. The function has no choice but to
dutifully write to memory where the client asked for it, so even a well-meaning
client who passes the wrong size
can cause memory corruption. We need an
abstraction which makes this function difficult to misuse.
Enter the span
What options do we have for an interface to ReadBytes
that makes it easy to pass
the size of the buffer correctly? Lets’ list a few
ReadBytes(std::vector<uint8_t>& buffer);
template<size_t N> ReadBytes(std::array<uint8_t, N>& buffer);
template<typename Iterator> ReadBytes(Interator begin, Iterator end);
All of these will work, but they seem a bit restrictive of the client in different ways. In addition, some of these APIs encode more information than we really need. All the client really wants to say is: “Here is a buffer I’ve set aside in memory, please fill it up with bytes, thanks!”
Thankfully, there is an abstraction in C++20 for a collection of objects of a given
type, along with their size:
std::span
. I’m not using
C++20 for this project, but I can use the same type from the GSL,
gsl::span
. I
wrote a bit about span’s cousin gsl::multi_span
earlier,
learning that it has a small, non-zero cost. I wanted to dive a bit deeper with
span.
The new implementation of ReadBytes
looks like this:
Better yet, I can call it with a number of different buffer types, all in a simple way:
For each of these cases, the compiler infers the size of the buffer, so I don’t have to make sure I pass the proper size of the buffer. Problem solved!
But at what cost?
Now we have a safe API to read bytes into a buffer, and we know the baseline fastest way to read bytes into the buffer. If performance by default matters, we need to know how much we pay for this safe abstraction.
One good way to understand the cost of code is to investigate the generated assembly code, where there is very little abstraction. Take a look at this comparison (the first implementation is on the left, the safe one is on the right).
The safe implementation only costs us one additional comparison instruction (line 15 on the right) at the start of the for loop. I think this is checking an error condition, so I expect most of the time this branch will not be taken. I assume the processor will notice that pretty quickly and optimize for the non-error case.
Note that we do pay a code size price for this abstraction as well. We have five additional instructions here.
Lets measure the run time performance cost. It looks like it will be small, but can we be sure? I wrote this benchmark:
And here are the results:
Running ./bench
Run on (8 X 2693.7 MHz CPU s)
CPU Caches:
L1 Data 32K (x4)
L1 Instruction 32K (x4)
L2 Unified 1024K (x4)
L3 Unified 33792K (x4)
Load Average: 0.08, 0.15, 0.13
----------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------
ReadEightBytesRaw 365 ns 365 ns 1918628
ReadEightBytesSpan 364 ns 364 ns 1921003
Although much of the time in the benchmark is used opening and seeking in the binary file, the profiler indicates the percent of time spent in the byte reading code is nearly identical, with maybe slight advantage to the raw version, which might not even be outside the margin of error.
So using span
as an abstraction for an arbitrary length buffer to receive data
from a function provides:
- A safe interface
- A flexible interface
- No cost over the best performing case
In this case, span
gives us the API we want with performance by default.
Update (November 24)
Reddit user u/TheThiefMaster points
out that GCC will optimize the span
case so that
the code is the same as the unsafe case, meaning it really has no overhead.
Discuss this post on Reddit here.