A PyTorch Bug Caused by the Misused C++ Keyword restrict
When fuzzing the compilation stack in PyTorch 2.0 with NeuRI, we found some interesting bugs. Here we reveal one of them, caused by the misused C++ keyword `__restrict__`.
The `__restrict__` Keyword in C++
First, let’s take a look at the effect of the `__restrict__` keyword in C++ through the simple example below.
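A minimal sketch of the example (the exact original test.cpp may differ slightly, but the bodies match the discussion below):

```cpp
// test.cpp
void f1(int* a, int* b, int* x) {
    *a += *x;
    *b += *x;
}

void f2(int* __restrict__ a, int* __restrict__ b, int* __restrict__ x) {
    *a += *x;
    *b += *x;
}
```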
The difference between the two functions is that all the pointer arguments of `f2` are decorated with `__restrict__`, while the ones of `f1` are not. `__restrict__` tells the compiler that all these pointers are unique, which means they will never refer to the same memory address. The compiler can then do some optimizations. Let’s see how the compiler optimizes the code through the assembly.
We get the assembly code below by:

```bash
clang-14 -c -g -O1 test.cpp -o test.o
llvm-objdump-14 -S test.o > test.S
```
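The relevant part of test.S looks roughly like the sketch below; the actual dump (with mangled symbol names and addresses) differs, and the comments are mine. Registers follow the x86-64 System V convention: `a` in `%rdi`, `b` in `%rsi`, and `x` in `%rdx`.

```asm
; test.S (sketch)
f1:
    movl (%rdx), %eax    ; load *x
    addl %eax, (%rdi)    ; *a += *x
    movl (%rdx), %eax    ; reload *x, since a may alias x
    addl %eax, (%rsi)    ; *b += *x
    retq

f2:
    movl (%rdx), %eax    ; load *x once
    addl %eax, (%rdi)    ; *a += *x
    addl %eax, (%rsi)    ; *b += *x, no reload needed
    retq
```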
The first half of the assembly code corresponds to `f1`, while the remaining half corresponds to `f2`. We can see that the only difference is that in the assembly for `f2`, the second `movl (%rdx), %eax` instruction is omitted.
Normally, `*b += *x` is compiled into two x86 instructions, as in the assembly for `f1`. First, it loads `*x` from memory (`(%rdx)`) into a register (`%eax`); then it adds the value in this register to the data stored in memory, which is `*b` (`(%rsi)`). You may notice that `*x` is already loaded by the first instruction for `*a += *x;`, but we still need to load it again for `*b += *x;` in case the data pointed to by `x` has changed, which is exactly what happens when `a == x` (`a` and `x` point to the same memory address).
However, decorating `a` and `x` with `__restrict__` tells the compiler that they are different. As a result, the compiler believes that `*x` cannot be changed in `f2`, so it loads it only once.
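To make the consequence concrete, here is a hypothetical driver (not part of the original example) that calls both functions with fully aliased pointers. Note that passing aliasing pointers to `f2` violates the `__restrict__` contract, so its behavior is undefined:

```cpp
#include <cstdio>

void f1(int* a, int* b, int* x) {
    *a += *x;
    *b += *x;
}

void f2(int* __restrict__ a, int* __restrict__ b, int* __restrict__ x) {
    *a += *x;
    *b += *x;
}

int main() {
    int u = 1, v = 1;
    f1(&u, &u, &u); // u: 1 -> 2 -> 4, since *x is reloaded after the first add
    f2(&v, &v, &v); // undefined behavior: with the reload optimized away,
                    // the second add still uses the stale value 1, so v == 3
    std::printf("%d %d\n", u, v); // likely prints "4 3" at -O1
    return 0;
}
```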
The PyTorch Bug
By running the fuzzer NeuRI, we found a bug which can be triggered by the code below.
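Below is a sketch in the same shape as the NeuRI-generated repro; the exact tensor operations in the original report differ, but the key pattern is that the second operation reads a tensor whose two halves are produced separately by the first one:

```python
import torch

def fn(p0, p1):
    v7 = torch.cat([p0, p1])  # the two halves of v7 are written separately
    v1 = v7 * v7              # reads v7 right after its halves are written
    return v7, v1

p0 = torch.rand(8)
p1 = torch.rand(8)
v0, v1 = torch.compile(fn)(p0, p1)
print(v1)  # in the original report, v1 came back as all zeros
```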
As you can see, `fn` is composed of two tensor operations. After compilation, it gives wrong results for the second return value `v1`: all values in `v1` are zeros, which is incorrect.
What `torch.compile` does is generate a C++ kernel function to compute `fn`. We add some comments to help understand how the C++ function implements the Python function.
1 | extern "C" void kernel(const float* __restrict__ in_ptr0, // p0 |
Note that it uses `__restrict__` for all pointer arguments, which indicates that they are all different. But actually, they are NOT: `in_ptr2` points to the low-level memory address of tensor `v7`, while `out_ptr0` points to the first half of `v7` and `out_ptr1` points to the last half. They overlap.
The values of `v7` are changed by writing to the addresses referred to by `out_ptr0` and `out_ptr1` in the first two for loops. So, when reading the data of `v7` through `in_ptr2` in the last for loop, the kernel should load the values after the writes to `out_ptr0` and `out_ptr1`. If it loads them earlier, it must reload them to ensure correctness; otherwise, the old values stored in `v7` will be used for the multiplication. I guess that’s why the compiled function gives zeros.
Finally, the developers fixed this bug by removing the use of the `__restrict__` keyword in code generation.
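After the fix, the generated signature simply drops the qualifier (a sketch following the kernel above):

```cpp
extern "C" void kernel(const float* in_ptr0, // p0
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2);
```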
However, I could not reproduce the wrong behavior led by `__restrict__` at the assembly level. I compiled the C++ function above with `clang-14 -c -g -O3 k.cpp -o k.o && llvm-objdump-14 -S k.o` and got the assembly code below.
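An illustrative excerpt of the dump; only the `movss` load discussed below is from the original output, and the surrounding instructions are elided:

```asm
0000000000000000 <kernel>:
  ...
  movss (%rdx,%rax,4), %xmm0   ; loads v7[i] through in_ptr2 from memory
  ...
```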
I think it works correctly, since it loads the operands for the multiplication with `movss (%rdx,%rax,4), %xmm0`, which reads the values from memory every time; it doesn’t load the value once and then reuse the old data. So I don’t know exactly why the compiled function in PyTorch gives wrong results. We can only say that, by its definition, `__restrict__` should not be used there.