Better Shell Setup

This also installs some common dependencies if you have sudo access. If not, comment out the sudo commands.

git clone https://github.com/Co1lin/ML-Env-Setup
cd ML-Env-Setup
bash 0_basic.sh

Environment Variables

Set these environment variables in your .bashrc and/or .zshrc.

export HF_HOME=<a shared folder on a fast storage>
export HF_TOKEN_PATH=<a personal folder>
export TMPDIR=<redirect this when /tmp is on / and the storage is limited>
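
A quick way to double-check that a new shell actually picked up these variables is a minimal sketch like the one below (the expected values are whatever paths you chose above; the file name is just an example):

# check_env.py: verify the environment variables are visible to Python
import os
import tempfile

# HF_HOME / HF_TOKEN_PATH are read by Hugging Face libraries; TMPDIR is honored by tempfile
for var in ("HF_HOME", "HF_TOKEN_PATH", "TMPDIR"):
    print(f"{var} = {os.environ.get(var, '<not set>')}")

# tempfile respects TMPDIR, so this should point at the redirected location
print("tempfile.gettempdir() =", tempfile.gettempdir())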

NVIDIA GPU Related

  • Always monitor the status of GPUs when you plan to use or are using them. nvitop (pip install nvitop) is recommended.
  • Only use the GPUs that are needed for your task! Do NOT blindly use all GPUs! Limit the GPUs visible to your processes with, for example, export CUDA_VISIBLE_DEVICES=2,3 (see the sketch after this list).
    • Monitor the memory usage to help you decide how many GPUs to use.
    • Monitor “GPU-Util”
      • to make sure the GPUs are not idle. If GPU-Util stays at 0%, you should suspect that your processes are stuck!
      • to help you optimize your code, if GPU-Util is low.
  • Make sure you don’t have any dead processes occupying the GPUs! One way to exit your process: https://blog.co1in.me/skills/py-embed-trick/#Exit-the-program.
    • Show dead processes that cannot be shown by nvidia-smi:
      # sudo needed to view processes of other users
      fuser -v /dev/nvidia*
  • Avoid using GPUs that are in use by other users! (Unless you are very sure you will not affect them, e.g., by causing an OOM.)
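
As mentioned in the list above, here is a minimal sketch of limiting the visible GPUs from inside a Python script (assuming PyTorch; the variable must be set before CUDA is initialized, so set it before the first CUDA call, ideally before importing the framework):

# select_gpus.py: expose only physical GPUs 2 and 3 to this process
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"  # must happen before CUDA is initialized

import torch

# PyTorch now sees only the two selected GPUs, renumbered as cuda:0 and cuda:1
print(torch.cuda.device_count())      # -> 2 (assuming the machine has GPUs 2 and 3)
print(torch.cuda.get_device_name(0))  # name of physical GPU 2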

Python Related

When fuzzing the compilation stack in PyTorch 2.0 with NeuRI, we found some interesting bugs. Here we reveal one of them, caused by a misuse of the C++ keyword __restrict__.

The __restrict__ Keyword in C++

First, let’s take a look at the effect of the __restrict__ keyword in C++ through a simple example below.

// test.cpp
void f1(int* a, int* b, int* x) {
    *a += *x;
    *b += *x;
}

void f2(int* __restrict__ a, int* __restrict__ b, int* __restrict__ x) {
    *a += *x;
    *b += *x;
}

The difference between the two functions is that all the pointer arguments in f2 are decorated with __restrict__, while the ones in f1 are not. __restrict__ tells the compiler that these pointers never alias, i.e., they will not refer to the same memory address, which allows the compiler to perform additional optimizations. Let's see how the compiler optimizes f2 by looking at the assembly code.

We get the assembly code below by

clang-14 -c -g -O1 test.cpp -o test.o
llvm-objdump-14 -S test.o > test.S
; test.S
Disassembly of section .text:

0000000000000000 <_Z2f1PiS_S_>:
; *a += *x;
0: 8b 02 movl (%rdx), %eax
2: 01 07 addl %eax, (%rdi)
; *b += *x;
4: 8b 02 movl (%rdx), %eax
6: 01 06 addl %eax, (%rsi)
; }
8: c3 retq
9: 0f 1f 80 00 00 00 00 nopl (%rax)

0000000000000010 <_Z2f2PiS_S_>:
; *a += *x;
10: 8b 02 movl (%rdx), %eax
12: 01 07 addl %eax, (%rdi)
; *b += *x;
14: 01 06 addl %eax, (%rsi)
; }
16: c3 retq

The first half of the assembly code corresponds to f1, and the second half to f2. We can see that the only difference is that in the assembly for f2, the second movl (%rdx), %eax instruction is omitted.

Normally, *b += *x is compiled into two x86 instructions, as in the assembly for f1. First, it loads *x from memory ((%rdx)) into a register (%eax); then it adds the value in this register to the data stored in memory, which is *b ((%rsi)). You may notice that *x is already loaded by the first instruction for *a += *x;, but we still need to load it again for *b += *x; in case the data pointed to by x has changed, which is exactly what happens when a == x (a and x point to the same memory address).

However, decorating the pointers with __restrict__ tells the compiler that a, b, and x never alias. As a result, the compiler assumes that *x cannot be changed inside f2, so it loads it only once.

The PyTorch Bug

By running the fuzzer NeuRI, we found a bug that can be triggered by the code below.

import torch

p0 = torch.tensor([[4.9334, 5.5571]]) # (1, 2)

def fn():
    v7 = torch.cat([p0, p0], dim=0) # v7: (2, 2)
    v1 = torch.mul(v7, v7) # v1: (2, 2)
    return v7, v1

ret_eager = fn()
compiled = torch.compile(fn)
ret_compiled = compiled()

assert torch.allclose(ret_eager[0], ret_compiled[0])
# ^^^ no error
assert torch.allclose(ret_eager[1], ret_compiled[1])
''' ^^^ WRONG!
AssertionError:
ret_eager[1] = tensor([[24.3384, 30.8814],
                       [24.3384, 30.8814]])
ret_compiled[1] = tensor([[0., 0.],
                          [0., 0.]])
'''

As you can see, fn is composed of two tensor operations. After compilation, it gives wrong results for the second return value v1: all values in v1 are zeros, which is incorrect.

What torch.compile does is generate a C++ kernel function to compute fn. We add some comments to help explain how the C++ function implements the Python function.

extern "C" void kernel(const float* __restrict__ in_ptr0, // p0
const float* __restrict__ in_ptr1, // p0
const float* __restrict__ in_ptr2, // v7
float* __restrict__ out_ptr0, // first half of v7
float* __restrict__ out_ptr1, // last half of v7
float* __restrict__ out_ptr2) // v1
{
{ // 1st part of cat operation: copy values in p0 to the first half of v7
#pragma GCC ivdep
for(long i0=0; i0<2; i0+=1)
{
auto tmp0 = in_ptr0[i0];
out_ptr0[i0] = tmp0;
}
}
{ // 2nd part of cat operation: copy values in p0 to the last half of v7
#pragma GCC ivdep
for(long i0=0; i0<2; i0+=1)
{
auto tmp0 = in_ptr1[i0];
out_ptr1[i0] = tmp0;
}
}
{ // mul operation: v1 <- element-wise multiplication of v7 and v7
#pragma GCC ivdep
for(long i0=0; i0<4; i0+=1)
{
auto tmp0 = in_ptr2[i0];
auto tmp1 = tmp0 * tmp0;
out_ptr2[i0] = tmp1;
}
}
}

As you can see, it uses __restrict__ for all pointer arguments, which claims that they never alias. But actually, they DO. in_ptr2 points to the underlying memory of tensor v7, while out_ptr0 points to the first half of v7 and out_ptr1 points to the last half, so these pointers overlap.

The values of v7 are written through out_ptr0 and out_ptr1 in the first two for loops. So, when the last for loop reads v7 through in_ptr2, it must load the values after those writes have happened; if the values are loaded earlier, they must be reloaded to ensure correctness. Otherwise, stale values of v7 will be used for the multiplication. I guess that's why the compiled function gives zeros.
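
We can also see this overlap from Python in eager mode: the two halves of v7 and v7 itself share the same underlying buffer, which is exactly what out_ptr0, out_ptr1, and in_ptr2 point into according to the comments above. A minimal sketch (the pointer-to-tensor mapping is taken from those comments):

import torch

p0 = torch.tensor([[4.9334, 5.5571]])
v7 = torch.cat([p0, p0], dim=0)  # (2, 2), one contiguous buffer

first_half = v7[:1]  # view over row 0 -> the region written through out_ptr0
last_half  = v7[1:]  # view over row 1 -> the region written through out_ptr1

print(v7.data_ptr() == first_half.data_ptr())                         # True: same start address
print(last_half.data_ptr() - v7.data_ptr() == 2 * v7.element_size())  # True: row 1 starts 2 floats later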

Finally, the developers fixed this bug by removing the usage of __restrict__ keywords in code generation.

However, I could not reproduce the wrong behavior caused by __restrict__ at the assembly level. I compiled the C++ function above with clang-14 -c -g -O3 k.cpp -o k.o && llvm-objdump-14 -S k.o and got the assembly code below.

0000000000000000 <kernel>:
; out_ptr0[i0] = tmp0;
0: 48 8b 07 movq (%rdi), %rax
3: 48 89 01 movq %rax, (%rcx)
; out_ptr1[i0] = tmp0;
6: 48 8b 06 movq (%rsi), %rax
9: 49 89 00 movq %rax, (%r8)
c: 31 c0 xorl %eax, %eax
e: 66 90 nop
; auto tmp0 = in_ptr2[i0];
10: f3 0f 10 04 82 movss (%rdx,%rax,4), %xmm0 # xmm0 = mem[0],zero,zero,zero
; auto tmp1 = tmp0 * tmp0;
15: f3 0f 59 c0 mulss %xmm0, %xmm0
; out_ptr2[i0] = tmp1;
19: f3 41 0f 11 04 81 movss %xmm0, (%r9,%rax,4)
; for(long i0=0; i0<4; i0+=1)
1f: 48 83 c0 01 addq $1, %rax
23: 48 83 f8 04 cmpq $4, %rax
27: 75 e7 jne 0x10 <kernel+0x10>
; }
29: c3 retq

I think this assembly is correct, since it loads the operands for the multiplication with movss (%rdx,%rax,4), %xmm0, i.e., it reads the values from memory inside the loop rather than loading them early and reusing stale data. So I don't know exactly why the compiled function in PyTorch gives wrong results; we can only say that, by its definition, __restrict__ should not be used there.

Insert the breakpoint

There's a useful trick for efficiently debugging Python code. Say you have a loop like the one below; how can you interactively access the list l at each loop step?

l = []
for i in range(5):
    l.append(i)

You can insert a “breakpoint” as follows.

# test.py
from IPython import embed # pip install ipython
l = []
for i in range(5):
    l.append(i)
    embed()

Then you run python test.py in the shell, and an interactive environment pops up like this:

colin ❯ python test.py
Python 3.10.4 (main, Mar 31 2022, 03:38:35) [Clang 12.0.0 ]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]:

The program stops at the position of embed(), and you can access variables visible at this point, like:

In [1]: l
Out[1]: [0]

You can also execute most kinds of Python code here, like:

In [2]: l.append(100)

In [3]: l
Out[3]: [0, 100] # l is changed!

In [4]: import random; print(random.random())
0.42541864192778645

You can use quit to continue running the program, and the program will stop at the next breakpoint if there’s any. Ctrl+D is equivalent to this.

In [5]: quit

Python 3.10.4 (main, Mar 31 2022, 03:38:35) [Clang 12.0.0 ]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: l
Out[1]: [0, 100, 1]

Exit the program

Sometimes you insert embed() inside a loop that repeats many times, and at some point you just want to exit the program. But you will find that neither Ctrl+C nor Ctrl+D works here, so one option is simply to close the shell, which terminates the process. :)

Closing the shell works, but we have something better. Say you are now in the interactive environment provided by embed. Press Ctrl+Z, and you are back in your shell and see something like the following.

In [2]: # press Ctrl+Z here!

[1] + 4810 suspended python test.py
colin ❯ # we are back to the shell!

BUT we’re NOT done yet! Ctrl+Z just sends the “terminal stop” signal (SIGTSTP) to the foreground running process. The process will not take any more CPU resources, but it still occupies memory and ISN’T dead yet. You can even use fg to bring it back!

colin ❯ fg
[1] - 4810 continued ipython
In [2]: # you can continue to use the interactive environment here

To terminate the process completely, you need to use the kill -9 command, which sends a SIGKILL signal telling the process to shut down immediately. In our case, you can execute kill -9 %1 to terminate the process just suspended by Ctrl+Z.

%1 means “job number 1” in the current shell. You can run jobs to list all jobs in the current shell, like:

colin ❯ ipython # run ipython in the shell first
Python 3.10.4 (main, Mar 31 2022, 03:38:35) [Clang 12.0.0 ]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.4.0 -- An enhanced Interactive Python. Type '?' for help.


[1] + 6862 suspended ipython # Ctrl+Z to suspend ipython
colin ❯ python # then run python in the shell
Python 3.10.4 (main, Mar 31 2022, 03:38:35) [Clang 12.0.0 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
[2] + 6879 suspended python # Ctrl+Z to suspend python
colin ❯ jobs # list all jobs in the current shell
[1] - suspended ipython # can be terminated with kill -9 %1
[2] + suspended python # can be terminated with kill -9 %2

Combining Ctrl+Z and kill -9 is also useful for stopping a process immediately: sometimes, after Ctrl+C, the process does some post-processing that may take a long time, and this trick lets you stop it right away.
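
As a concrete (hypothetical) example of such slow post-processing, a script like the sketch below catches Ctrl+C (KeyboardInterrupt) and runs a long cleanup step, so Ctrl+C alone takes a while to return you to the shell, whereas Ctrl+Z followed by kill -9 skips the cleanup entirely:

# slow_exit.py: a hypothetical script with slow cleanup on Ctrl+C
import time

try:
    while True:      # pretend this is a long-running training loop
        time.sleep(1)
except KeyboardInterrupt:
    # Ctrl+C lands here; the "post-processing" below keeps the process alive
    print("saving checkpoints / flushing logs ...")
    time.sleep(60)   # Ctrl+Z followed by `kill -9 %1` would skip this wait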

Deactivate the current embed

Thanks to a reminder from my friend @Leo, we can use %kill_embedded in the IPython interactive environment to deactivate the current embed() while keeping the others working. For example, in the program below, after stopping at the embed() in the first loop twice, we can run %kill_embedded, confirm it, and then quit to skip the remaining stops in the first loop, while the embed() in the second loop still works, so we will stop there.

# test2.py
from IPython import embed

l = []
for i in range(4):
    l.append(i)
    embed()

print('==== finish the first loop! ====')

for i in range(4, 8):
    l.append(i)
    embed()

Execution log in the shell:

colin ❯ python test2.py
Python 3.10.4 (main, Mar 31 2022, 03:38:35) [Clang 12.0.0 ]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: l
Out[1]: [0]

In [2]: quit

Python 3.10.4 (main, Mar 31 2022, 03:38:35) [Clang 12.0.0 ]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: l
Out[1]: [0, 1]

In [2]: %kill_embedded <<<<---- deactivate this embed
Namespace(instance=False, exit=False, yes=False)
Are you sure you want to kill this embedded call_location? [y/N] y <<<<---- confirm the deactivation
This embedded IPython call location will not reactivate anymore once you exit.

In [3]: l
Out[3]: [0, 1]

In [4]: quit <<<<---- quit the second stop

==== finish the first loop! ==== <<<<---- embed in the first loop will not work any more and we directly reach here
Python 3.10.4 (main, Mar 31 2022, 03:38:35) [Clang 12.0.0 ]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: l <<<<---- stop at the embed in the second loop
Out[1]: [0, 1, 2, 3, 4]

In [2]: quit

References:

  • What is effect of CTRL + Z on a unix\Linux application
  • https://superuser.com/questions/275433/what-does-1-in-kill-1-mean
