print(list(filter(
    lambda v: not (is_young(v) or is_simple(v) or is_naive(v)),
    Bilibili('colin').subscriptions + YouTube('colin').subscriptions,
)))
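Channel/UP | Tags | Comments
--- | --- | ---
sun riches | Economics; Politics; Current affairs | Updated daily, mainly with economics-related news. I found it through a recommendation post on Treehole (树洞). (I miss Treehole.) At first I wasn't used to the machine-read voice, but I got used to it. It offers fairly rational analysis of current events.
Vox | Popular science; Current affairs | Possibly one of the models that PaperClip drew on.
回形针PaperClip | Popular science | Chinese popular science with good visualizations, and many of its topics are quite interesting. Sadly, it was banned because of an intolerant public-opinion climate that is quick to turn anything into a political issue.
马督工 | Current affairs | Perhaps the most outspoken current-affairs commentator among China's independent media. Opinions on his views may differ, but media cannot offer professional, rigorous academic views anyway. Its job is to present the world from an ordinary person's perspective, based on facts and logic.
聂辉华教授 | Political economy | http://www.niehuihua.com Perhaps few social problems or phenomena have nothing to do with economics.
蒙克MK_ | Sociology | Thinks about people's relationships with others, with work, with society, and so on, from the perspective of social modernity.
毛立平 | History | Qing dynasty history.
投资人黄海 | Business |
全嘻嘻 | Lifestyle | Relaxing conversations whose value lies in their sincerity.
机坪夜话 | Aviation |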

When fuzzing the compilation stack of PyTorch 2.0 with NeuRI, we found some interesting bugs. Here we describe one of them, caused by misuse of the C++ keyword __restrict__ (a compiler extension supported by GCC and Clang).

The __restrict__ Keyword in C++

First, let's look at the effect of the __restrict__ keyword in C++ through the simple example below.

// test.cpp
void f1(int* a, int* b, int* x) {
    *a += *x;
    *b += *x;
}

void f2(int* __restrict__ a, int* __restrict__ b, int* __restrict__ x) {
    *a += *x;
    *b += *x;
}

The difference between the two functions is that all pointer arguments in f2 are decorated with __restrict__, while those in f1 are not. __restrict__ is a promise to the compiler that, within the function, the memory accessed through each of these pointers is not accessed through any of the others. In other words, they do not alias. With that promise, the compiler can perform optimizations it otherwise could not. Let's look at the assembly code to see how the compiler optimizes it.

We obtain the assembly code below with:

clang-14 -c -g -O1 test.cpp -o test.o
llvm-objdump-14 -S test.o > test.S
; test.S
Disassembly of section .text:

0000000000000000 <_Z2f1PiS_S_>:
; *a += *x;
0: 8b 02 movl (%rdx), %eax
2: 01 07 addl %eax, (%rdi)
; *b += *x;
4: 8b 02 movl (%rdx), %eax
6: 01 06 addl %eax, (%rsi)
; }
8: c3 retq
9: 0f 1f 80 00 00 00 00 nopl (%rax)

0000000000000010 <_Z2f2PiS_S_>:
; *a += *x;
10: 8b 02 movl (%rdx), %eax
12: 01 07 addl %eax, (%rdi)
; *b += *x;
14: 01 06 addl %eax, (%rsi)
; }
16: c3 retq

The first half of the assembly code corresponds to f1 and the second half to f2. The nopl after f1's retq is only alignment padding. The sole difference is that the assembly for f2 omits the second movl (%rdx), %eax instruction.

Normally, *b += *x compiles into two x86 instructions, as in the assembly for f1. First, *x is loaded from memory ((%rdx)) into a register (%eax). Then the value in that register is added to the memory holding *b ((%rsi)). Notice that *x was already loaded by the first instruction for *a += *x;, yet f1 loads it again for *b += *x;. The reload is necessary in case the data pointed to by x has changed in between. That is exactly what happens when a == x (a and x point to the same memory address): *a += *x then also modifies *x.

However, decorating a and x with __restrict__ tells the compiler that they never refer to the same memory. As a result, the compiler assumes that *x cannot change inside f2 and loads it only once. If a caller nevertheless passes aliasing pointers to f2, the behavior is undefined.
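The aliasing argument can be made concrete with a toy Python model. This is an illustration only, not how the compiler works internally. Memory is a list, the pointers a, b and x are indices into it, and the hypothetical f2_as_compiled mirrors the f2 assembly above by loading *x into a "register" once. With a == x the two versions disagree. In real C++, calling f2 with aliasing pointers is undefined behavior, so the model only shows what that particular machine code would compute.

```python
# Toy model: "memory" is a Python list and "pointers" are indices into it.

def f1(mem, a, b, x):
    # Without __restrict__: *x is re-read for each statement (two movl loads).
    mem[a] += mem[x]
    mem[b] += mem[x]

def f2_as_compiled(mem, a, b, x):
    # Mirrors the f2 assembly: *x is loaded into a "register" once and reused.
    # This is only equivalent to f1 if writing *a cannot change *x.
    eax = mem[x]
    mem[a] += eax
    mem[b] += eax

# a and x both point at mem[0]; b points at mem[1].
mem1 = [1, 0]
f1(mem1, a=0, b=1, x=0)
mem2 = [1, 0]
f2_as_compiled(mem2, a=0, b=1, x=0)
print(mem1, mem2)  # [2, 2] [2, 1]
```

When the pointers do not alias (for example a=0, b=1, x=2), both functions produce the same result, which is why the compiler may drop the reload once __restrict__ rules aliasing out.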

The PyTorch Bug

Running the fuzzer NeuRI, we found a bug that can be triggered by the code below.

import torch

p0 = torch.tensor([[4.9334, 5.5571]])  # (1, 2)

def fn():
    v7 = torch.cat([p0, p0], dim=0)  # v7: (2, 2)
    v1 = torch.mul(v7, v7)           # v1: (2, 2)
    return v7, v1

ret_eager = fn()
compiled = torch.compile(fn)
ret_compiled = compiled()

assert torch.allclose(ret_eager[0], ret_compiled[0])
# ^^^ no error
assert torch.allclose(ret_eager[1], ret_compiled[1])
''' ^^^ WRONG!
AssertionError:
ret_eager[1] = tensor([[24.3384, 30.8814],
                       [24.3384, 30.8814]])
ret_compiled[1] = tensor([[0., 0.],
                          [0., 0.]])
'''

As you can see, fn consists of two tensor operations. After compilation, it returns a wrong result for its second return value v1: all values in v1 are zeros.

Under the hood, torch.compile generates a C++ kernel function that computes fn. We added comments to show how the C++ function implements the Python function.

extern "C" void kernel(const float* __restrict__ in_ptr0,  // p0
                       const float* __restrict__ in_ptr1,  // p0
                       const float* __restrict__ in_ptr2,  // v7
                       float* __restrict__ out_ptr0,       // first half of v7
                       float* __restrict__ out_ptr1,       // last half of v7
                       float* __restrict__ out_ptr2)       // v1
{
    {  // 1st part of cat operation: copy values in p0 to the first half of v7
        #pragma GCC ivdep
        for(long i0=0; i0<2; i0+=1)
        {
            auto tmp0 = in_ptr0[i0];
            out_ptr0[i0] = tmp0;
        }
    }
    {  // 2nd part of cat operation: copy values in p0 to the last half of v7
        #pragma GCC ivdep
        for(long i0=0; i0<2; i0+=1)
        {
            auto tmp0 = in_ptr1[i0];
            out_ptr1[i0] = tmp0;
        }
    }
    {  // mul operation: v1 <- element-wise multiplication of v7 and v7
        #pragma GCC ivdep
        for(long i0=0; i0<4; i0+=1)
        {
            auto tmp0 = in_ptr2[i0];
            auto tmp1 = tmp0 * tmp0;
            out_ptr2[i0] = tmp1;
        }
    }
}

As you can see, the kernel marks every pointer argument with __restrict__, promising the compiler that none of them overlap. But in fact they DO overlap. in_ptr2 points to the underlying memory of tensor v7, while out_ptr0 points to the first half of v7 and out_ptr1 to the last half.

The values of v7 are written through out_ptr0 and out_ptr1 in the first two for loops. So when the last for loop reads v7 through in_ptr2, it must load the values after those writes. Any value loaded earlier would have to be reloaded to be correct. Otherwise, the stale values in v7 are used in the multiplication. I guess that's why the compiled function returns zeros.
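The overlap and the guess about stale values can be modeled in pure Python. This is a sketch under assumptions, not a confirmed root cause. One float32 array stands in for v7's storage and is zero-filled here, although a real freshly allocated buffer could contain anything. Slices of a memoryview play out_ptr0 and out_ptr1, so writes through them are visible through in_ptr2. The hypothetical hoist_loads flag moves the in_ptr2 loads before the writes, a reordering that the __restrict__ promise would permit.

```python
from array import array

P0 = [4.9334, 5.5571]  # the values of p0

def run_kernel(hoist_loads):
    v7 = array("f", [0.0] * 4)   # stand-in for v7's storage (assumed zero-filled)
    v1 = array("f", [0.0] * 4)   # output buffer for v1
    in_ptr2 = memoryview(v7)     # the whole of v7
    out_ptr0 = in_ptr2[0:2]      # first half of v7: shares memory with in_ptr2
    out_ptr1 = in_ptr2[2:4]      # last half of v7: shares memory with in_ptr2

    if hoist_loads:
        # Allowed if in_ptr2 truly did not alias the outputs, as __restrict__ claims.
        loaded = in_ptr2.tolist()

    for i in range(2):           # cat, part 1
        out_ptr0[i] = P0[i]
    for i in range(2):           # cat, part 2
        out_ptr1[i] = P0[i]

    if not hoist_loads:
        loaded = in_ptr2.tolist()  # load after the writes, as the source order says

    for i in range(4):           # mul
        v1[i] = loaded[i] * loaded[i]
    return list(v1)

print(run_kernel(hoist_loads=False))  # roughly [24.3384, 30.8814, 24.3384, 30.8814]
print(run_kernel(hoist_loads=True))   # [0.0, 0.0, 0.0, 0.0]
```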

Eventually, the developers fixed this bug by no longer emitting __restrict__ in the generated code.

However, I could not reproduce the wrong behavior caused by __restrict__ at the assembly level. I compiled the C++ function above with clang-14 -c -g -O3 k.cpp -o k.o && llvm-objdump-14 -S k.o and got the assembly code below.

0000000000000000 <kernel>:
; out_ptr0[i0] = tmp0;
0: 48 8b 07 movq (%rdi), %rax
3: 48 89 01 movq %rax, (%rcx)
; out_ptr1[i0] = tmp0;
6: 48 8b 06 movq (%rsi), %rax
9: 49 89 00 movq %rax, (%r8)
c: 31 c0 xorl %eax, %eax
e: 66 90 nop
; auto tmp0 = in_ptr2[i0];
10: f3 0f 10 04 82 movss (%rdx,%rax,4), %xmm0 # xmm0 = mem[0],zero,zero,zero
; auto tmp1 = tmp0 * tmp0;
15: f3 0f 59 c0 mulss %xmm0, %xmm0
; out_ptr2[i0] = tmp1;
19: f3 41 0f 11 04 81 movss %xmm0, (%r9,%rax,4)
; for(long i0=0; i0<4; i0+=1)
1f: 48 83 c0 01 addq $1, %rax
23: 48 83 f8 04 cmpq $4, %rax
27: 75 e7 jne 0x10 <kernel+0x10>
; }
29: c3 retq

I think this code is correct. It loads the operands of the multiplication with movss (%rdx,%rax,4), %xmm0, which reads the values from memory after the two copies have been stored. It does not load the values early and then compute with stale data. So I don't know exactly why the compiled function in PyTorch gives wrong results. All we can say is that, by its definition, __restrict__ should not be used there.

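Below is an excerpt from a stand-up comedy routine. Please analyze how the speaker evaluates the two dogs, and explain the intent behind using the eight characters "作风优良,能打胜仗" ("exemplary conduct, able to win battles"), a phrase used to describe the People's Liberation Army (PLA).

After I moved to Shanghai, I adopted two stray dogs. Strictly speaking, neither of them was a stray. We picked up two wild dogs from a hill near our home. We can't even say we rescued them, because their place in the food chain on that hill meant they didn't need rescuing at all. We were just, like in 《变形计》 (a reality show in which urban and rural children swap lives), maybe letting them experience city life. These two dogs really were at the top of the food chain on the hill. The first moment I saw them, I felt I wasn't looking at two dogs but at the set of Animal World. Two dogs chasing a squirrel, launching out like cannonballs. Usually when you see dogs, you think they're so cute, your heart is about to melt; those are the words that come to mind. When I saw these two dogs, only eight big characters flashed through my mind: "作风优良,能打胜仗" ("exemplary conduct, able to win battles"). Truly excellent. I walked these two dogs through the streets of Shanghai, looking majestic. The only problem is that they need so much exercise that my physical fitness can't quite keep up.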

ChatGPT (May 12 Version):

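From this excerpt, the speaker's evaluation of the two dogs can be summarized as follows:

Top of the animal world: The speaker says the two dogs were at the top of the food chain on the hill, implying strong survival skills and outstanding hunting ability in the wild.

Cannonball-like speed: The speaker describes the two dogs launching after a squirrel like cannonballs, implying they are very fast.

"Exemplary conduct, able to win battles": The speaker uses these eight characters to describe the inner nature of the two dogs, expressing approval of their qualities. This is an exaggerated, humorous rhetorical device meant to emphasize the dogs' outstanding performance.

From these evaluations, the speaker's attitude toward the two dogs is clearly positive and appreciative. They are portrayed as strong, agile, and of fine character, leaving a deep impression.

In addition, as for the intent behind using the eight characters that describe the PLA, "exemplary conduct, able to win battles", the speaker borrows this description to exaggerate and praise the dogs' qualities. This usage is a humorous, exaggerated metaphor: by linking the animals' behavior with the PLA's conduct, it aims to resonate with the audience and make them laugh. It also suggests the speaker's admiration for the PLA, and applying this admiration to a description of two dogs heightens the humor.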

By using words that describe the PLA to evaluate wild dogs, is the speaker insulting the PLA through the comparison with wild dogs?

ChatGPT (May 12 Version):

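Based on the information provided, the speaker used the eight characters "作风优良,能打胜仗" ("exemplary conduct, able to win battles"), which describe the PLA, to describe the two wild dogs. Here we need to understand the speaker's intent and the context. From the description, the speaker's evaluation of the two wild dogs is positive, and he compares their qualities with the conduct of the PLA.

Although words describing the PLA are used for wild dogs, there is no explicit expression of insult or disparagement toward the PLA. Rather, the comparison is a rhetorical device that links the animals' behavior with the PLA's conduct to emphasize the dogs' outstanding performance and qualities. It is a humorous, exaggerated metaphor intended to resonate with the audience and draw laughter, not to insult or disparage the PLA.

It should be noted that interpretations of such language are also shaped by personal views and understanding. Different people may therefore interpret the speaker's intent in using this metaphor differently.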

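Reference: 【現場錄音】House李昊石脫口秀因這八個字被停演批鬥 ("[Live recording] House (Li Haoshi)'s stand-up show suspended and denounced over these eight characters").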

Insert the breakpoint

There's a useful trick for debugging Python code efficiently. Say you have a loop like the one below. How can you interactively inspect the list l at each step?

l = []
for i in range(5):
    l.append(i)

You can insert a “breakpoint” as follows.

# test.py
from IPython import embed  # pip install ipython
l = []
for i in range(5):
    l.append(i)
    embed()

Then run python test.py in the shell, and an interactive environment like this will appear:

colin ❯ python test.py
Python 3.10.4 (main, Mar 31 2022, 03:38:35) [Clang 12.0.0 ]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]:

The program stops at the position of embed(), and you can access variables visible at this point, like:

In [1]: l
Out[1]: [0]

You can also execute most kinds of Python code here, like:

In [2]: l.append(100)

In [3]: l
Out[3]: [0, 100] # l is changed!

In [4]: import random; print(random.random())
0.42541864192778645

Type quit to continue running the program. It will stop at the next breakpoint, if there is one. Pressing Ctrl+D does the same.

In [5]: quit

Python 3.10.4 (main, Mar 31 2022, 03:38:35) [Clang 12.0.0 ]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: l
Out[1]: [0, 100, 1]
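When a loop runs many times, it can help to stop only on the iterations you care about. The following is a minimal sketch, assuming you are happy to guard the breakpoint with a condition. build_list, should_stop and the injectable pause parameter are hypothetical names. pause defaults to IPython's embed, so the same function can also be exercised without an interactive shell.

```python
def build_list(n, should_stop, pause=None):
    """Append 0..n-1 to a list, pausing only when should_stop(i, l) is true."""
    if pause is None:
        from IPython import embed as pause  # pip install ipython
    l = []
    for i in range(n):
        l.append(i)
        if should_stop(i, l):
            pause()  # with the default, this opens IPython with i and l in scope
    return l
```

With the default pause, build_list(5, lambda i, l: i == 3) would open the IPython prompt only once, on the fourth iteration. Passing the condition as a function keeps the loop body unchanged when you switch to a different stopping rule.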

Exit the program

Sometimes you insert embed() inside a loop that repeats many times, and at some point you just want to exit the program. But neither Ctrl+C nor Ctrl+D works here (Ctrl+D just continues to the next embed() call), so you might simply close the shell, which terminates the process. :)

Closing the shell works, but we have something better. Say now you are in the interactive environment provided by embed. You can press Ctrl+Z here, and then you are back to your shell and see something like below.

In [2]: # press Ctrl+Z here!

[1] + 4810 suspended python test.py
colin ❯ # we are back to the shell!

BUT we're NOT done yet! Ctrl+Z just sends the "terminal stop" signal (SIGTSTP) to the foreground process. The suspended process uses no more CPU time, but it still occupies memory and is NOT dead yet. You can even bring it back with fg!

colin ❯ fg
[1] - 4810 continued ipython
In [2]: # you can continue to use the interactive environment here

To terminate the process completely, use the kill -9 command, which sends the SIGKILL signal. SIGKILL cannot be caught or ignored, so the process is terminated immediately, even while it is suspended. In our case, run kill -9 %1 to terminate the process just suspended by Ctrl+Z.
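The suspend, resume and kill sequence can be reproduced programmatically. The following is a minimal POSIX-only sketch using Python's os, signal and subprocess modules. suspend_then_kill is a hypothetical helper, and a sleeping child stands in for python test.py. The sketch sends SIGSTOP rather than SIGTSTP: Ctrl+Z sends SIGTSTP, which a program may catch, whereas SIGSTOP always stops the process, keeping the demo predictable outside a terminal. The terminal handoff that fg also performs is not modeled.

```python
import os
import signal
import subprocess
import sys

def suspend_then_kill():
    """Mimic Ctrl+Z, fg and kill -9 on a child process; return what we observed."""
    # Stand-in for a long-running program such as `python test.py`.
    proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    observed = []

    # Like Ctrl+Z (but uncatchable): stop the process.
    os.kill(proc.pid, signal.SIGSTOP)
    _, status = os.waitpid(proc.pid, os.WUNTRACED)  # blocks until the child has stopped
    observed.append("stopped" if os.WIFSTOPPED(status) else "not stopped")

    # Like fg: resume it. The process was never dead, so it keeps running.
    os.kill(proc.pid, signal.SIGCONT)
    observed.append("alive" if proc.poll() is None else "exited")

    # Like kill -9: SIGKILL terminates it at once; it gets no chance to clean up.
    os.kill(proc.pid, signal.SIGKILL)
    observed.append(proc.wait())  # a negative return code means "killed by that signal"
    return observed

print(suspend_then_kill())  # ['stopped', 'alive', -9]
```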

%1 means “job number 1” in the current shell. You can run jobs to list all jobs in the current shell, like:

colin ❯ ipython # run ipython in the shell first
Python 3.10.4 (main, Mar 31 2022, 03:38:35) [Clang 12.0.0 ]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.4.0 -- An enhanced Interactive Python. Type '?' for help.


[1] + 6862 suspended ipython # Ctrl+Z to suspend ipython
colin ❯ python # then run python in the shell
Python 3.10.4 (main, Mar 31 2022, 03:38:35) [Clang 12.0.0 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
[2] + 6879 suspended python # Ctrl+Z to suspend python
colin ❯ jobs # list all jobs in the current shell
[1] - suspended ipython # can be terminated with kill -9 %1
[2] + suspended python # can be terminated with kill -9 %2

Combining Ctrl+Z and kill -9 is also useful for stopping a process immediately. Sometimes, after Ctrl+C, a process runs post-processing that may take a long time. Ctrl+Z followed by kill -9 stops it right away.

Deactivate the current embed

Thanks to a reminder from my friend @Leo, we can use %kill_embedded in the IPython interactive environment to deactivate the current embed() while keeping the others working. For example, in the program below, we stop twice at the embed() in the first loop. Then we run %kill_embedded, confirm it, and quit. The remaining stops in the first loop are skipped, but the embed() in the second loop still works, so the program stops there.

# test2.py
from IPython import embed

l = []
for i in range(4):
    l.append(i)
    embed()

print('==== finish the first loop! ====')

for i in range(4, 8):
    l.append(i)
    embed()

Execution log in the shell:

colin ❯ python test2.py
Python 3.10.4 (main, Mar 31 2022, 03:38:35) [Clang 12.0.0 ]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: l
Out[1]: [0]

In [2]: quit

Python 3.10.4 (main, Mar 31 2022, 03:38:35) [Clang 12.0.0 ]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: l
Out[1]: [0, 1]

In [2]: %kill_embedded <<<<---- deactivate this embed
Namespace(instance=False, exit=False, yes=False)
Are you sure you want to kill this embedded call_location? [y/N] y <<<<---- confirm the deactivation
This embedded IPython call location will not reactivate anymore once you exit.

In [3]: l
Out[3]: [0, 1]

In [4]: quit <<<<---- quit the second stop

==== finish the first loop! ==== <<<<---- embed in the first loop will not work any more and we directly reach here
Python 3.10.4 (main, Mar 31 2022, 03:38:35) [Clang 12.0.0 ]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: l <<<<---- stop at the embed in the second loop
Out[1]: [0, 1, 2, 3, 4]

In [2]: quit

References:

What is effect of CTRL + Z on a unix\Linux application
https://superuser.com/questions/275433/what-does-1-in-kill-1-mean
