If the math equations are not rendered correctly, please force-refresh the page.

Input and Output

https://arxiv.org/abs/2501.12948

They instruct the model to give responses in the following format:

<think>
To solve this problem, we can first ...
</think>
<answer>
The answer to this question is ...
</answer>

In this way, they clearly separate the CoT reasoning process from the answer, so we can easily extract each part.
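For example, a minimal extraction sketch (the tag names follow the format above; the regular expressions and the function name are just one possible implementation):

import re

def split_response(text: str):
    """Split a response into its CoT part and its answer part."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return (
        think.group(1).strip() if think else None,
        answer.group(1).strip() if answer else None,
    )

cot, ans = split_response(
    "<think>\nTo solve this problem, we can first ...\n</think>\n"
    "<answer>\nThe answer to this question is ...\n</answer>"
)
print(cot)  # To solve this problem, we can first ...
print(ans)  # The answer to this question is ...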

Learning: RL and GRPO

Background

Recall the RL pipeline:

  • Each time, we get a user input from the dataset and sample a response from the model being optimized.
  • The response is evaluated by a “reward model” indicating how good it is, i.e., the scalar reward value.
  • The reward value is used to compute the loss function/gradient to optimize the model.
    • $\nabla_{\theta} J(\theta) = \mathbb{E} [ Q^\pi (s, a) \nabla_\theta \ln \pi_\theta (a|s) ]$
    • If the reward value is high, we maximize the probability of producing the response.
https://huggingface.co/blog/rlhf
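As a rough illustration of the last bullet (a sketch, not the exact RLHF objective; log_probs and rewards are placeholders for the summed token log-probabilities of sampled responses and their scalar rewards):

import torch

def policy_gradient_loss(log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    # REINFORCE-style: minimizing -(reward * log_prob) follows the gradient
    # E[ Q(s, a) * grad log pi(a|s) ], so high-reward responses become more likely.
    return -(rewards.detach() * log_probs).mean()

log_probs = torch.tensor([-12.3, -8.7], requires_grad=True)  # log pi_theta(response | prompt)
rewards = torch.tensor([1.0, 0.0])                           # scalar rewards of the responses
loss = policy_gradient_loss(log_probs, rewards)
loss.backward()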

PPO is a popular RL algorithm that was widely adopted on many tasks before the LLM era. From a practical perspective, it is compute-expensive, as it involves four models:

  • Policy model: the LLM we want to optimize
  • Reward model: the LLM predicting how good an answer is to a given question; it is fixed during RL/PPO training
  • Critic model: the LLM predicting how good a “state” is in the RL setting
    • Understanding this requires a deeper dive into the PPO algorithm.
    • It is not “fixed”; it is optimized along with the policy model during the RL/PPO training
  • Reference model: the initial SFT model used to compute the KL loss

Motivation

We want to reduce the cost of RL training:

  • Reward model: in some cases, the answer to a question is very easy to verify without the need for a neural network (rule-based reward)
    • Math: we can just compare the numbers (except for the proof problems)
    • Code: we can execute the solution code against the test cases
  • Critic model: in general, we just want to compute the policy gradient: $\nabla_{\theta} J(\theta) = \mathbb{E} [ Q^\pi (s, a) \nabla_\theta \ln \pi_\theta (a|s) ]$
    • There are many different policy gradient methods (many different ways to compute the policy gradient). They differ in the way to estimate $Q^\pi (s, a)$ .
      • Monte Carlo: we can perform rollouts to estimate how good $(s, a)$ is, similar to MCTS, where we do not need any neural model.
        • $Q^\pi (s_t, a_t) = \mathbb{E}_\pi[ G_t | s_t, a_t ], ~ G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots$
      • Neural network learning: we learn a value function $V(s)$ to predict how good a state $s$ is and then we can use it to calculate $Q^\pi (s, a)$.
        • Temporal Difference: $\text{TD error} = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$ , where $V(s)$ is estimated by the critic model.

Therefore, if we focus on “easy-to-verify” problems, we only need the policy model and the reference model for RL training.
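To make the Monte Carlo estimate above concrete, here is a tiny sketch (plain Python, no neural model) that turns the per-step rewards of one rollout into discounted returns $G_t$; averaging such returns over many rollouts estimates $Q^\pi(s, a)$:

def discounted_returns(rewards, gamma=0.99):
    """G_t = r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ..."""
    returns, g = [], 0.0
    for r in reversed(rewards):  # accumulate from the last step backwards
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

print(discounted_returns([0.0, 0.0, 1.0], gamma=0.9))  # approximately [0.81, 0.9, 1.0]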

Technique: GRPO

https://arxiv.org/abs/2501.12948

If we ignore the clipping tricks used to stabilize the training, the objective can be simplified as:

$$ \mathcal{J}(\theta) = {1 \over G} \sum_{i=1}^G A_i \pi_{\theta} (o_i | q) - \beta \mathbb{D}_{KL}(\pi_\theta || \pi_{\text{ref}}) $$
  • It is very similar to the general PG equation, as it is also a variant of PG.
    • $A$, the advantage, can be considered an alternative to $Q$ with lower variance, obtained by subtracting a baseline value.
      • The baseline here is estimated by the mean reward of a set of response candidates to a given question.
    • KL is used to prevent the model from going too far away from the original one.
      • The model after RL is still a language model so it should not be so different from the initial one.
    • The clipping tricks are the same as the ones in PPO.
  • From a practical perspective, we only need to run two models (at minimum) during training.
  • total reward = format reward + accuracy reward (a minimal sketch of this reward and the group-relative advantage follows below)
  • Compared to SFT: the LLM learns to maximize the reward by learning from exploration, rather than imitating existing data or imitating the behavior of a stronger model.
https://arxiv.org/pdf/2402.03300
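A minimal sketch of the rule-based reward and the group-relative advantage described above (the exact reward weights, the clipping, and the KL term are omitted; the tag parsing reuses the format from the beginning of this post):

import re
import statistics

def rule_based_reward(response: str, ground_truth: str) -> float:
    # format reward: the response must contain the <think>/<answer> tags
    format_ok = bool(re.search(r"<think>.*</think>\s*<answer>.*</answer>", response, re.DOTALL))
    # accuracy reward: compare the extracted answer with the ground truth
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    correct = m is not None and m.group(1).strip() == ground_truth.strip()
    return float(format_ok) + float(correct)  # total reward = format + accuracy

def group_relative_advantages(rewards: list[float]) -> list[float]:
    # A_i = (r_i - mean(r)) / std(r): the group mean acts as the baseline,
    # so no critic model is needed.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

responses = ["<think>...</think><answer>42</answer>", "<answer>41</answer>"]
rewards = [rule_based_reward(r, "42") for r in responses]  # [2.0, 0.0]
print(group_relative_advantages(rewards))                  # [1.0, -1.0]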

The Entire Training: A Mixture of Multiple Techniques and Stages

(Figure: grpo.drawio, an overview of the entire training pipeline)

Discussions

The “aha moment”

https://arxiv.org/abs/2501.12948

Where do the capabilities of search, verification, and backtracking come from?

  • These patterns must already exist in the base model. The RL training amplifies them.

  • In particular, researchers at Sea AI Lab (SAIL) showed that base models can be easily prompted to produce self-reflection and that the “aha” moment from the DeepSeek-R1 paper may be more a symptom of the base model than the RL optimisation process. (src)

Is GRPO the most effective solution?

Definitely not.

  • Using RL to improve LLM reasoning capability has been “common sense” among “insiders” for at least one year. OpenAI’s o1 is the first well-known, publicly released reasoning model based on RL.
  • GRPO is just a variant of PG. Compared with PPO, it requires fewer resources. But there are other similar variants that are also “cheap” (they use MC to estimate Q).
    • People use other variants for various reasons. For example, REINFORCE++ has a very similar loss function; it removes the division by the standard deviation used in GRPO’s advantage and uses the K1 KL estimator (GRPO uses K3) to make training more stable.
    • Some people say PPO is more stable during training.

Is RL all you need?

Not yet.

  • Many capabilities are encoded into LLMs during pre-training. They do not train an LLM from scratch using RL.
  • There are several SFT-related stages in the training process of R1, because we want an LLM that can do more than coding and math problem solving.
    • An open question: how do the capabilities gained from rule-based RL training generalize to general tasks?
      • Even for math problems, there is no rule-based reward system to evaluate a proof.


Comments for group sharing on 2025-02-13: we will quickly review the pre-training and post-training of LLMs for a broader audience, then move to RLHF and DeepSeek-related topics. We will not cover detailed RL algorithms or their mathematical formulations, as they take a lot of time to explain and understand.

Pretraining

Summary:

  • Goal: To model the language distribution in the wild.
  • Data: Existing data collected from the Internet, e.g., open-source code from GitHub Repos.
  • Loss function: next-token prediction, cross-entropy loss: $-\sum_{c=1}^M y_c \log p_c$ .
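A tiny worked example of this loss (random logits, only to show how the cross-entropy above is computed over token positions):

import torch
import torch.nn.functional as F

vocab_size, seq_len = 50_000, 8
logits = torch.randn(seq_len, vocab_size)           # model outputs at each position
targets = torch.randint(0, vocab_size, (seq_len,))  # the actual next token at each position

# cross-entropy -sum_c y_c log p_c, averaged over positions
loss = F.cross_entropy(logits, targets)
print(loss.item())  # about log(vocab_size), roughly 10.8, for random logits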

Literature:

  • OLMo: Fully open-sourced LLM pre-training from UW/AI2.

Post-training

Summary:

  • Motivation: Pre-trained LLMs encode massive knowledge. However, they cannot “chat” with users smoothly. The pre-training objective endows the model with the capability of continuing a given text, rather than chatting and answering questions.

    • An example of a base model:

      from transformers import pipeline

      pp = pipeline(model='allenai/OLMoE-1B-7B-0125')

      while True:
          inp = input("Input: ")
          out = pp(inp)
          print(out)
          print('=' * 32)

      '''
      Input: The most exciting AI advancement
      [{'generated_text': ' in the last few years has been the rise of deep learning. Deep learning is a type of machine'}]
      ================================
      Input: How are you?
      [{'generated_text': '\n\nHow are you?\n\nHow are you?\n\nHow are you?\n\n'}]
      ================================
      Input: Please help me solve this equation, x^2 + 2x + 1 = 4 .
      [{'generated_text': '\n\n2. Originally Posted by johnsy123\nSolve the equation, x^2 +'}]
      '''
    • An example of an instruct model:

      from transformers import pipeline

      pp = pipeline(model='allenai/OLMoE-1B-7B-0125-Instruct', return_full_text=False)

      while True:
          inp = input("Input: ")
          out = pp([{
              'role': 'user',
              'content': inp,
          }])
          print(out)
          print('=' * 32)

      '''
      Input: The most exciting AI advancement
      [{'generated_text': 'Determining the "most exciting AI advancement" is highly subjective and depends on various factors, including personal'}]
      ================================
      Input: How are you?
      [{'generated_text': "As an AI, I don't have feelings or personal experiences, but I'm functioning correctly and ready"}]
      ================================
      Input: Please help me solve this equation, x^2 + 2x + 1 = 4 .
      [{'generated_text': 'To solve the equation \\(x^2 + 2x + 1 = 4\\), we first simplify'}]
      '''
  • Goal:

    • Alignment:
      • Functionality: follow the instructions by users; chat with users in a friendly way
      • Safety: avoid generating harmful content
    • Performance enhancement: specialize the LLMs in specific domains, such as math, coding, medical QA, etc.

Literature:

  • Tulu 3: Fully open-sourced LLM post-training from UW/AI2.

Supervised Fine-tuning

Summary:

  • Loss function: (still) next-token prediction, cross-entropy loss.

  • Data:

    • Format: “chat format”

      • Chat template associated with a model and its tokenizer:

        from transformers import AutoTokenizer

        input_message = 'Hello, how are you?'
        input_chat = [
            {
                'role': 'user',
                'content': input_message,
            }
        ]

        tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3.1-8B-Instruct')
        print(tokenizer.get_chat_template())

        applied_msgs = tokenizer.apply_chat_template(input_chat, tokenize=False)
        print(applied_msgs)
        '''
        <|begin_of_text|><|start_header_id|>system<|end_header_id|>

        Cutting Knowledge Date: December 2023
        Today Date: 26 Jul 2024

        <|eot_id|><|start_header_id|>user<|end_header_id|>

        Hello, how are you?<|eot_id|>
        '''

        completed_chat = [
            {
                'role': 'user',
                'content': input_message,
            },
            {
                'role': 'assistant',
                'content': 'I am fine, thank you!',
            }
        ]
        print(tokenizer.apply_chat_template(completed_chat, tokenize=False))
        '''
        <|begin_of_text|><|start_header_id|>system<|end_header_id|>

        Cutting Knowledge Date: December 2023
        Today Date: 26 Jul 2024

        <|eot_id|><|start_header_id|>user<|end_header_id|>

        Hello, how are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

        I am fine, thank you!<|eot_id|>
        '''
    • Sources: real-world data (possibly with some transformations) or synthetic data

Synthetic Data Generation

  • Motivation: Existing Internet data is exhausted. We need to create new data to improve the LLMs.
  • Technique:
    • For a specific domain we want to optimize, we want to generate an SFT dataset $D$ = { (user_message, model_response) } .
    • Select an existing model $M_g$ to generate $D$ .
    • Use $D$ to SFT a model $M_s$ to get a better one: $M_s'$ (a minimal sketch of this generation loop follows after this list).
  • Discussion:
    • When $M_g$ is a more powerful model than $M_s$ , we are actually distilling the capabilities of $M_g$ into $M_s$ .
      • Off-policy data: $D$ comes from $M_g$ , so it is in the distribution of $M_g$ , but off the distribution of $M_s$ .
    • On-policy data: $M_g = M_s$ .
      • Self-Instruct: generate $D$ from $M_s$ , apply some filtering criteria to get $D' \subseteq D$ , and use it to SFT $M_s$ to get $M_s'$ .
      • Sometimes self-instruct can be better because it uses on-policy data.
  • Challenges:
    • How to create diverse data covering as many cases as possible?
    • How to verify the quality of generated data?
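A hedged sketch of the generation step described above: `generator` plays the role of $M_g$ (the model name is reused from the earlier examples; the seed questions and prompts are placeholders, not the recipe of any particular paper), and the collected pairs would then be used to SFT $M_s$.

from transformers import pipeline

# Hypothetical choice of M_g: any instruct model served through a chat pipeline.
generator = pipeline(model='allenai/OLMoE-1B-7B-0125-Instruct', return_full_text=False)

seed_questions = [
    "Explain what a hash map is and when to use it.",
    "Write a Python function that reverses a linked list.",
]

sft_dataset = []  # D = { (user_message, model_response) }
for question in seed_questions:
    response = generator([{'role': 'user', 'content': question}])[0]['generated_text']
    sft_dataset.append({'user_message': question, 'model_response': response})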

Literature:

  • OSS-Instruct: generating synthetic coding data for SFT
    • Motivation: if you just prompt an LLM to generate coding tasks, e.g., LeetCode-like problems with reference solutions, it cannot always produce new (problem, solution) pairs.
      • Related issue: model collapse
    • Techniques: Inject some “noise” into the generation process to diversify the generated content (a prompt-construction sketch follows after this list).
      • Randomly select open-source code snippets from the Internet data.
      • Instruct an LLM to design a coding task inspired by the randomly selected code snippet, and also generate a corresponding solution for the coding task.
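A hedged sketch of the idea (the prompt template and seed snippets below are illustrative placeholders, not the exact ones used by OSS-Instruct):

import random

# In practice, seed snippets are randomly sampled from open-source repositories.
seed_snippets = [
    "def chunk(lst, n):\n    return [lst[i:i + n] for i in range(0, len(lst), n)]",
    "with open(path) as f:\n    rows = [line.strip().split(',') for line in f]",
]

def build_oss_instruct_prompt(snippet: str) -> str:
    # The random snippet acts as "noise" that pushes the LLM toward new tasks.
    return (
        "Below is a code snippet taken from an open-source project:\n\n"
        f"{snippet}\n\n"
        "Design a new, self-contained coding problem inspired by this snippet, "
        "then provide a correct reference solution."
    )

prompt = build_oss_instruct_prompt(random.choice(seed_snippets))
# `prompt` would then be sent to an instruct model, e.g., via the pipeline shown earlier.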

Common Practice: synthetic data generation + rejection sampling + SFT

  • Synthetic data generation: OSS-Instruct
  • Rejection sampling: use some oracles to verify the quality of the generated data.
    • Example of NL2Code: also generate test cases during data synthesis: (problem, solution, test cases); then run the code against the test cases and only keep the triples whose solutions pass the test cases (see the sketch after this list).
    • Example of execution reasoning: Given a program and an input, predict the output value.
      • Instruct an existing model to generate the CoT reasoning process and the predicted output value.
      • Only select the responses whose predicted output values match the ground truth (we can execute the programs on the inputs to get the ground-truth output values).
  • Model training: do SFT on a base model
  • Evaluation:
    • NL2Code: EvalPlus (HumanEval, MBPP), LiveCodeBench, etc.
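A hedged sketch of the NL2Code rejection-sampling step: `candidate` is a toy synthesized (problem, solution, test cases) triple, and we keep it only if the solution passes its own tests (in practice, generated code must be executed in a sandbox):

def passes_tests(solution_code: str, test_code: str) -> bool:
    # Run the candidate solution together with its test cases.
    # WARNING: exec on generated code is unsafe outside a sandbox.
    namespace = {}
    try:
        exec(solution_code, namespace)
        exec(test_code, namespace)  # tests raise AssertionError on failure
        return True
    except Exception:
        return False

candidate = {
    'problem': 'Implement add(a, b) that returns the sum of two numbers.',
    'solution': 'def add(a, b):\n    return a + b',
    'tests': 'assert add(1, 2) == 3\nassert add(-1, 1) == 0',
}

kept = [c for c in [candidate] if passes_tests(c['solution'], c['tests'])]
# Only the passing triples are kept for SFT.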

RLHF

RL

https://en.wikipedia.org/wiki/Reinforcement_learning

Summary:

  • Formulation:
    • The agent lives in an environment with an initial state $s_0$ .
    • The agent performs an action $a_0$ based on its policy $\pi(a | s)$ .
    • The action leads to the environment change, i.e. a state transition from $s_0$ to $s_1$ .
    • By performing $a_0$ , the agent receives a reward $r_0$ .
  • Learning objective:
    • RL optimizes the policy to maximize the total reward.
      • vs. SFT, which optimizes the “policy” so that it is similar to another “policy”.
    • Loss function: Policy gradient: $\nabla_{\theta} J(\theta) = \mathbb{E} [ Q^\pi (s, a) \nabla_\theta \ln \pi_\theta (a|s) ]$
      • Intuitive understanding: when following the policy $\pi$ , if $(s, a)$ is good, then we maximize its log probability; if $(s, a)$ is bad, then we minimize its log probability.
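To make the formulation concrete, here is a toy hand-rolled agent-environment loop (everything below is hypothetical and only illustrates states, actions, rewards, and transitions):

import random

class ToyEnv:
    """A 1-D corridor: the state is a position; reaching position 3 gives reward 1."""
    def __init__(self):
        self.state = 0
    def step(self, action):      # action in {-1, +1}
        self.state += action     # state transition s_t -> s_{t+1}
        reward = 1.0 if self.state == 3 else 0.0
        return self.state, reward, self.state == 3

def policy(state):
    # A random policy pi(a|s); RL would optimize it to maximize the total reward.
    return random.choice([-1, +1])

env, total_reward = ToyEnv(), 0.0
for _ in range(1000):            # cap the episode length
    action = policy(env.state)
    state, reward, done = env.step(action)
    total_reward += reward
    if done:
        break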

RLHF

  • Reward Model: r = RewardModel( (user_input, model_response) ); assert isinstance(r, float);
    • response_candidates: List[str] = SFTModel( user_input )
    • candidate_scores: List[float] = [ HumanAnnotator( (user_input, candidate_response) ) for candidate_response in response_candidates ]
    • RewardModel = SFT( InitRewardModel, response_candidates, candidate_scores )
  • RLHF Training:
    • Sample a user prompt from datasets.
    • SFTed Model generates a response.
    • The response is scored by the reward model.
    • The reward signal is used to optimize the SFTed model.
https://huggingface.co/blog/rlhf https://openai.com/index/instruction-following
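One common way to train the reward model from human preferences is a pairwise (Bradley-Terry) ranking loss; this is a hedged sketch, not necessarily the exact recipe sketched above, and the scores are placeholders:

import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Push r(x, y_chosen) above r(x, y_rejected) for each labeled pair.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

r_chosen = torch.tensor([1.3, 0.2], requires_grad=True)    # reward-model scores (placeholders)
r_rejected = torch.tensor([-0.5, 0.4], requires_grad=True)
loss = pairwise_reward_loss(r_chosen, r_rejected)
loss.backward()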

Better Shell Setup

This also installs some common dependencies if you have sudo access. If not, comment out the sudo commands.

git clone https://github.com/Co1lin/ML-Env-Setup
cd ML-Env-Setup
bash 0_basic.sh

Environment Variables

Set these environment variables in your .bashrc and/or .zshrc.

export HF_HOME=<a shared folder on a fast storage>
export HF_TOKEN_PATH=<a personal folder>
export TMPDIR=<redirect this when /tmp is on / and the storage is limited>

NVIDIA GPU Related

  • Always monitor the status of GPUs when you plan to use or are using them. nvitop (pip install nvitop) is recommended.
  • Only use GPUs that are needed for your task! Do NOT blindly use all GPUs! Limit visible GPUs to your processes by export CUDA_VISIBLE_DEVICES=2,3, for example.
    • Monitor the memory usage to help you decide how many GPUs to use.
    • Monitor “GPU-Util”
      • to make sure the GPUs are not idle. If the GPU-Util stays at 0%, you should suspect that your processes are stuck!
      • to help you optimize your code, if GPU-Util is low.
  • Make sure you don’t have any dead process occupying the GPUs! One way to exit your process: https://blog.co1in.me/skills/py-embed-trick/#Exit-the-program.
    • Show dead processes that cannot be shown by nvidia-smi:
      # sudo needed to view processes of other users
      fuser -v /dev/nvidia*
  • Avoid using GPUs that are in use by other users! (Unless you are very sure it will not affect them by OOM or other reasons.)
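If you prefer to limit the visible GPUs from inside a script rather than in the shell, set the variable before any CUDA initialization (a small sketch; the GPU indices are just an example):

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '2,3'  # must be set before CUDA is initialized

import torch
print(torch.cuda.device_count())  # 2: only the two selected GPUs are visible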

Python Related

When fuzzing the compilation stack in PyTorch 2.0 with NeuRI, we found some interesting bugs. Here we reveal one of them, caused by a misuse of the C++ keyword __restrict__.

The __restrict__ Keyword in C++

First, let’s take a look at the effect of the __restrict__ keyword in C++ through a simple example below.

// test.cpp
void f1(int* a, int* b, int* x) {
    *a += *x;
    *b += *x;
}

void f2(int* __restrict__ a, int* __restrict__ b, int* __restrict__ x) {
    *a += *x;
    *b += *x;
}

The difference between the two functions is that all the pointer arguments in f2 are decorated with __restrict__, while those in f1 are not. __restrict__ tells the compiler that these pointers are unique, which means they will not refer to the same memory address. The compiler can then perform some optimizations. Let’s see how the compiler optimizes the code through the assembly.

We get the assembly code below by

clang-14 -c -g -O1 test.cpp -o test.o
llvm-objdump-14 -S test.o > test.S
; test.S
Disassembly of section .text:

0000000000000000 <_Z2f1PiS_S_>:
; *a += *x;
0: 8b 02 movl (%rdx), %eax
2: 01 07 addl %eax, (%rdi)
; *b += *x;
4: 8b 02 movl (%rdx), %eax
6: 01 06 addl %eax, (%rsi)
; }
8: c3 retq
9: 0f 1f 80 00 00 00 00 nopl (%rax)

0000000000000010 <_Z2f2PiS_S_>:
; *a += *x;
10: 8b 02 movl (%rdx), %eax
12: 01 07 addl %eax, (%rdi)
; *b += *x;
14: 01 06 addl %eax, (%rsi)
; }
16: c3 retq

The first half of the assembly code corresponds to f1, while the second half corresponds to f2. We can see that the only difference is that, in the assembly for f2, the second movl (%rdx), %eax instruction is omitted.

Normally, *b += *x will be compiled into two instructions on x86, like the assembly for f1. First, it needs to load *x from memory ((%rdx)) into a register (%eax); then it adds the value in this register to the data stored in memory, which is *b ((%rsi)). You may notice that *x is already loaded by the first instruction for *a += *x;, but we still need to load it again for *b += *x; in case the data pointed to by x has changed, which is exactly what happens when a == x (a and x point to the same memory address).

However, decorating the pointers with __restrict__ tells the compiler that they are different. As a result, the compiler believes that *x will not be changed in f2, so it only loads it once.

The PyTorch Bug

By running the fuzzer NeuRI, we found a bug that can be triggered by the code below.

import torch

p0 = torch.tensor([[4.9334, 5.5571]]) # (1, 2)

def fn():
    v7 = torch.cat([p0, p0], dim=0) # v7: (2, 2)
    v1 = torch.mul(v7, v7) # v1: (2, 2)
    return v7, v1

ret_eager = fn()
compiled = torch.compile(fn)
ret_compiled = compiled()

assert torch.allclose(ret_eager[0], ret_compiled[0])
# ^^^ no error
assert torch.allclose(ret_eager[1], ret_compiled[1])
''' ^^^ WRONG!
AssertionError:
ret_eager[1] = tensor([[24.3384, 30.8814],
[24.3384, 30.8814]])
ret_compiled[1] = tensor([[0., 0.],
[0., 0.]])
'''

As you can see, fn is composed of two tensor operations. After compilation, it gives wrong results for the second return value v1. (All values in v1 are zeros, which is incorrect.)

What torch.compile does is generate a C++ kernel function to compute fn. We add some comments to help understand how the C++ function implements the Python function.

extern "C" void kernel(const float* __restrict__ in_ptr0, // p0
                       const float* __restrict__ in_ptr1, // p0
                       const float* __restrict__ in_ptr2, // v7
                       float* __restrict__ out_ptr0, // first half of v7
                       float* __restrict__ out_ptr1, // last half of v7
                       float* __restrict__ out_ptr2) // v1
{
    { // 1st part of cat operation: copy values in p0 to the first half of v7
        #pragma GCC ivdep
        for(long i0=0; i0<2; i0+=1)
        {
            auto tmp0 = in_ptr0[i0];
            out_ptr0[i0] = tmp0;
        }
    }
    { // 2nd part of cat operation: copy values in p0 to the last half of v7
        #pragma GCC ivdep
        for(long i0=0; i0<2; i0+=1)
        {
            auto tmp0 = in_ptr1[i0];
            out_ptr1[i0] = tmp0;
        }
    }
    { // mul operation: v1 <- element-wise multiplication of v7 and v7
        #pragma GCC ivdep
        for(long i0=0; i0<4; i0+=1)
        {
            auto tmp0 = in_ptr2[i0];
            auto tmp1 = tmp0 * tmp0;
            out_ptr2[i0] = tmp1;
        }
    }
}

As you can see, it uses __restrict__ for all pointer arguments, which tells the compiler that they do not alias. But actually, they do. in_ptr2 points to the underlying memory of tensor v7, while out_ptr0 points to the first half of v7 and out_ptr1 points to the second half. They overlap.

The values of v7 are changed by writing to the addresses referred to by out_ptr0 and out_ptr1 in the first two for loops. So, when reading the data of v7 through in_ptr2 in the last for loop, the kernel should load the values after the writes to out_ptr0 and out_ptr1; if it loads them earlier, it should reload them to ensure correctness. Otherwise, stale values of v7 will be used in the multiplication. I guess that is why the compiled function gives zeros.

Finally, the developers fixed this bug by removing the usage of __restrict__ keywords in code generation.

However, I could not reproduce the wrong behavior caused by __restrict__ at the assembly level. I compiled the C++ function above with clang-14 -c -g -O3 k.cpp -o k.o && llvm-objdump-14 -S k.o and got the assembly code below.

0000000000000000 <kernel>:
; out_ptr0[i0] = tmp0;
0: 48 8b 07 movq (%rdi), %rax
3: 48 89 01 movq %rax, (%rcx)
; out_ptr1[i0] = tmp0;
6: 48 8b 06 movq (%rsi), %rax
9: 49 89 00 movq %rax, (%r8)
c: 31 c0 xorl %eax, %eax
e: 66 90 nop
; auto tmp0 = in_ptr2[i0];
10: f3 0f 10 04 82 movss (%rdx,%rax,4), %xmm0 # xmm0 = mem[0],zero,zero,zero
; auto tmp1 = tmp0 * tmp0;
15: f3 0f 59 c0 mulss %xmm0, %xmm0
; out_ptr2[i0] = tmp1;
19: f3 41 0f 11 04 81 movss %xmm0, (%r9,%rax,4)
; for(long i0=0; i0<4; i0+=1)
1f: 48 83 c0 01 addq $1, %rax
23: 48 83 f8 04 cmpq $4, %rax
27: 75 e7 jne 0x10 <kernel+0x10>
; }
29: c3 retq

I think this version works correctly, since it loads the operands for the multiplication with movss (%rdx,%rax,4), %xmm0, which reads the values from memory; it does not load the values early and then use stale data. So I do not know exactly why the compiled function in PyTorch gives wrong results. We can only say that, by its definition, __restrict__ should not be used there.

Insert the breakpoint

There’s a useful trick for efficiently debugging Python code. Say you have a loop like the one below; how can you interactively access the list l at each loop step?

l = []
for i in range(5):
    l.append(i)

You can insert a “breakpoint” as follows.

# test.py
from IPython import embed # pip install ipython
l = []
for i in range(5):
    l.append(i)
    embed()

Then you run python test.py in the shell, and an interactive environment will pop up like this:

colin ❯ python test.py
Python 3.10.4 (main, Mar 31 2022, 03:38:35) [Clang 12.0.0 ]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]:

The program stops at the position of embed(), and you can access variables visible at this point, like:

In [1]: l
Out[1]: [0]

You can also execute most kinds of Python code here, like:

In [2]: l.append(100)

In [3]: l
Out[3]: [0, 100] # l is changed!

In [4]: import random; print(random.random())
0.42541864192778645

You can use quit to continue running the program, and the program will stop at the next breakpoint if there’s any. Ctrl+D is equivalent to this.

In [5]: quit

Python 3.10.4 (main, Mar 31 2022, 03:38:35) [Clang 12.0.0 ]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: l
Out[1]: [0, 100, 1]

Exit the program

Sometimes you insert embed() inside a loop that repeats many times, and at some point you just want to exit the program. But you will find that neither Ctrl+C nor Ctrl+D works here, so we could just close the shell and the process would be terminated. :)

Closing the shell works, but we have something better. Say now you are in the interactive environment provided by embed. You can press Ctrl+Z here, and then you are back to your shell and see something like below.

In [2]: # press Ctrl+Z here!

[1] + 4810 suspended python test.py
colin ❯ # we are back to the shell!

BUT we’re NOT done yet! Ctrl+Z just sends the “terminal stop” signal (SIGTSTP) to the foreground running process. The process will not take any more CPU resources, but it still occupies memory and ISN’T dead yet. You can even use fg to bring it back!

colin ❯ fg
[1] - 4810 continued ipython
In [2]: # you can continue to use the interactive environment here

To terminate the process completely, you need to use the kill -9 command, which sends a SIGKILL signal telling the process to shut down immediately. In our case, you can execute kill -9 %1 to terminate the process just suspended by Ctrl+Z.

%1 means “job number 1” in the current shell. You can run jobs to list all jobs in the current shell, like:

colin ❯ ipython # run ipython in the shell first
Python 3.10.4 (main, Mar 31 2022, 03:38:35) [Clang 12.0.0 ]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.4.0 -- An enhanced Interactive Python. Type '?' for help.


[1] + 6862 suspended ipython # Ctrl+Z to suspend ipython
colin ❯ python # then run python in the shell
Python 3.10.4 (main, Mar 31 2022, 03:38:35) [Clang 12.0.0 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
[2] + 6879 suspended python # Ctrl+Z to suspend python
colin ❯ jobs # list all jobs in the current shell
[1] - suspended ipython # can be terminated with kill -9 %1
[2] + suspended python # can be terminated with kill -9 %2

Combining Ctrl+Z and kill -9 is also useful for stopping a process immediately, because sometimes, after Ctrl+C, the process does some post-processing that may take a long time. In that case, you can use this approach to stop it right away.

Deactivate the current embed

Thanks to my friend @Leo’s reminder, we can use %kill_embedded in the IPython interactive environment to deactivate the current embed() while keeping the others working. For example, in the program below, after stopping at the embed() in the first loop twice, we can run %kill_embedded, confirm it, and quit to skip the remaining stops in the first loop, while the embed() in the second loop still works, so we will stop there.

# test2.py
from IPython import embed

l = []
for i in range(4):
    l.append(i)
    embed()

print('==== finish the first loop! ====')

for i in range(4, 8):
    l.append(i)
    embed()

Execution log in the shell:

colin ❯ python test2.py
Python 3.10.4 (main, Mar 31 2022, 03:38:35) [Clang 12.0.0 ]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: l
Out[1]: [0]

In [2]: quit

Python 3.10.4 (main, Mar 31 2022, 03:38:35) [Clang 12.0.0 ]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: l
Out[1]: [0, 1]

In [2]: %kill_embedded <<<<---- deactivate this embed
Namespace(instance=False, exit=False, yes=False)
Are you sure you want to kill this embedded call_location? [y/N] y <<<<---- confirm the deactivation
This embedded IPython call location will not reactivate anymore once you exit.

In [3]: l
Out[3]: [0, 1]

In [4]: quit <<<<---- quit the second stop

==== finish the first loop! ==== <<<<---- embed in the first loop will not work any more and we directly reach here
Python 3.10.4 (main, Mar 31 2022, 03:38:35) [Clang 12.0.0 ]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: l <<<<---- stop at the embed in the second loop
Out[1]: [0, 1, 2, 3, 4]

In [2]: quit

References:

  • What is effect of CTRL + Z on a unix/Linux application
  • https://superuser.com/questions/275433/what-does-1-in-kill-1-mean
