Summary of DeepSeek R1

If the math equations are not rendered correctly, please force-refresh the page.

Input and Output

https://arxiv.org/abs/2501.12948

They instruct the model to give responses in the following format:

<think>
To solve this problem, we can first ...
</think>
<answer>
The answer to this question is ...
</answer>

In this way, they clearly separate the CoT reasoning process from the answer, so we can easily extract each part.
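
For illustration, here is a minimal parsing sketch. The tag names follow the template above; the helper itself is an assumption for illustration, not code from the paper.

```python
import re

def parse_response(text: str):
    """Split a model response into its <think> and <answer> parts."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return (
        think.group(1).strip() if think else None,
        answer.group(1).strip() if answer else None,
    )

cot, final = parse_response(
    "<think>\nTo solve this problem, we can first ...\n</think>\n"
    "<answer>\nThe answer to this question is ...\n</answer>"
)
print(cot)    # "To solve this problem, we can first ..."
print(final)  # "The answer to this question is ..."
```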

Learning: RL and GRPO

Background

Recall the RL pipeline:

  • Each time, we get a user input from the dataset and sample a response from the model being optimized.
  • The response is evaluated by a “reward model”, which outputs a scalar reward value indicating how good the response is.
  • The reward value is used to compute the loss function/gradient to optimize the model.
    • $\nabla_{\theta} J(\theta) = \mathbb{E} [ Q^\pi (s, a) \nabla_\theta \ln \pi_\theta (a|s) ]$
    • If the reward value is high, we maximize the probability of producing the response.
https://huggingface.co/blog/rlhf
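
As a concrete illustration of the policy gradient above, here is a minimal REINFORCE-style sketch in PyTorch. It assumes we already have the per-token log-probabilities of a sampled response and a scalar reward; it is illustrative only, not the exact RLHF/PPO loss.

```python
import torch

def policy_gradient_loss(logprobs: torch.Tensor, reward: float) -> torch.Tensor:
    # logprobs: log pi_theta(a_t | s_t) for each generated token, shape (T,)
    # reward:   scalar reward for the whole response (no baseline subtracted)
    # Minimizing this loss is gradient ascent on E[R * log pi].
    return -(reward * logprobs.sum())

# Toy example with fake logits standing in for the policy model's output.
logits = torch.randn(5, 100, requires_grad=True)   # 5 tokens, vocab of 100
sampled = torch.randint(0, 100, (5,))               # pretend these tokens were sampled
logprobs = torch.log_softmax(logits, dim=-1)[torch.arange(5), sampled]
loss = policy_gradient_loss(logprobs, reward=1.0)
loss.backward()  # a high reward pushes up the probability of the sampled response
```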

PPO is a popular RL algorithm that was widely adopted on many tasks before the LLM era. From a practical perspective, it is compute-expensive, as it involves four models:

  • Policy model: the LLM we want to optimize
  • Reward model: the LLM predicting how good an answer is to a given question; it is fixed during RL/PPO training
  • Critic model: the LLM predicting how good a “state” is in the RL setting
    • Understanding this requires a deeper dive into the PPO algorithm.
    • It is not “fixed”; it is optimized along with the policy model during the RL/PPO training
  • Reference model: the initial SFT model used to compute the KL loss

Motivation

We want to reduce the cost of RL training:

  • Reward model: in some cases, the answer to a question is very easy to verify without needing a neural network (rule-based reward)
    • Math: we can just compare the numbers (except for proof problems)
    • Code: we can execute the solution code against the test cases
  • Critic model: in general, we just want to compute the policy gradient: $\nabla_{\theta} J(\theta) = \mathbb{E} [ Q^\pi (s, a) \nabla_\theta \ln \pi_\theta (a|s) ]$
    • There are many different policy gradient methods, i.e., many different ways to compute the policy gradient. They differ in how they estimate $Q^\pi (s, a)$ (see the numerical sketch after this list).
      • Monte Carlo: we can perform rollouts to estimate how good $(s, a)$ is, similar to MCTS; no neural model is needed.
        • $Q^\pi (s_t, a_t) = \mathbb{E}_\pi[ G_t | s_t, a_t ], ~ G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots$
      • Neural network learning: we learn a value function $V(s)$ to predict how good a state $s$ is and then we can use it to calculate $Q^\pi (s, a)$.
        • Temporal Difference: $\text{TD error} = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$ , where $V(s)$ is estimated by the critic model.
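
A tiny numerical sketch of the two estimators above (pure Python, illustrative values only):

```python
def monte_carlo_return(rewards, gamma=0.99):
    # G_t = r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ...
    # Needs a full rollout, but no learned model.
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def td_error(r_next, v_s, v_s_next, gamma=0.99):
    # TD error = r_{t+1} + gamma * V(s_{t+1}) - V(s_t)
    # V(s) comes from a learned critic model.
    return r_next + gamma * v_s_next - v_s

print(monte_carlo_return([0.0, 0.0, 1.0]))           # 0.9801
print(td_error(r_next=0.0, v_s=0.4, v_s_next=0.5))   # 0.095
```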

Therefore, if we focus on “easy-to-verify” problems, we can keep only the policy model and the reference model for RL training.
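
A minimal sketch of such a rule-based reward, combining a format reward and an accuracy reward (the tags follow the template in the Input and Output section; the exact rules used in the paper may differ):

```python
import re

def format_reward(response: str) -> float:
    # 1 if the response follows <think>...</think><answer>...</answer>.
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, response.strip(), re.DOTALL) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    # 1 if the last number inside <answer> matches the ground truth
    # (works for numeric math answers, not for proofs).
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if not m:
        return 0.0
    nums = re.findall(r"-?\d+(?:\.\d+)?", m.group(1))
    return 1.0 if nums and float(nums[-1]) == float(ground_truth) else 0.0

def total_reward(response: str, ground_truth: str) -> float:
    # Both terms are cheap, deterministic rules: no reward model is needed.
    return format_reward(response) + accuracy_reward(response, ground_truth)
```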

Technique: GRPO

https://arxiv.org/abs/2501.12948

If we ignore the clipping trick used to stabilize the training, the objective can be simplified as:

$$ \mathcal{J}(\theta) = {1 \over G} \sum_{i=1}^G A_i \pi_{\theta} (o_i | q) - \beta \mathbb{D}_{KL}(\pi_\theta || \pi_{\text{ref}}) $$
  • It is very similar to the general PG equation, as it is also a variant of PG.
    • $A$, the advantage, can be considered an alternative to $Q$ with lower variance, obtained by subtracting a baseline value.
      • The baseline here is estimated by the mean reward of a group of response candidates to a given question (see the sketch after this list).
    • KL is used to prevent the model from going too far away from the original one.
      • The model after RL is still a language model so it should not be so different from the initial one.
    • The clipping trick is the same as the one in PPO.
  • From a practical perspective, we only need to run two models (at minimum) during training: the policy model and the reference model.
  • total reward = format reward + accuracy reward
  • Compared to SFT: the LLM learns to maximize the reward by learning from its own explorations, rather than imitating existing data or the behavior of a stronger model.
https://arxiv.org/pdf/2402.03300
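
To make the simplified objective concrete, here is a sketch of the group-relative advantage and loss for a single question. It is a REINFORCE-style surrogate under simplifying assumptions: the importance-sampling ratio and clipping are omitted, and the KL term is applied per response rather than per token; the toy numbers and the `beta` value are illustrative only.

```python
import torch

def grpo_loss(logprobs, ref_logprobs, rewards, beta=0.04):
    # logprobs:     log pi_theta(o_i | q) summed over tokens, one per response, shape (G,)
    # ref_logprobs: the same quantity under the frozen reference model, shape (G,)
    # rewards:      rule-based rewards r_i for the G sampled responses, shape (G,)

    # Group-relative advantage: the baseline is the mean reward of the group,
    # so no critic model is needed.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # k3 KL estimator (as in GRPO): r - log(r) - 1 with r = pi_ref / pi_theta, always >= 0.
    log_ratio = ref_logprobs - logprobs
    kl = torch.exp(log_ratio) - log_ratio - 1.0

    # Maximize advantage-weighted log-likelihood; penalize drifting from the reference model.
    return -(advantages * logprobs - beta * kl).mean()

# Toy usage: G = 4 responses to one question, two of them correct.
logprobs = torch.tensor([-12.3, -15.1, -9.8, -11.0], requires_grad=True)
ref_logprobs = torch.tensor([-12.0, -14.9, -10.2, -11.3])
rewards = torch.tensor([2.0, 0.0, 2.0, 1.0])  # format reward + accuracy reward
grpo_loss(logprobs, ref_logprobs, rewards).backward()
```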

The Entire Training: A Mixture of Multiple Techniques and Stages

(Figure: grpo.drawio — overview diagram of the multi-stage training pipeline.)

Discussions

The “aha moment”

https://arxiv.org/abs/2501.12948

Where do the capabilities of search, verification, and backtracking come from?

  • These patterns must already exist in the base model. The RL training amplifies them.

  • In particular, researchers at Sea AI Lab (SAIL) showed that base models can be easily prompted to produce self-reflection and that the “aha” moment from the DeepSeek-R1 paper may be more a symptom of the base model than the RL optimisation process. (src)

Is GRPO the most effective solution?

Definitely not.

  • Using RL to improve LLM reasoning capability had been “common sense” among “insiders” for at least a year. OpenAI’s o1 is the first widely known, publicly released reasoning model based on RL.
  • GRPO is just a variant of PG. Compared with PPO, it requires fewer resources. But there are other similar variants that are also “cheap” (using MC to estimate $Q$).
    • People use other variants for various reasons. For example, REINFORCE++ has a very similar loss function, removing the division by the standard deviation used in GRPO and using the k1 KL loss (GRPO uses the k3 KL loss) to make training more stable (see the sketch below).
    • Some people say PPO is more stable during training.
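
For reference, here is a sketch of the two KL estimators mentioned above, using the common k1/k3 naming. Both take the log-probabilities of the sampled tokens under the current policy and the reference model; the function names are my own.

```python
import torch

def kl_k1(logprobs: torch.Tensor, ref_logprobs: torch.Tensor) -> torch.Tensor:
    # k1: plain log-ratio log(pi_theta / pi_ref).
    # Unbiased estimate of KL(pi_theta || pi_ref), but per-sample values can be
    # negative and the variance is higher.
    return logprobs - ref_logprobs

def kl_k3(logprobs: torch.Tensor, ref_logprobs: torch.Tensor) -> torch.Tensor:
    # k3: r - log(r) - 1 with r = pi_ref / pi_theta (the form used by GRPO).
    # Also unbiased, but always non-negative and lower variance.
    log_ratio = ref_logprobs - logprobs
    return torch.exp(log_ratio) - log_ratio - 1.0
```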

Is RL all you need?

Not yet.

  • Many capabilities are encoded into the LLM during pre-training. They do not train an LLM from scratch using RL.
  • There are lots of SFT-related stages in the training process of R1, because we want an LLM that can do more than coding and math problem solving.
    • An open question: how do the capabilities gained from rule-based RL training generalize to general tasks?
      • Even for math problems, there is no rule-based reward system to evaluate a proof.