Summary of DeepSeek R1

If the math equations are not rendered correctly, please force-refresh the page.

Input and Output

https://arxiv.org/abs/2501.12948

They instruct the model to give responses in the following format:

<think>
To solve this problem, we can first ...
</think>
<answer>
The answer to this question is ...
</answer>

In this way, they clearly separate the CoT reasoning process from the answer, so we can easily extract each part.
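
For illustration, here is a minimal parsing sketch. The tag names follow the template above; the helper itself is an assumption for illustration, not code from the paper.

```python
import re

def parse_response(text: str):
    """Split a model response into its <think> and <answer> parts."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return (
        think.group(1).strip() if think else None,
        answer.group(1).strip() if answer else None,
    )

cot, final = parse_response(
    "<think>\nTo solve this problem, we can first ...\n</think>\n"
    "<answer>\nThe answer to this question is ...\n</answer>"
)
print(cot)    # "To solve this problem, we can first ..."
print(final)  # "The answer to this question is ..."
```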

Learning: RL and GRPO

Background

Recall the RL pipeline:

  • Each time, we get a user input from the dataset and sample a response from the model being optimized.
  • The response is evaluated by a “reward model”, which outputs a scalar reward value indicating how good the response is.
  • The reward value is used to compute the loss function/gradient to optimize the model.
    • $\nabla_{\theta} J(\theta) = \mathbb{E} [ Q^\pi (s, a) \nabla_\theta \ln \pi_\theta (a|s) ]$
    • If the reward value is high, we maximize the probability of producing the response.
https://huggingface.co/blog/rlhf
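
As a concrete illustration of the policy gradient above, here is a minimal REINFORCE-style sketch in PyTorch. It assumes we already have the per-token log-probabilities of a sampled response and a scalar reward; it is illustrative only, not the exact RLHF/PPO loss.

```python
import torch

def policy_gradient_loss(logprobs: torch.Tensor, reward: float) -> torch.Tensor:
    # logprobs: log pi_theta(a_t | s_t) for each generated token, shape (T,)
    # reward:   scalar reward for the whole response (no baseline subtracted)
    # Minimizing this loss is gradient ascent on E[R * log pi].
    return -(reward * logprobs.sum())

# Toy example with fake logits standing in for the policy model's output.
logits = torch.randn(5, 100, requires_grad=True)   # 5 tokens, vocab of 100
sampled = torch.randint(0, 100, (5,))               # pretend these tokens were sampled
logprobs = torch.log_softmax(logits, dim=-1)[torch.arange(5), sampled]
loss = policy_gradient_loss(logprobs, reward=1.0)
loss.backward()  # a high reward pushes up the probability of the sampled response
```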

PPO is a popular RL algorithm that was widely adopted on many tasks before the LLM era. From a practical perspective, it is compute-expensive, as it involves four models:

  • Policy model: the LLM we want to optimize
  • Reward model: the LLM predicting how good an answer is to a given question; it is fixed during RL/PPO training
  • Critic model: the LLM predicting how good a “state” is in the RL setting
    • Understanding this requires a deeper dive into the PPO algorithm.
    • It is not “fixed”; it is optimized along with the policy model during the RL/PPO training
  • Reference model: the initial SFT model used to compute the KL loss

Motivation

We want to reduce the cost of RL training:

  • Reward model: in some cases, the answer to a question is very easy to verify without needing a neural network (rule-based reward)
    • Math: we can just compare the numbers (except for proof problems)
    • Code: we can execute the solution code against the test cases
  • Critic model: in general, we just want to compute the policy gradient: $\nabla_{\theta} J(\theta) = \mathbb{E} [ Q^\pi (s, a) \nabla_\theta \ln \pi_\theta (a|s) ]$
    • There are many different policy gradient methods, i.e., many different ways to compute the policy gradient. They differ in how they estimate $Q^\pi (s, a)$ (see the numerical sketch after this list).
      • Monte Carlo: we can perform rollouts to estimate how good $(s, a)$ is, similar to MCTS; no neural model is needed.
        • $Q^\pi (s_t, a_t) = \mathbb{E}_\pi[ G_t | s_t, a_t ], ~ G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots$
      • Neural network learning: we learn a value function $V(s)$ to predict how good a state $s$ is and then we can use it to calculate $Q^\pi (s, a)$.
        • Temporal Difference: $\text{TD error} = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$ , where $V(s)$ is estimated by the critic model.
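
A tiny numerical sketch of the two estimators above (pure Python, illustrative values only):

```python
def monte_carlo_return(rewards, gamma=0.99):
    # G_t = r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ...
    # Needs a full rollout, but no learned model.
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def td_error(r_next, v_s, v_s_next, gamma=0.99):
    # TD error = r_{t+1} + gamma * V(s_{t+1}) - V(s_t)
    # V(s) comes from a learned critic model.
    return r_next + gamma * v_s_next - v_s

print(monte_carlo_return([0.0, 0.0, 1.0]))           # 0.9801
print(td_error(r_next=0.0, v_s=0.4, v_s_next=0.5))   # 0.095
```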

Therefore, if we focus on “easy-to-verify” problems, we can keep only the policy model and the reference model for RL training.
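
A minimal sketch of such a rule-based reward, combining a format reward and an accuracy reward (the tags follow the template in the Input and Output section; the exact rules used in the paper may differ):

```python
import re

def format_reward(response: str) -> float:
    # 1 if the response follows <think>...</think><answer>...</answer>.
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, response.strip(), re.DOTALL) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    # 1 if the last number inside <answer> matches the ground truth
    # (works for numeric math answers, not for proofs).
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if not m:
        return 0.0
    nums = re.findall(r"-?\d+(?:\.\d+)?", m.group(1))
    return 1.0 if nums and float(nums[-1]) == float(ground_truth) else 0.0

def total_reward(response: str, ground_truth: str) -> float:
    # Both terms are cheap, deterministic rules: no reward model is needed.
    return format_reward(response) + accuracy_reward(response, ground_truth)
```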

Technique: GRPO

https://arxiv.org/abs/2501.12948

If we ignore the clipping trick used to stabilize the training, the objective can be simplified as:

$$ \mathcal{J}(\theta) = {1 \over G} \sum_{i=1}^G A_i \pi_{\theta} (o_i | q) - \beta \mathbb{D}_{KL}(\pi_\theta || \pi_{\text{ref}}) $$
  • It is very similar to the general PG equation, as it is also a variant of PG.
    • $A$, the advantage, can be considered an alternative to $Q$ with lower variance, obtained by subtracting a baseline value.
      • The baseline here is estimated by the mean reward of a group of response candidates to a given question (see the sketch after this list).
    • KL is used to prevent the model from going too far away from the original one.
      • The model after RL is still a language model so it should not be so different from the initial one.
    • The clipping trick is the same as the one in PPO.
  • From a practical perspective, we only need to run two models (at minimum) during training: the policy model and the reference model.
  • total reward = format reward + accuracy reward
  • Compared to SFT: the LLM learns to maximize the reward by learning from its own explorations, rather than imitating existing data or the behavior of a stronger model.
https://arxiv.org/pdf/2402.03300
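
To make the simplified objective concrete, here is a sketch of the group-relative advantage and loss for a single question. It is a REINFORCE-style surrogate under simplifying assumptions: the importance-sampling ratio and clipping are omitted, and the KL term is applied per response rather than per token; the toy numbers and the `beta` value are illustrative only.

```python
import torch

def grpo_loss(logprobs, ref_logprobs, rewards, beta=0.04):
    # logprobs:     log pi_theta(o_i | q) summed over tokens, one per response, shape (G,)
    # ref_logprobs: the same quantity under the frozen reference model, shape (G,)
    # rewards:      rule-based rewards r_i for the G sampled responses, shape (G,)

    # Group-relative advantage: the baseline is the mean reward of the group,
    # so no critic model is needed.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # k3 KL estimator (as in GRPO): r - log(r) - 1 with r = pi_ref / pi_theta, always >= 0.
    log_ratio = ref_logprobs - logprobs
    kl = torch.exp(log_ratio) - log_ratio - 1.0

    # Maximize advantage-weighted log-likelihood; penalize drifting from the reference model.
    return -(advantages * logprobs - beta * kl).mean()

# Toy usage: G = 4 responses to one question, two of them correct.
logprobs = torch.tensor([-12.3, -15.1, -9.8, -11.0], requires_grad=True)
ref_logprobs = torch.tensor([-12.0, -14.9, -10.2, -11.3])
rewards = torch.tensor([2.0, 0.0, 2.0, 1.0])  # format reward + accuracy reward
grpo_loss(logprobs, ref_logprobs, rewards).backward()
```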

The Entire Training: A Mixture of Multiple Techniques and Stages

(Figure: grpo.drawio — overview diagram of the multi-stage training pipeline.)

Discussions

The “aha moment”

https://arxiv.org/abs/2501.12948

Where do the capabilities of search, verification, and backtracking come from?

  • These patterns must already exist in the base model. The RL training amplifies them.

  • In particular, researchers at Sea AI Lab (SAIL) showed that base models can be easily prompted to produce self-reflection and that the “aha” moment from the DeepSeek-R1 paper may be more a symptom of the base model than the RL optimisation process. (src)

Is GRPO the most effective solution?

Definitely not.

  • Using RL to improve LLM reasoning capability had been “common sense” among “insiders” for at least a year. OpenAI’s o1 is the first widely known, publicly released reasoning model based on RL.
  • GRPO is just a variant of PG. Compared with PPO, it requires fewer resources. But there are other similar variants that are also “cheap” (using MC to estimate $Q$).
    • People use other variants for various reasons. For example, REINFORCE++ has a very similar loss function, removing the division by the standard deviation used in GRPO and using the k1 KL loss (GRPO uses the k3 KL loss) to make training more stable (see the sketch below).
    • Some people say PPO is more stable during training.
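
For reference, here is a sketch of the two KL estimators mentioned above, using the common k1/k3 naming. Both take the log-probabilities of the sampled tokens under the current policy and the reference model; the function names are my own.

```python
import torch

def kl_k1(logprobs: torch.Tensor, ref_logprobs: torch.Tensor) -> torch.Tensor:
    # k1: plain log-ratio log(pi_theta / pi_ref).
    # Unbiased estimate of KL(pi_theta || pi_ref), but per-sample values can be
    # negative and the variance is higher.
    return logprobs - ref_logprobs

def kl_k3(logprobs: torch.Tensor, ref_logprobs: torch.Tensor) -> torch.Tensor:
    # k3: r - log(r) - 1 with r = pi_ref / pi_theta (the form used by GRPO).
    # Also unbiased, but always non-negative and lower variance.
    log_ratio = ref_logprobs - logprobs
    return torch.exp(log_ratio) - log_ratio - 1.0
```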

Is RL all you need?

Not yet.

  • Many capabilities are encoded into the LLM during pre-training. They do not train an LLM from scratch using RL.
  • There are lots of SFT-related stages in the training process of R1, because we want an LLM that can do more than coding and math problem solving.
    • An open question: how do the capabilities gained from rule-based RL training generalize to general tasks?
      • Even for math problems, there is no rule-based reward system to evaluate a proof.