Introduction to Code LLMs Training
If the math equations are not rendered correctly, please force-refresh the page.
Comments for the group sharing on 2025-02-13: we will quickly review the pre-training and post-training of LLMs for a broader audience, and then move on to RLHF and DeepSeek-related topics. We will not cover detailed RL algorithms or their mathematical formulations, as they take a lot of time to explain and understand.
Pretraining
Summary:
- Goal: To model the language distribution in the wild.
- Data: Existing data collected from the Internet, e.g., open-source code from GitHub Repos.
- Loss function: next-token prediction, cross-entropy loss: $-\sum_{c=1}^M y_c \log p_c$ .
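For concreteness, here is a minimal sketch (not part of any specific training recipe) of the next-token prediction loss, assuming a causal LM's logits and PyTorch:

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """logits: (batch, seq_len, vocab); input_ids: (batch, seq_len).
    Predict token t+1 from the prefix up to t, then average the cross-entropy."""
    shift_logits = logits[:, :-1, :]   # drop the last position's prediction
    shift_labels = input_ids[:, 1:]    # drop the first token as a target
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```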
Literature:
- OLMo: Fully open-sourced LLM pre-training from UW/AI2.
Post-training
Summary:
Motivation: Pre-trained LLMs encode massive knowledge. However, they cannot “chat” with users smoothly: the pre-training objective endows the model with the capability of continuing a given text, rather than chatting and answering questions.
An example of a base model:
```python
from transformers import pipeline

# Base (pre-trained only) model: it simply continues the given text.
# return_full_text=False prints only the continuation, matching the outputs below.
pp = pipeline(model='allenai/OLMoE-1B-7B-0125', return_full_text=False)
while True:
    inp = input("Input: ")
    out = pp(inp)
    print(out)
    print('=' * 32)
'''
Input: The most exciting AI advancement
[{'generated_text': ' in the last few years has been the rise of deep learning. Deep learning is a type of machine'}]
================================
Input: How are you?
[{'generated_text': '\n\nHow are you?\n\nHow are you?\n\nHow are you?\n\n'}]
================================
Input: Please help me solve this equation, x^2 + 2x + 1 = 4 .
[{'generated_text': '\n\n2. Originally Posted by johnsy123\nSolve the equation, x^2 +'}]
'''
```
An example of an instruct model:
```python
from transformers import pipeline

# Instruct (post-trained) model: inputs are given in chat format.
pp = pipeline(model='allenai/OLMoE-1B-7B-0125-Instruct', return_full_text=False)
while True:
    inp = input("Input: ")
    out = pp([{
        'role': 'user',
        'content': inp,
    }])
    print(out)
    print('=' * 32)
'''
Input: The most exciting AI advancement
[{'generated_text': 'Determining the "most exciting AI advancement" is highly subjective and depends on various factors, including personal'}]
================================
Input: How are you?
[{'generated_text': "As an AI, I don't have feelings or personal experiences, but I'm functioning correctly and ready"}]
================================
Input: Please help me solve this equation, x^2 + 2x + 1 = 4 .
[{'generated_text': 'To solve the equation \\(x^2 + 2x + 1 = 4\\), we first simplify'}]
'''
```
Goal:
- Alignment:
  - Functionality: follow the instructions by users; chat with users in a friendly way
  - Safety: avoid generating harmful content
- Performance enhancement: specialize the LLMs in specific domains, such as math, coding, medical QA, etc.
Literature:
- Tulu 3: Fully open-sourced LLM post-training from UW/AI2.
Supervised Fine-tuning
Summary:
Loss function: (still) next-token prediction, cross-entropy loss.
Data:
Format: “chat format”
Chat template associated with a model and its tokenizer:
```python
from transformers import AutoTokenizer

input_message = 'Hello, how are you?'
input_chat = [
    {
        'role': 'user',
        'content': input_message,
    }
]
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3.1-8B-Instruct')
# The chat template is a Jinja template stored with the tokenizer.
print(tokenizer.get_chat_template())
# Render the chat into the plain-text format the model is trained on.
applied_msgs = tokenizer.apply_chat_template(input_chat, tokenize=False)
print(applied_msgs)
'''
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024
<|eot_id|><|start_header_id|>user<|end_header_id|>
Hello, how are you?<|eot_id|>
'''
completed_chat = [
    {
        'role': 'user',
        'content': input_message,
    },
    {
        'role': 'assistant',
        'content': 'I am fine, thank you!',
    }
]
print(tokenizer.apply_chat_template(completed_chat, tokenize=False))
'''
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024
<|eot_id|><|start_header_id|>user<|end_header_id|>
Hello, how are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
I am fine, thank you!<|eot_id|>
'''
```
Sources: real-world data (possibly with some transformations) or synthetic data
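In practice, the SFT loss is usually computed only on the assistant tokens of such chat-formatted examples; the prompt tokens are masked out with the -100 ignore index of PyTorch cross-entropy. A minimal sketch, assuming the prompt rendering is a strict prefix of the full rendering:

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3.1-8B-Instruct')
chat = [
    {'role': 'user', 'content': 'Hello, how are you?'},
    {'role': 'assistant', 'content': 'I am fine, thank you!'},
]

# Token ids of the full conversation, and of the prompt-only prefix
# (up to and including the assistant header).
full_ids = tokenizer.apply_chat_template(chat, tokenize=True, return_tensors='pt')
prompt_ids = tokenizer.apply_chat_template(
    chat[:1], tokenize=True, add_generation_prompt=True, return_tensors='pt')

# Labels: copy of the input ids with prompt positions set to -100 (the ignore
# index of cross-entropy), so the loss is computed only on the assistant tokens.
labels = full_ids.clone()
labels[:, :prompt_ids.shape[1]] = -100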
Synthetic Data Generation
- Motivation: Existing Internet data is being exhausted; we need to create new data to further improve LLMs.
- Technique (a minimal sketch appears below):
  - For a specific domain we want to optimize for, generate an SFT dataset $D$ = { (user_message, model_response) }.
  - Select an existing model $M_g$ to generate $D$.
  - Use $D$ to SFT a model $M_s$ to get a better one: $M_s'$.
- Discussion:
  - When $M_g$ is a more powerful model than $M_s$, we are actually distilling the capabilities of $M_g$ into $M_s$.
  - Off-policy data: $D$ comes from $M_g$, so it is in the distribution of $M_g$, but off the distribution of $M_s$.
  - On-policy data: $M_g = M_s$.
  - Self-Instruct: generate $D$ from $M_s$, apply some filtering criteria to get $D' \subseteq D$, and use it to SFT $M_s$ to get $M_s'$.
  - Sometimes self-instruct can be better because it uses on-policy data.
- Challenges:
  - How to create diverse data covering as many cases as possible?
  - How to verify the quality of generated data?
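A minimal sketch of this generate-then-SFT pipeline, reusing the OLMoE instruct model from earlier as a stand-in for $M_g$ (the seed prompts are placeholders and the filtering step is omitted):

```python
from transformers import pipeline

# Stand-in for M_g; in a distillation setup this would be a stronger teacher model.
generator = pipeline(model='allenai/OLMoE-1B-7B-0125-Instruct', return_full_text=False)

# Placeholder seed prompts for the target domain (here: coding).
seed_prompts = [
    'Write a Python function that reverses a singly linked list.',
    'Explain what a race condition is and give a small Python example.',
]

sft_dataset = []
for user_message in seed_prompts:
    out = generator([{'role': 'user', 'content': user_message}], max_new_tokens=256)
    sft_dataset.append({
        'user_message': user_message,
        'model_response': out[0]['generated_text'],
    })
# D = {(user_message, model_response)}; filtering / rejection sampling would
# normally be applied before SFT-ing M_s on this data.
```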
Literature:
- OSS-Instruct: generating synthetic coding data for SFT
  - Motivation: if you just prompt an LLM to generate coding tasks, e.g., LeetCode-like problems with reference solutions, it cannot reliably produce novel (problem, solution) pairs.
    - Related issue: model collapse
  - Technique: inject some “noise” into the generation process to diversify the generated content (a rough sketch follows this list).
    - Randomly select open-source code snippets from Internet data.
    - Instruct an LLM to design a coding task inspired by the randomly selected code snippet, and also generate a corresponding solution for the task.
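A rough sketch of this idea; the seed snippets and prompt wording below are illustrative placeholders, not the actual OSS-Instruct prompt, and the OLMoE instruct model again stands in for the generator:

```python
import random
from transformers import pipeline

generator = pipeline(model='allenai/OLMoE-1B-7B-0125-Instruct', return_full_text=False)

# Placeholder pool standing in for randomly sampled open-source snippets.
seed_snippets = [
    'def binary_search(arr, target):\n    lo, hi = 0, len(arr) - 1\n    ...',
    'class LRUCache:\n    def __init__(self, capacity):\n        ...',
]

snippet = random.choice(seed_snippets)  # the injected "noise"
prompt = (
    'Here is a random code snippet:\n\n'
    f'{snippet}\n\n'
    'Inspired by this snippet, design a new self-contained coding problem '
    'and provide a complete reference solution.'
)
out = generator([{'role': 'user', 'content': prompt}], max_new_tokens=512)
print(out[0]['generated_text'])  # a new (problem, solution) pair to post-process
```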
Common Practice: synthetic data generation + rejection sampling + SFT
- Synthetic data generation: OSS-Instruct
- Rejection sampling: use some oracles to verify the quality of the generated data.
  - Example of NL2Code: during data synthesis, also generate test cases, yielding (problem, solution, test cases) triples; then run the code against the test cases and only keep the triples whose solutions pass (see the sketch after this list).
  - Example of execution reasoning: given a program and an input, predict the output value.
    - Instruct an existing model to generate the CoT reasoning process and the predicted output value.
    - Only select the responses whose predicted output values match the ground truth (we can execute the programs on the inputs to obtain the ground-truth output values).
- Model training: do SFT on a base model
- Evaluation:
  - NL2Code: EvalPlus (HumanEval, MBPP), LiveCodeBench, etc.
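A minimal sketch of the NL2Code-style rejection-sampling filter mentioned above (the triple format and the direct subprocess execution are illustrative; a real pipeline would sandbox the execution):

```python
import subprocess
import sys
import tempfile

def passes_tests(solution_code: str, test_code: str, timeout: float = 10.0) -> bool:
    """Run a candidate solution against its generated test cases in a fresh process."""
    with tempfile.NamedTemporaryFile('w', suffix='.py', delete=False) as f:
        f.write(solution_code + '\n\n' + test_code + '\n')
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False

# `synthetic_triples` is a placeholder for generated dicts: {'problem', 'solution', 'tests'}.
synthetic_triples = []
kept = [t for t in synthetic_triples if passes_tests(t['solution'], t['tests'])]
```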
RLHF
RL

Summary:
- Formulation:
  - The agent lives in an environment with an initial state $s_0$.
  - The agent performs an action $a_0$ based on its policy $\pi(a | s)$.
  - The action changes the environment, i.e., a state transition from $s_0$ to $s_1$.
  - By performing $a_0$, the agent receives a reward $r_0$.
- Learning objective:
  - RL optimizes the policy to maximize the total reward.
  - vs. SFT: SFT optimizes the “policy” so that it imitates another “policy” (the distribution of the demonstration data).
  - Loss function: policy gradient: $\nabla_{\theta} J(\theta) = \mathbb{E} [ Q^\pi (s, a) \nabla_\theta \ln \pi_\theta (a|s) ]$
    - Intuitive understanding: when following the policy $\pi$, if $(s, a)$ is good, then we maximize its log probability; if $(s, a)$ is bad, then we minimize its log probability.
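As a minimal sketch of the corresponding update, the REINFORCE-style loss below treats a whole sampled response as one trajectory, with a scalar reward standing in for $Q^\pi(s, a)$; real RLHF algorithms such as PPO add baselines, clipping, and a KL penalty:

```python
import torch

def reinforce_loss(token_logprobs: torch.Tensor, reward: float) -> torch.Tensor:
    """token_logprobs: log pi_theta(a_t | s_t) for each generated token, shape (seq_len,).
    Minimizing this raises the log-probabilities of positively rewarded trajectories
    and lowers them for negatively rewarded ones."""
    return -(reward * token_logprobs.sum())

# Toy example: three token probabilities from one sampled response.
probs = torch.tensor([0.4, 0.7, 0.5], requires_grad=True)
loss = reinforce_loss(torch.log(probs), reward=1.0)
loss.backward()  # gradients point toward increasing these tokens' probabilities
```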
RLHF
- Reward Model (schematic pseudocode):
```python
# Interface: the reward model maps (user_input, model_response) to a scalar score.
r = RewardModel((user_input, model_response)); assert isinstance(r, float)
# Training data: candidate responses sampled from the SFT model, scored by human annotators.
response_candidates: List[str] = SFTModel(user_input)
candidate_scores: List[float] = [HumanAnnotator((user_input, candidate_response)) for candidate_response in response_candidates]
# Fit the reward model to the human-annotated scores.
RewardModel = SFT(InitRewardModel, response_candidates, candidate_scores)
```
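In practice the reward model is typically a transformer with a scalar head. A minimal sketch of the scoring interface, using a small placeholder backbone whose head is randomly initialized here and would still need to be fitted on the human scores (real reward models are usually initialized from the SFT model itself):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder backbone with a 1-dimensional (scalar) regression head.
rm_name = 'distilbert-base-uncased'
rm_tokenizer = AutoTokenizer.from_pretrained(rm_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(rm_name, num_labels=1)

def reward(user_input: str, model_response: str) -> float:
    # Score the (user_input, model_response) pair with the scalar head.
    inputs = rm_tokenizer(user_input, model_response, return_tensors='pt', truncation=True)
    with torch.no_grad():
        return reward_model(**inputs).logits.squeeze().item()
```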
- RLHF Training:
  - Sample a user prompt from the datasets.
  - The SFT-ed model generates a response.
  - The response is scored by the reward model.
  - The reward signal is used to optimize the SFT-ed model.

