The per-post word limit is currently 90,000, and it has overflowed again; earlier parts of the series:
- RL4LLM (Part 1)
- RL4LLM (Part 2)
Table of Contents
- 14 [RL4LLM] base vs. instruct model, custom chat template (make prefix)
- completion vs. chat
- base model inference
- basics
- vllm inference
- custom chat template
- no think
- 15 [veRL] Understanding training parameters from first principles: PPO & GRPO, batch size, KL & entropy
- 1 PPO & GRPO
- 2 batch size
- 3 Other (metrics)
- KL Loss
- entropy
- 4 Getting it running, and running fast
- 16 [veRL] FSDP SFT trainer, SFT vs. RL, cross-entropy loss | loss mask | learning rate scheduler
- data & prompt
- 2 training process
- 3 training loss
- 4 why -log P
- 5 overfitting
- 6 learning rate scheduler
- 7 SFT vs. RL
- 17 [veRL] fsdp sft trainer follow-up: teacher forcing, shift labels / shift logits, loss mask
- 1 tokenize
- 2 padding
14 [RL4LLM] base vs. instruct model, custom chat template (make prefix)
- https://www.bilibili.com/video/BV1JZLcz4EUC
- https://github.com/chunhuizhang/llm_rl/blob/main/tutorials/tokenizer/base_instruct.ipynb
- https://github.com/chunhuizhang/llm_rl/blob/main/tutorials/tokenizer/template_make_prefix.ipynb
This episode is mainly about how to get a completion (base) model to do QA.
completion vs. chat
- Q/A, U/A, User/Assistant
- a base model has no concept of identity (roles);
  - it is a language model in the strict sense: next-token prediction (word chaining)
- to get it to answer QA-style questions, define the roles inside the prompt (and set a max response length, stop words, etc.);
prompt = f"Q: {question}\nA:"
# you can also try few-shot: provide a few examples
prompt = f"""
Q: What is the capital of Spain?
A: Madrid
Q: What is the capital of Germany?
A: Berlin
Q: {question}
A:
"""
prompt = f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
prompt += f"<|im_start|>user\n{question}<|im_end|>\n"
prompt += "<|im_start|>assistant\n" # 模型将从这里开始生成
from transformers import AutoTokenizer
base_tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-3B')
instruct_tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-3B-Instruct')
print(base_tokenizer.chat_template)
def make_prefix(numbers, target, template_type):
    # NOTE: also need to change reward_score/countdown.py
    if template_type == 'base':
        # follow deepseek-r1-zero
        """This works for any base model"""
        prefix = f"""A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer.
User: Using the numbers {numbers}, create an equation that equals {target}. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final answer in <answer> </answer> tags, for example <answer> (1 + 2) / 3 </answer>.
Assistant: Let me solve this step by step.
<think>"""
    elif template_type == 'qwen-instruct':
        """This works for Qwen Instruct Models"""
        prefix = f"""<|im_start|>system\nYou are a helpful assistant. You first thinks about the reasoning process in the mind and then provides the user with the answer.<|im_end|>\n<|im_start|>user\n Using the numbers {numbers}, create an equation that equals {target}. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final answer in <answer> </answer> tags, for example <answer> (1 + 2) / 3 </answer>.<|im_end|>\n<|im_start|>assistant\nLet me solve this step by step.\n<think>"""
    return prefix
numbers = [ 44, 19, 35 ]
target = 99
base_prompt = make_prefix(numbers, target, 'base')
print(base_prompt)
"""
A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer.
User: Using the numbers [44, 19, 35], create an equation that equals 99. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final answer in <answer> </answer> tags, for example <answer> (1 + 2) / 3 </answer>.
Assistant: Let me solve this step by step.
<think>
"""
instruct_prompt = make_prefix(numbers, target, 'qwen-instruct')
print(instruct_prompt)
"""
<|im_start|>system
You are a helpful assistant. You first thinks about the reasoning process in the mind and then provides the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [44, 19, 35], create an equation that equals 99. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final answer in <answer> </answer> tags, for example <answer> (1 + 2) / 3 </answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>
"""
base model inference
from vllm import LLM, SamplingParams
sampling_params = SamplingParams(
temperature=0.6,
max_tokens=1024
)
base_llm = LLM(model='Qwen/Qwen2.5-3B', max_model_len=1024)
base_resp = base_llm.generate(base_prompt, sampling_params)[0]
print(base_resp.outputs[0].text)
"""
We need to use the numbers 44, 19, and 35 exactly once to create an equation that equals 99. We can use basic arithmetic operations like addition, subtraction, multiplication, and division. Let's start by looking for patterns or combinations of the numbers that could add up to 99. One way to approach this is to try different operations or combinations of the numbers. </think>
The final answer is: <answer> 44 + 35 + 19 = 99 </answer>
"""
# a bare prefix: the base model simply free-runs into pretraining-style text
test_resp = base_llm.generate('The capital of China is', sampling_params)[0]
print(test_resp.outputs[0].text)
"""
Beijing.____
A. The capital of China is Beijing.
B. Beijing is the capital of China.
C. The capital of China is Beijing.
D. Beijing is the capital of China.
Answer:
D
The most abundant element in the Earth's crust is ____
A. Oxygen
B. Silicon
C. Aluminum
D. Iron
Answer:
A
Which of the following explanations of the emphasized words in the sentences is incorrect?
A. The reason why loyal ministers and virtuous officials dare not speak, and the reason why fools and traitors dare to speak, is because they are afraid of being punished. Punishment: Punishment.
B. If you want to know the truth, I will tell you. Know: Understand.
C. In the morning, I cross the river and settle in the west, and by nightfall, I am in the east. Cross: Cross.
D. The reason why the old man was able to survive and not perish is the same as me. Pity: Like.
Answer:
A
The starting point of human life is ____
A. Fertilized egg
B. Embryo
C. Infant
D. Newborn
Answer:
A
The solution set for the inequality x^{2}-2x-3>0 is ____
A. (-1, 3)
B. (-∞, -1) ∪ (3, +∞)
C. (-3, 1)
D. (-∞, -3) ∪ (1, +∞)
Answer:
B
The following table shows the number of naval and air force officers and engineers in the North China Military District from 1948 to 1949. This table reflects that the People's Liberation Army ____. | Year | Number of Naval and Air Force Officers and Engineers | | --- | --- | | 1948 | 2,804 | | 1949 | 3,363 |
A. Gradually expanded its scale
B. Won many victories in the southern theater
C. Had a relatively strong combat capability
D. Effectively thwarted the Nationalist army's rearward defense strategy
Answer:
C
"""
test_resp = base_llm.generate('My name is', sampling_params)[0]
print(test_resp.outputs[0].text)
"""
Tom. I am a student. I am in Class Two, Grade Eight. This is my friend, Jack. He is a student, too. He is in Class One, Grade Eight. My Chinese teacher is Mr. Zhang. He is a good teacher. He likes us very much. My English teacher is Miss. Wang. She is very young. She is good with us. She likes us, too. We like them. 根据短文内容,判断正误(正确的写"正确",错误的写"错误")。 (1). 2. Miss. Wang is a good Chinese teacher. (2). 3. Tom is in Class Two, Grade Eight. (3). 4. Mr. Zhang is Tom's English teacher. (4). 5. Jack and Tom are in the same class. (5). 1. Jack is a student, too.
【小题1】错误 【小题2】正确 【小题3】正确 【小题4】错误 【小题5】错误
根据汉语意思完成句子。 【 1 】 这个房间是用空气新鲜的木材做的。 This room is made of ___________. 【 2 】 我们必须阻止人们在森林里砍伐树木。 We must _______________ people from cutting down trees in the forest. 【 3 】 请不要把纸屑扔在地板上。 Please don't ___________ the paper on the floor. 【 4 】 环保对我们来说非常重要。 It is ___________ for us to protect the environment. 【 5 】 为了保护我们美丽的地球,我们不能乱扔垃圾。 We can't ___________ rubbish because we must protect our beautiful earth.
【 1 】 fresh air 【 2 】 stop 【 3 】 throw away 【 4 】 important 【 5 】 throw away
阅读下面的文字,完成下列小题。 雪山 谢大立 10月25日,是红军长征胜利70周年的日子。 一大早,我们一行就匆匆地赶到了雪山脚下。 1936年10月,红军三大主力在甘肃会宁胜利会师,宣告长征胜利结束。但是,虽然红军主力在陕北会师,但还有几支红军队伍在雪山、草地里艰难行军,这一路上,究竟会有多少红军战死在这崇山峻岭中,又有多少红军战士被饥饿折磨得骨瘦如柴,这一切,已永远地被埋葬在万古长青的雪山之上了。 1936年10月,是红军长征胜利70周年的日子。 一大早,我们一行就匆匆地赶到了雪山脚下。 1936年10月,红军三大主力在甘肃会宁胜利会师,宣告长征胜利结束。但是,虽然红军主力在陕北会师,但还有几支红军队伍在雪山、草地里艰难行军,这一路上,究竟会有多少红军战死在这崇山峻岭中,又有多少红军战士被饥饿折磨得骨瘦如柴,这一切,已永远地被埋葬在万古长青的雪山之上了。 1936年10月,是红军长征胜利70周年的日子。 一大早,我们一行就匆匆地赶到了雪山脚下。 1936年10月,红军三大主力在甘肃会宁胜利会师,宣告长征胜利结束。但是,虽然红军主力在陕北会师,但还有几支红军队伍在雪山、草地里艰难行军,这一路上,究竟会有多少红军战死在这崇山峻岭中,又有多少红军战士被饥饿折磨得骨瘦如柴,这一切,已永远地被埋葬在万古长青的雪山之上了。 1936年10月,是红军长征胜利70周年的日子。 一大早,我们一行就匆匆地赶到了雪山脚下。 1936年10月,红军三大主力在甘肃会宁胜利会师,宣告长征胜利结束。但是,虽然红军主力在陕北会师,但还有几支红军队伍在雪山、草地里艰难行军,这一路上,究竟会有多少红军战死在这崇山峻岭中,又有多少红军战士被饥饿折磨得骨瘦如柴,这一切,已永远地被埋葬在万古长青的雪山之上了。 1936年10月,是红军长征胜利70周年的日子。 一大早,我们一行就匆匆地赶到了雪山脚下。 1936年10月,红军三大主力在甘肃会宁胜利会师,宣告长征胜利结束。但是,虽然红军主力在陕北会师,但还有几支红军队伍在雪山、
"""
test_resp = base_llm.generate('Long long ago, there', sampling_params)[0]
print(test_resp.outputs[0].text)
"""
was a little girl who loved to play in the house. She picked up everything. She put it away, and then she picked it up again. She put it away, and then she picked it up again. Finally, her mother said, "I'm going to put a sign on the door. Then you won't be able to come in any more." "What sign, Mom?" "It'll say, 'Out of Order'," said her mother. "Oh," said the little girl. Then she went and hid under the bed. A few minutes later, her mother called her, "Come in here." The little girl came out from under the bed. "What's wrong, Mom?" "I put the sign on the door," said her mother, "and I can't open it." 【小题1】The little girl picked up everything because she wanted to put it away. 【小题2】The little girl put it away because her mother asked her to do so. 【小题3】The little girl was very angry with her mother. 【小题4】The mother didn't want to play with the little girl. 【小题5】The mother could not open the door because the sign was on it. 【小题1】T 【小题2】F 【小题3】T 【小题4】T 【小题5】T
阅读下面的文章,完成后面题目。 《红楼梦》中女性形象的复杂性 一、《红楼梦》中女性形象的复杂性 《红楼梦》中人物众多,女性形象更是丰富多彩。 《红楼梦》中女性形象的复杂性,主要表现在以下方面: 1.女性的阶级性。阶级是社会上最本质、最直接的差别。《红楼梦》中女性形象的阶级性,主要表现在她们所处的社会地位的不同。《红楼梦》中女性形象的阶级性,是决定其性格的重要因素,也是决定其命运的重要因素。 2.女性的性别特征。《红楼梦》中女性形象的性别特征,主要表现在其在性别方面所特有的差异上。 3.女性的文学性。文学性是指作品中人物形象所具有的审美价值和艺术魅力。《红楼梦》中女性形象的文学性,主要表现在以下方面:①《红楼梦》中女性形象的典型性。②《红楼梦》中女性形象的艺术性。 4.女性的象征性。《红楼梦》中女性形象的象征性,主要表现在两个方面:①女性形象的隐喻性。②女性形象的隐喻性。 《红楼梦》中女性形象的复杂性,是个性与共性的统一。个性是指《红楼梦》中女性形象所具有的特殊性。共性是指《红楼梦》中女性形象所具有的普遍性,即《红楼梦》中女性形象所具有的共有的品格、气质、思想、性格等。 总之,《红楼梦》中女性形象的复杂性,是个性与共性的统一,是人物形象与社会现实的统一,是人物形象与民族心理的统一。 (选自《红楼梦论丛》,有改动) 1.下列对《红楼梦》中女性形象复杂性的理解,不正确的一项是 A.《红楼梦》中女性形象的复杂性,主要表现在她们所处的社会地位的不同。 B.《红楼梦》中女性形象的阶级性,是决定其性格和命运的重要因素。 C.《红楼梦》中女性形象的性别特征,主要表现在其在性别方面所特有的差异上。 D.《红楼梦》中女性形象的文学性,主要表现在其典型性和艺术性。 2.下列对《红楼梦》中女性形象复杂性的理解,不正确的一项是 A.《红楼梦》中女性形象的复杂性,是《红楼梦》中人物形象与社会现实的统一。 B.《红楼梦》中女性形象的复杂性,是《红楼梦》中人物形象与民族心理的统一。 C.《红楼梦》中女性形象的复杂性,是《红楼梦》中人物形象个性与共性的统一。 D.《红楼梦》中女性形象的复杂性,是《红楼梦》中人物形象与《红楼梦》中社会现实的统一。 3.下列对《红楼梦》中女性形象复杂性的理解,不正确的一项是 A.《红楼梦》中女性形象的复杂性,是《红楼梦》中人物形象与《红楼梦》中民族心理的统一。 B.《红楼梦》中女性形象的复杂性,是《红楼梦》中人物形象与《红楼梦》中社会现实的统一。 C.《红楼梦》中女性形象的复杂性,是《红楼梦》中人物形象与《红楼梦》中文学性的统一。 D.《红楼梦》中女性形象的复杂性,是《红楼梦》中人物形象与《红楼梦》中阶级
"""
test_resp = base_llm.generate(instruct_prompt, sampling_params)[0]
print(test_resp.outputs[0].text)
"""
First, I need to find a way to use the numbers 35 and 19 to get close to 99. I can start by adding 35 and 19, which gives me 54. Then, I can subtract 54 from 99, which gives me 45. Now, I need to find a way to get from 45 to 44. I can subtract 45 by 1, which gives me -1. But that doesn't work because I can't use -1 as a number in my equation. So, I need to find another way to get from 45 to 44. I can divide 45 by 1.1, which gives me 40.90909090909091. Then, I can subtract 40.90909090909091 by 0.9090909090909091, which gives me 40. Now, I need to find a way to get from 40 to 44. I can multiply 40 by 1.1, which gives me 44. But that doesn't work because I can't use 1.1 as a number in my equation. So, I need to find another way to get from 40 to 44. I can add 40 by 0.4, which gives me 40.4. Then, I can subtract 40.4 by 0.4, which gives me 40. But that doesn't work because I can't use 40 as a number in my equation. So, I need to find another way to get from 40 to 44. I can add 40 by 0.4, which gives me 40.4. Then, I can subtract 40.4 by 0.4, which gives me 40. But that doesn't work because I can't use 40 as a number in my equation. So, I need to find another way to get from 40 to 44. I can add 40 by 0.4, which gives me 40.4. Then, I can subtract 40.4 by 0.4, which gives me 40. But that doesn't work because I can't use 40 as a number in my equation. So, I need to find another way to get from 40 to 44. I can add 40 by 0.4, which gives me 40.4. Then, I can subtract 40.4 by 0.4, which gives me 40. But that doesn't work because I can't use 40 as a number in my equation. So, I need to find another way to get from 40 to 44. I can add 40 by 0.4, which gives me 40.4. Then, I can subtract 40.4 by 0.4, which gives me 40. But that doesn't work because I can't use 40 as a number in my equation. So, I need to find another way to get from 40 to 44. I can add 40 by 0.4, which gives me 40.4. Then, I can subtract 40.4 by 0.4, which gives me 40. But that doesn't work because I can't use 40 as a number in my equation. So, I need to find another way to get from 40 to 44. I can add 40 by 0.4, which gives me 40.4. Then, I can subtract 40.4 by 0.4, which gives me 40. But that doesn't work because I can't use 40 as a number in my equation. So, I need to find another way to get from 40 to 44. I can add
"""
basics
- prompt vs. response
  - prompt: resp.prompt, resp.prompt_token_ids
  - response: resp.outputs[0].text, resp.outputs[0].token_ids
- make_prefix (TinyZero)
  - https://github.com/Jiayi-Pan/TinyZero/blob/main/examples/data_preprocess/countdown.py#L57-L66
prompt = tokenizer.apply_chat_template(basic_messages, tokenize=False)
# '<|begin▁of▁sentence|><|User|>3.11 and 3.9 which is bigger? Please reason step by step, and put your final answer within \\boxed{}.'
prompt = tokenizer.apply_chat_template(basic_messages, tokenize=False, add_generation_prompt=True)
# '<|begin▁of▁sentence|><|User|>3.11 and 3.9 which is bigger? Please reason step by step, and put your final answer within \\boxed{}.<|Assistant|><think>\n'
# custom (chat template edited so the generation prompt does not force <think>)
prompt = tokenizer.apply_chat_template(basic_messages, tokenize=False, add_generation_prompt=True)
# '<|begin▁of▁sentence|><|User|>3.11 and 3.9 which is bigger? Please reason step by step, and put your final answer within \\boxed{}.<|Assistant|>'
# custom no think (chat template edited to pre-fill an empty think block)
prompt = tokenizer.apply_chat_template(basic_messages, tokenize=False, add_generation_prompt=True)
# '<|begin▁of▁sentence|><|User|>3.11 and 3.9 which is bigger? Please reason step by step, and put your final answer within \\boxed{}.<|Assistant|><think>\n</think>'
- load the parquet dataset
  - https://github.com/Jiayi-Pan/TinyZero/blob/main/verl/utils/dataset/rl_dataset.py#L128
- default
  - https://github.com/volcengine/verl/blob/main/verl/utils/dataset/rl_dataset.py#L169
  - prompt_with_chat_template = self.tokenizer.apply_chat_template(chat, add_generation_prompt=True, tokenize=False)
- generate & reward func
- reward func: decode prompt + response, then score (a hypothetical scoring sketch follows the snippet below)
sequences = torch.cat((valid_prompt_ids, valid_response_ids))
sequences_str = self.tokenizer.decode(sequences)
score = compute_score_fn(solution_str=sequences_str, ground_truth=ground_truth)
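The actual scoring logic lives in TinyZero's reward_score/countdown.py; below is only a rough, hypothetical sketch of what such a compute_score_fn could look like for the countdown task (the 0.1/1.0 format/answer scores and the parsing details are illustrative assumptions, not the repo's exact code):

import re

def compute_score_fn(solution_str: str, ground_truth: dict) -> float:
    # hypothetical countdown-style reward: parse <answer>...</answer>, then
    # check number usage and the arithmetic result
    match = re.search(r"<answer>(.*?)</answer>", solution_str, re.DOTALL)
    if match is None:
        return 0.0  # no parsable answer => zero reward
    equation = match.group(1).strip()
    try:
        used = sorted(int(n) for n in re.findall(r"\d+", equation))
        if used != sorted(ground_truth["numbers"]):
            return 0.1  # right format, wrong number usage
        value = eval(equation, {"__builtins__": {}}, {})  # arithmetic expression only
        return 1.0 if abs(value - ground_truth["target"]) < 1e-6 else 0.1
    except Exception:
        return 0.0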
from transformers import AutoTokenizer
import re
import torch
model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
basic_messages = [
{"role": "user", "content": "3.11 and 3.9 which is bigger? Please reason step by step, and put your final answer within \\boxed{}."}
]
tokenizer.apply_chat_template(basic_messages, tokenize=False)
tokenizer.apply_chat_template(basic_messages, tokenize=False, add_generation_prompt=True)
vllm inference
Running a DeepSeek-R1 distill under ollama and asking it to reason about which of 9.11 and 9.9 is bigger, the output sometimes contains no think block at all.
from vllm import LLM, SamplingParams
sampling_params = SamplingParams(
temperature=0.6,
max_tokens=32768
)
llm = LLM(model=model_id, max_model_len=32768)
prompt = tokenizer.apply_chat_template(basic_messages, tokenize=False, add_generation_prompt=True)
prompt
resp = llm.generate(prompt, sampling_params=sampling_params)[0]
print(resp.prompt)
print(resp.prompt_token_ids)
assert tokenizer.encode(resp.prompt) == resp.prompt_token_ids
tokenizer.decode(151646), tokenizer.decode(7810)
len(resp.outputs[0].token_ids), len(tokenizer.encode(resp.outputs[0].text))
custom chat template
prompt = tokenizer.apply_chat_template(basic_messages, tokenize=False)
resp = llm.generate(prompt, sampling_params=sampling_params)[0]
print(resp.outputs[0].text)
"""
**
</think>
To determine which number is bigger between **3.11** and **3.9**, follow these steps:
1. **Compare the whole number part** of both numbers. Both have **3** as the whole number.
2. **Compare the decimal parts**:
- **0.11** (from 3.11)
- **0.9** (from 3.9, which can be written as 3.90)
3. **Convert 3.9 to two decimal places**: 3.90
4. **Compare 0.11 and 0.90**:
- **0.11** is less than **0.90**
5. **Conclusion**: Since 0.11 is less than 0.90, **3.90** is larger than **3.11**.
**Final Answer**: \boxed{3.9}
"""
prompt = tokenizer.apply_chat_template(basic_messages, tokenize=False, add_generation_prompt=True)
resp = llm.generate(prompt, sampling_params=sampling_params)[0]
prompt = '<|begin▁of▁sentence|><|User|>3.11 and 3.9 which is bigger? Please reason step by step, and put your final answer within \\boxed{}.<|Assistant|>'
resp = llm.generate(prompt, sampling_params=sampling_params)[0]
print(resp.outputs[0].text)
"""
<think>
Alright, so I've got this problem here: 3.11 and 3.9, and I need to figure out which one is bigger. Hmm, okay. Let me think about how to approach this. I'm pretty sure that when comparing decimals, you start from the left and compare each digit one by one. So, first, I should look at the whole number part of both numbers.
Both 3.11 and 3.9 have the same whole number part, which is 3. That means the whole numbers are equal, so I can't say one is bigger just yet. I need to look at the decimal parts.
The first decimal place after the decimal point is the tenths place. In 3.11, the tenths place is 1, and in 3.9, the tenths place is 9. Since 9 is greater than 1, that means 3.9 is larger than 3.11. Wait, let me make sure I'm doing this right.
So, if I write both numbers aligned by their decimal points:
3.11
3.9
I can think of 3.9 as 3.90 to make the comparison easier. Now, comparing 3.11 and 3.90. The first digit after the decimal is 1 vs. 9. Since 9 is bigger, 3.90 is bigger than 3.11. Yeah, that makes sense.
Another way to think about it is to subtract the smaller number from the larger one. If the result is positive, then the first number is bigger. So, 3.90 minus 3.11 is 0.79, which is positive, so 3.90 is indeed bigger.
Wait, but what if the numbers were, say, 3.11 and 3.99? Then, the tenths place is 1 vs. 9, so 3.99 would still be bigger. But in this case, since the tenths place is only 1 for 3.11, it's clear that 3.9 has a higher tenths place.
I also remember that when comparing decimals, you can add a zero to the shorter number to make them the same length. So, 3.9 becomes 3.90, and then comparing 3.11 and 3.90 is straightforward.
Is there any chance I might have made a mistake here? Maybe if I misaligned the decimals or added incorrectly. Let me try another approach. I can convert both numbers to fractions.
3.11 is equal to 311/100, right? Because 3.11 is 3 + 11/100. Similarly, 3.9 is 39/10, which is 390/100. So, comparing 311/100 and 390/100, since 390 is greater than 311, 3.9 is bigger.
Wait, let me check that. 390 divided by 100 is 3.9, and 311 divided by 100 is 3.11. So, yes, 3.9 is bigger. I think that's solid.
Alternatively, I could think about money. If I have $3.11 and someone else has $3.90, which is more money? Well, $3.90 is more than $3.11 because 90 cents is more than 11 cents. That's a practical way to remember.
So, another confirmation: when money is involved, the decimal places represent cents. So, 3.11 is 3 dollars and 11 cents, and 3.90 is 3 dollars and 90 cents. Clearly, 90 cents is more than 11 cents, so 3.90 is more than 3.11.
Is there any other way to think about this? Maybe using number lines. If I imagine a number line starting at 3.00, then 3.11 is somewhere between 3.00 and 4.00, and 3.90 is even closer to 4.00. Since 3.90 is closer to 4.00, it must be larger than 3.11.
Wait, but how far is each from 3.00? 3.11 is 0.11 away, and 3.90 is 0.90 away. So, clearly, 3.90 is further along the number line, which means it's bigger.
I think I'm overcomplicating it. The straightforward way is to look at the tenths place. Since 9 is greater than 1, 3.9 is bigger than 3.11.
But just to make sure, let me compare each place step by step. Starting from the left, the units place is the same: 3 in both. Then, moving to the tenths place: 1 vs. 9. Since 9 is bigger, we don't need to check the next decimal places.
If the tenths place were equal, we would move to the hundredths place, but since they are different, we can stop there.
Alternatively, I can also think in terms of fractions. 3.11 is 3 and 11/100, and 3.9 is 3 and 90/100. So, 90/100 is definitely larger than 11/100, so 3.9 is larger.
Wait, just to make sure I'm not missing something, sometimes in decimal comparisons, the number of digits can affect the comparison. For example, if one number has more decimal places, does that mean it's automatically bigger? Well, no, because the more decimal places a number has, the more precise it is. But in this case, both numbers have two decimal places, so the extra digit beyond the decimal point doesn't affect the comparison.
So, 3.11 and 3.90, both have two decimal places, so the difference must be in the tenths place. Therefore, 3.90 is larger than 3.11.
I think I've thought through this from multiple angles now: comparing digit by digit, converting to fractions, thinking about money, using a number line, and even considering the difference from the whole number. All these methods consistently show that 3.9 is bigger than 3.11.
Just to recap, the process is:
1. Compare the whole number parts. Both are 3, so equal.
2. Move to the tenths place: 1 vs. 9. 9 is larger, so 3.9 is bigger.
3. If needed, check the hundredths place, but since they are equal, we can stop here.
So, I can confidently say that 3.9 is bigger than 3.11.
**Final Answer**
The larger number is \boxed{3.9}.
</think>
To determine which number is larger between 3.11 and 3.9, we can follow these steps:
1. Compare the whole number parts. Both numbers have 3 as the whole number part, so they are equal.
2. Move to the tenths place. In 3.11, the tenths place is 1, and in 3.9, the tenths place is 9. Since 9 is greater than 1, 3.9 is larger.
Thus, the larger number is \boxed{3.9}.
"""
no think
Sometimes the <think> tag is missing.
- https://www.bilibili.com/video/BV1ugRxYeEt4/
prompt = '<|begin▁of▁sentence|><|User|>3.11 and 3.9 which is bigger? Please reason step by step, and put your final answer within \\boxed{}.<|Assistant|><think>\n</think>'
prompt
resp = llm.generate(prompt, sampling_params=sampling_params)[0]
print(resp.outputs[0].text)
"""
To determine which number is larger between **3.11** and **3.9**, follow these steps:
1. **Compare the whole number parts**: Both numbers have the same whole number part, which is **3**.
2. **Compare the decimal parts**:
- **0.11** (from 3.11)
- **0.90** (from 3.9, which can be written as **0.90** to have the same number of decimal places)
3. **Compare the tenths place**:
- **1** (from 3.11)
- **9** (from 3.9)
Since **9** is greater than **1**, the tenths place of **3.9** is larger than that of **3.11**.
4. **Conclusion**: Because the tenths place of **3.9** is larger, **3.9** is the larger number.
**Final Answer**: \boxed{3.9}
"""
15 [veRL] Understanding training parameters from first principles: PPO & GRPO, batch size, KL & entropy
This episode is mainly about how to write veRL config files, which is admittedly a bit dry.
Get it running, get it right, get it fast:
- https://verl.readthedocs.io/en/latest/examples/config.html
- https://verl.readthedocs.io/en/latest/perf/perf_tuning.html
- https://verl.readthedocs.io/en/latest/perf/device_tuning.html
- https://github.com/volcengine/verl/blob/main/examples/tuning/7b/qwen2-7b_grpo_2_h800_fsdp_vllm.sh
- https://github.com/volcengine/verl/blob/main/examples/tuning/14b/qwen2_14b_grpo_4_h800_fsdp_vllm.sh
- https://github.com/volcengine/verl/blob/main/examples/grpo_trainer/run_qwen2-7b.sh
- github => deepwiki
The running example is a 7B GRPO job:
- main_ppo.py
  - instantiation: trainer = RayPPOTrainer; trainer.fit
- ray_trainer.py defines the generation/training workflow/pipeline (task scheduling)
  - generation (experience preparation)
    - generate_sequences
      - ray::WorkerDict.actor_rollout_generate_sequences
    - compute_log_prob
    - compute_ref_log_prob
    - reward_fn
    - advantage
  - training
    - update_actor
1 PPO & GRPO
A quick review of the differences and connections between the two: essentially, GRPO drops the value model but pays for it with many more samples (a group of rollouts per prompt).
$$\mathcal{J}_{PPO}(\theta) = \mathbb{E}[q \sim P(Q), o \sim \pi_{\theta_{old}}(O|q)] \frac{1}{|o|} \sum_{t=1}^{|o|} \min \left[ \frac{\pi_{\theta}(o_t|q, o_{<t})}{\pi_{\theta_{old}}(o_t|q, o_{<t})} A_t,\ \text{clip} \left( \frac{\pi_{\theta}(o_t|q, o_{<t})}{\pi_{\theta_{old}}(o_t|q, o_{<t})},\ 1 - \epsilon,\ 1 + \epsilon \right) A_t \right]$$
- computing $r$ (defined at the token level, with a (reverse) KL term inside the reward):
  - $r_t = r_{\phi}(q, o_{\le t}) - \beta \log \frac{\pi_{\theta}(o_t|q, o_{<t})}{\pi_{ref}(o_t|q, o_{<t})}$
- computing GAE (the advantage): $(r, v) \Rightarrow \text{GAE}$
  - $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$
  - $\hat{A}_t^{GAE(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}$
  - $\hat{A}_t^{GAE(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \left( r_{t+l} + \gamma V(s_{t+l+1}) - V(s_{t+l}) \right)$
- $\hat{A}_t^{GAE(\gamma, \lambda)}$: the generalized advantage estimate at time step $t$.
- $\gamma$: the discount factor, usually in $[0, 1]$, weighting how much future rewards matter.
- $\lambda$: the GAE parameter, usually in $[0, 1]$, trading off bias against variance.
  - With $\lambda = 0$, GAE reduces to the one-step TD advantage estimate $\hat{A}_t = \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ (low variance, high bias).
  - With $\lambda = 1$, GAE sums the discounted TD residuals all the way to the end of the episode, similar to a Monte Carlo advantage estimate (low bias, high variance).
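As a minimal sketch of the GAE recursion above, for a single response, assuming $V(s_T) = 0$ past the last token (illustrative, not verl's exact implementation):

import torch

def compute_gae(rewards: torch.Tensor, values: torch.Tensor,
                gamma: float = 1.0, lam: float = 1.0) -> torch.Tensor:
    # rewards[t] = r_t, values[t] = V(s_t), both of shape [T]
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    last_gae = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0   # V(s_T) assumed 0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual delta_t
        last_gae = delta + gamma * lam * last_gae            # A_t = delta_t + gamma*lam*A_{t+1}
        advantages[t] = last_gae
    return advantages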
$$\mathcal{J}_{GRPO}(\theta) = \mathbb{E}[q \sim P(Q), \{o_i\}_{i=1}^G \sim \pi_{\theta_{old}}(O|q)] \\ \frac{1}{G} \sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left\{ \min \left[ \frac{\pi_{\theta}(o_{i,t}|q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q, o_{i,<t})} \hat{A}_{i,t},\ \text{clip} \left( \frac{\pi_{\theta}(o_{i,t}|q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q, o_{i,<t})},\ 1 - \epsilon,\ 1 + \epsilon \right) \hat{A}_{i,t} \right] - \beta \mathbb{D}_{KL}[\pi_{\theta}||\pi_{ref}] \right\}$$
- advantage (group-normalized, no value model):
  - $\hat{A}_{i,t} = \tilde{r}_i = \frac{r_i - \text{mean}(r)}{\text{std}(r)}$
- KL estimation (the (reverse) KL term inside the loss), the k3 estimator:
  - $\mathbb{D}_{KL}[\pi_\theta || \pi_{ref}] = \frac{\pi_{ref}(o_{i,t}|q, o_{i,<t})}{\pi_\theta(o_{i,t}|q, o_{i,<t})} - \log \frac{\pi_{ref}(o_{i,t}|q, o_{i,<t})}{\pi_\theta(o_{i,t}|q, o_{i,<t})} - 1$
- the sequence-level (reverse) KL expands into a sum of per-token log-ratios:
  $$\mathbb{D}_{KL}[\pi_\theta || \pi_{ref}] = \sum_{y} \pi_\theta(y|q) \log \frac{\pi_\theta(y|q)}{\pi_{ref}(y|q)} = \mathbb{E}_{y \sim \pi_\theta(\cdot|q)} \left[ \sum_{t=1}^{T} \log \frac{\pi_\theta(o_t | q, o_{<t})}{\pi_{ref}(o_t | q, o_{<t})} \right]$$
  - where $\pi(y|q) = \pi(o_1, \ldots, o_T | q) = \prod_{t=1}^{T} \pi(o_t | q, o_{<t})$
  - $\log \frac{\pi_\theta(y|q)}{\pi_{ref}(y|q)} = \log \frac{\prod_{t=1}^{T} \pi_\theta(o_t | q, o_{<t})}{\prod_{t=1}^{T} \pi_{ref}(o_t | q, o_{<t})}$
  - $= \sum_{t=1}^{T} \log \pi_\theta(o_t | q, o_{<t}) - \sum_{t=1}^{T} \log \pi_{ref}(o_t | q, o_{<t})$
  - $= \sum_{t=1}^{T} \left[ \log \pi_\theta(o_t | q, o_{<t}) - \log \pi_{ref}(o_t | q, o_{<t}) \right]$
  - $= \sum_{t=1}^{T} \log \frac{\pi_\theta(o_t | q, o_{<t})}{\pi_{ref}(o_t | q, o_{<t})}$
- actor.kl_loss_coef: defaults to 0.001 (ppo_trainer.yaml)
- GRPO (use_kl_loss enabled): kl_loss_type: low_var_kl, i.e. the k3 estimator (a sketch below)
- algorithm.kl_penalty (=> algorithm.use_kl_in_reward): the in-reward KL penalty
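A minimal sketch of the per-token KL estimators referenced above: k1 is the plain log-ratio, k3 is the low-variance, non-negative low_var_kl estimator (the function name and signature here are illustrative, not verl's exact kl_penalty code):

import torch

def kl_estimate(logprob: torch.Tensor, ref_logprob: torch.Tensor, kl_type: str) -> torch.Tensor:
    # logprob / ref_logprob: log pi_theta / log pi_ref of the sampled tokens, shape [batch, seq_len]
    if kl_type == "kl":  # k1: log(pi_theta / pi_ref), unbiased but can be negative
        return logprob - ref_logprob
    if kl_type == "low_var_kl":  # k3: r - log r - 1 with r = pi_ref / pi_theta, always >= 0
        log_ratio = ref_logprob - logprob
        return log_ratio.exp() - log_ratio - 1
    raise ValueError(f"unknown kl_type: {kl_type}")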
2 batch size
- Algorithmic metrics (train batch size, PPO mini-batch size) are global (from a single-controller perspective), normalized in each worker. See the normalization code.
- Performance-related parameters (micro batch size, max token length for dynamic batch size) are local parameters that define the per-GPU data allocations. See the normalization code.
- data.train_batch_size=32
  - the number of prompts per RL step
- actor_rollout_ref.rollout.n=8
  - how many responses are sampled per prompt (the GRPO group size)
  - generation produces train_batch_size * rollout.n sequences
- actor.ppo_epochs=1
- actor_rollout_ref.actor.ppo_mini_batch_size=16
  - train_batch_size // ppo_mini_batch_size => how many PPO updates per step
- actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8
  - this is the actual per-GPU batch size during PPO training
- forward-only passes (no grad, no loss):
  - actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=32
  - actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=32
  - these compute the log-probs needed for $\frac{\pi_{\theta}}{\pi_{ref}} = \exp(\log \pi_\theta - \log \pi_{ref})$
if not config.actor_rollout_ref.actor.use_dynamic_bsz:
    assert config.data.train_batch_size >= config.actor_rollout_ref.actor.ppo_mini_batch_size
    sp_size = config.actor_rollout_ref.actor.get('ulysses_sequence_parallel_size', 1)
    if config.actor_rollout_ref.actor.ppo_micro_batch_size is not None:
        assert config.actor_rollout_ref.actor.ppo_mini_batch_size % config.actor_rollout_ref.actor.ppo_micro_batch_size == 0
        assert config.actor_rollout_ref.actor.ppo_micro_batch_size * sp_size >= n_gpus
....
self.config.actor.ppo_mini_batch_size *= self.config.rollout.n
self.config.actor.ppo_mini_batch_size //= (self.device_mesh.size() // self.ulysses_sequence_parallel_size)
After this normalization (the example assumes 2 GPUs, i.e. a data-parallel size of 2):
- ppo_mini_batch_size = 16 * 8 / 2 = 64 (per GPU, now counted in sequences)
- gradient accumulation: ga = ppo_mini_batch_size / ppo_micro_batch_size_per_gpu = 64 / 8 = 8
- some constraints (the arithmetic is spelled out in the sketch below):
  - config.data.train_batch_size >= config.actor_rollout_ref.actor.ppo_mini_batch_size
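Spelling out the batch-size arithmetic above in plain Python (assuming 2 GPUs, as the /2 in the example implies; variable names are illustrative):

train_batch_size = 32               # prompts per RL step
rollout_n = 8                       # responses per prompt (GRPO group size)
n_gpus = 2                          # assumed device count in this walkthrough
ppo_mini_batch_size = 16            # global, counted in prompts
ppo_micro_batch_size_per_gpu = 8    # actual per-GPU forward/backward batch

num_sequences = train_batch_size * rollout_n                     # 256 generated sequences
mini_batch_per_gpu = ppo_mini_batch_size * rollout_n // n_gpus   # 16 * 8 / 2 = 64
grad_accum = mini_batch_per_gpu // ppo_micro_batch_size_per_gpu  # 64 / 8 = 8
print(num_sequences, mini_batch_per_gpu, grad_accum)             # 256 64 8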
3 Other (metrics)
$$\mathcal{L}_{\text{actor}}(\theta) = \mathcal{L}_{\text{PG}}(\theta) - c_1 \mathcal{L}_{\text{entropy}}(\theta) + c_2 \mathcal{L}_{\text{KL}}(\theta)$$
KL Loss
$$\log\frac{\pi_\theta}{\pi_{ref}} = \log\pi_\theta - \log \pi_{ref}$$
In general, kl_loss is greater than 0:
- kl_loss > 0: on average the current policy $\pi_\theta$ assigns the sampled response sequences higher probability than the reference policy $\pi_{ref}$ does. This is what we expect to see during PPO training, since the policy is learning to raise the probability of sequences that earn high reward.
- kl_loss < 0: on average the current policy assigns the sampled responses lower probability than the reference policy. This can appear transiently during optimization, or when the reference policy is itself already good at generating high-reward sequences.
entropy
$$H_t = H(\pi_{\theta}(\cdot | s, a_{<t})) = - \sum_{a'} \pi_{\theta}(a'|s, a_{<t}) \log \pi_{\theta}(a'|s, a_{<t})$$
- minimum 0, maximum $\log|V|$ (a uniform distribution over the vocabulary)
- High entropy: the distribution is flat; the model is uncertain which next token to choose and tends to explore randomly.
- Low entropy: the distribution is sharp; the model is very confident about one or a few tokens.
- The main purposes of adding an entropy loss (as a regularizer) in PPO training:
  - Encourage exploration: prevent the policy from converging prematurely to a local optimum by keeping some randomness, so more candidate response sequences get explored.
  - Prevent policy collapse: keep the policy network from becoming overly deterministic and emitting only a few fixed patterns, preserving generation diversity.
  - Note that the minus sign means the optimizer, while minimizing the total loss, effectively maximizes the entropy term, which encourages exploration.
- Early in PPO training the policy is still fairly random and entropy is high; as training proceeds and the policy sharpens, entropy tends to drop. entropy_coeff is there to keep entropy from dropping too fast or too low. (A sketch of computing per-token entropy follows.)
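The per-token entropy above is straightforward to compute from the logits; a minimal sketch:

import torch

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    # logits: [batch, seq_len, vocab_size] -> per-position entropy H_t, in [0, log |V|]
    logp = torch.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1)

# sanity check: uniform logits hit the maximum log |V|
V = 152064
print(token_entropy(torch.zeros(1, 1, V)))  # ~= log(152064) = 11.93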
4 Getting it running, and running fast
- actor_rollout_ref.model.use_remove_padding=True
- fsdp
  - actor_rollout_ref.model.enable_gradient_checkpointing=True
  - actor_rollout_ref.actor.fsdp_config.param_offload=False
  - actor_rollout_ref.actor.fsdp_config.optimizer_offload=False
- vllm >= 0.8
  - https://verl.readthedocs.io/en/latest/README_vllm0.8.html
16 [veRL] FSDP SFT trainer, SFT vs. RL, cross-entropy loss | loss mask | learning rate scheduler
- A good design/abstraction shared by deep learning frameworks:
  - Dataset, with __getitem__
    - veRL: SFTDataset, RLHFDataset
  - Trainer
    - veRL: FSDPSFTTrainer, RayPPOTrainer
import numpy as np
import matplotlib.pyplot as plt
data & prompt
from transformers import AutoTokenizer
T = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-7B-Instruct')
T.special_tokens_map
"""
{'eos_token': '<|im_end|>',
'pad_token': '<|endoftext|>',
'additional_special_tokens': ['<|im_start|>',
'<|im_end|>',
'<|object_ref_start|>',
'<|object_ref_end|>',
'<|box_start|>',
'<|box_end|>',
'<|quad_start|>',
'<|quad_end|>',
'<|vision_start|>',
'<|vision_end|>',
'<|vision_pad|>',
'<|image_pad|>',
'<|video_pad|>']}
"""
T.eos_token # '<|im_end|>'
sft_dataset.py: SFTDataset
- eos_token: <|im_end|> (151645)
  - "im" = instruct message (what turns a base model into an instruct model)
  - note it is NOT <|endoftext|> (the pad_token, 151643)
# apply chat template
prompt_chat = [{"role": "user", "content": prompt}]
# string
prompt_chat_str = tokenizer.apply_chat_template(prompt_chat, add_generation_prompt=True, tokenize=False)
response_chat_str = response + tokenizer.eos_token
input_ids = torch.cat((prompt_ids, response_ids), dim=-1)
attention_mask = torch.cat((prompt_attention_mask, response_attention_mask), dim=-1)
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello, what's the weather like today?<|im_end|>
<|im_start|>assistant
Hi! Please tell me which city you are in, and I can look up the weather for you.<|im_end|>
2 training process
fsdp_sft_trainer.py: "A lightweight one-file FSDP SFT Trainer"
- logits
  - shape: [batch_size, seq_len, vocab_size]
- aligning logits with labels
  - labels (the targets) are input_ids shifted left by one: labels = input_ids[:, 1:].contiguous()
  - shift_logits drops the prediction at the last time step (logits[..., :-1, :]) to align with labels:
    - shift_logits = logits[..., :-1, :].contiguous()
    - shift_labels = labels.contiguous()
- input_ids = [T1, T2, T3, T4]
  - labels: [T2, T3, T4]
  - shift_logits[..., 0, :] predicts T2
  - shift_logits[..., 1, :] predicts T3
  - shift_logits[..., 2, :] predicts T4
- per-token loss: nn.CrossEntropyLoss(reduction="none")
- loss mask: mask out the prompt & the eos-token position (151645, <|im_end|>)
loss_mask = attention_mask.clone()
if prompt_length > 1:
    # mask out prompt for SFT.
    loss_mask[: min(prompt_length, loss_mask.size(0)) - 1] = 0
# mask out the last token in response
loss_mask[min(prompt_length + response_length, loss_mask.size(0)) - 1] = 0
- the mask indicates which token predictions count toward the loss (e.g. in multi-turn dialogue, only the response part may be supervised, not the prompt). It excludes the last token, since the last token has no next token to predict.
- multi-turn
  - mask the prompts and supervise only the response parts, i.e. compute the loss only on response tokens (a minimal sketch of this masked loss follows)
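Putting the shift and the mask together, a minimal sketch of the masked cross-entropy (not verl's exact code; note the mask is indexed over input positions, where position i supervises token i+1):

import torch
import torch.nn.functional as F

def masked_sft_loss(logits: torch.Tensor, input_ids: torch.Tensor, loss_mask: torch.Tensor) -> torch.Tensor:
    # logits: [B, L, V]; input_ids, loss_mask: [B, L]
    shift_logits = logits[..., :-1, :].contiguous()   # drop the last position's prediction
    shift_labels = input_ids[..., 1:].contiguous()    # targets shifted left by one
    loss = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        reduction="none",
    )
    mask = loss_mask[..., :-1].contiguous().view(-1).float()  # position i predicts token i+1
    return (loss * mask).sum() / mask.sum()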
3 training loss
$$\text{CE} = -\log(P_{\text{true\_token}})$$
- NLL: negative log-likelihood
- suppose training reaches a loss level of 0.3:
  - $P_{\text{true\_token}} = \exp(-0.3) = 0.7408$
- for context, with Qwen's vocab size of 152064:
  - random guessing gives $\text{CE}_{\text{rand}} = -\log\left(\frac{1}{152064}\right) = 11.93$
- from the PPL (perplexity) perspective:
  - $PPL = \exp(\text{CE})$
  - CE = 0.3 => PPL = 1.35: as if, on average, the model has narrowed the next token down to only 1 to 2 (1.3499) equally likely options.
- increasing the batch size stabilizes training; training on single sentences makes the learning curve fluctuate
np.exp(-0.3), np.log(152064), np.exp(0.3), np.exp(11.93)
"""
(0.7408182206817179,
11.932056763842207,
1.3498588075760032,
151751.56167916086)
"""
- at the sentence level, for $W = w_1, w_2, \cdots, w_N$, the joint probability is
  $$p(W) = p(w_1) p(w_2|w_1) \cdots p(w_N|w_1, \cdots, w_{N-1})$$
- average CE:
  $$\text{avg CE} = -\frac{1}{N} \sum_{t=1}^N \log P(w_t|w_1, \cdots, w_{<t}) = -\frac{1}{N} \log P(W)$$
- ppl = exp(ce):
  $$PPL = P(W)^{-\frac{1}{N}} = \left(\exp(\log P(W))\right)^{-\frac{1}{N}} = \exp\left(-\frac{1}{N} \log P(W)\right)$$
4 why -log P
- the further P (the probability assigned to the correct token) falls below 1, the faster the loss grows, diverging as P approaches 0;
p_values = np.linspace(0.01, 1, 400)
neg_log_p_values = -np.log(p_values)
p_highlight = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
neg_log_p_highlight = -np.log(p_highlight)
plt.figure(figsize=(4, 8))
plt.plot(p_values, neg_log_p_values, label='-log(P)')
plt.scatter(p_highlight, neg_log_p_highlight, color='red', zorder=5)
for i, txt in enumerate(p_highlight):
plt.annotate(f'(p={txt:.1f}, y={neg_log_p_highlight[i]:.2f})',
(p_highlight[i], neg_log_p_highlight[i]),
textcoords="offset points",
xytext=(0,10),
ha='center')
plt.title('Plot of -log(p)')
plt.xlabel('p')
plt.ylabel('-log(p)')
plt.grid(False)
plt.legend()
5 overfitting
- keep monitoring both the training losses & val losses
- training loss keeps decreasing, while val loss first decreases and then rises;
- in my experience, around 2 epochs is where val loss bottoms out; after that it climbs back up;
6 learning rate scheduler
- AdamW, lr=1e-5
- get_cosine_schedule_with_warmup
import math

def get_lr_at_step(
    current_step: int,
    initial_lr: float,
    num_warmup_steps: int,
    num_training_steps: int,
    min_lr_ratio: float = 0.0,
    num_cycles: float = 0.5,
):
    assert min_lr_ratio >= 0 and min_lr_ratio <= 1.0
    # coef = 0.5 when min_lr_ratio = 0
    coef = (1 - min_lr_ratio) * 0.5
    # intercept = 0.5 when min_lr_ratio = 0
    intercept = (1 + min_lr_ratio) * 0.5
    if current_step < num_warmup_steps:
        # Linear warmup
        scale = float(current_step) / float(max(1, num_warmup_steps))
    else:
        # Cosine decay
        progress = float(current_step - num_warmup_steps) / float(max(1, num_training_steps - num_warmup_steps))
        x = math.cos(math.pi * float(num_cycles) * 2.0 * progress)
        scale = max(0.0, x * coef + intercept)  # Ensure learning rate is not negative
    return initial_lr * scale
- cos:
  - cos(pi * 0) = 1, cos(pi * 1) = -1
  - as progress goes from 0 to 1 (i.e. the angle goes 0 -> pi), x (the result of math.cos(...)) decreases smoothly from 1 to -1: the "half cosine cycle" from peak to trough.
  - 0.5 * x + 0.5 then maps this onto [1, 0]: the scale starts at 1 (0.5 + 0.5) and decays to 0.
initial_lr = 1e-5
# warmup_ratio = 0.1
num_warmup_steps = 7
total_steps = 72 # This is num_training_steps in the function context
lr_values = [get_lr_at_step(step, initial_lr, num_warmup_steps, total_steps) for step in range(1, total_steps+1)]
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 4))
plt.plot(range(1, total_steps+1), lr_values, marker='o', linestyle='-')
plt.xlabel("Steps")
plt.ylabel("Learning Rate")
plt.title(f"Cosine Schedule with Warmup (LR={initial_lr}, Warmup={num_warmup_steps}, Total={total_steps})")
plt.grid(True)
plt.xticks(list(range(0, total_steps, 5))) # Adjust x-axis ticks for better readability
plt.axvline(x=num_warmup_steps , color='r', linestyle='--', label=f'Warmup End (Step {num_warmup_steps-1})')
plt.legend()
7 SFT vs. RL
- SFT: token-level supervised learning over the response tokens;
- RL: the model itself produces (rolls out) complete responses, and a reward model supplies the reward signal;
- the KL loss against the RL ref model keeps the policy from drifting too far from the original model; it feels like a targeted enhancement of specific abilities
- SFT cares about the loss; RL cares about the reward
  - in RL we don't really watch for the loss to go down; in a sense we even welcome the (KL) loss rising, because that means the policy is exploring;
- SFT memorization, RL generalization
- SFT gives token-level 0/1 supervision; RL gives sequence-level discrete rewards.
- https://qiankunli.github.io/2024/07/28/llm_finetune_practice.html
- https://zhuanlan.zhihu.com/p/26370587517
17 [veRL] fsdp sft trainer follow-up: teacher forcing, shift labels / shift logits, loss mask
In terms of code readability, verl's SFT implementation is much better than trl's.
- video: https://www.bilibili.com/video/BV1eWjtzbEdP
- code: https://github.com/chunhuizhang/llm_rl/blob/main/tutorials/infra/verl/verl_sft_supp.ipynb
labels = input_ids[:, 1:].contiguous()
output = self.fsdp_model(input_ids=input_ids, attention_mask=attention_mask, position_ids=position_ids, use_cache=False)
$$p(W) = p(w_1, w_2, \cdots, w_N) = p(w_1) p(w_2|w_1) \cdots p(w_N|w_1, \cdots, w_{N-1})$$
- During SFT training, the model does not need to do any decoding / (auto-regressive) generation
  - the entire response is training data; here input_ids = prompt + response + eos_token
  - the forward pass just computes, at each position, the standard logits distribution over the whole vocabulary (for predicting the next token)
  - https://www.bilibili.com/video/av1005936005/ (the episode on PPL)
  - (batch_size, sequence_length) => (batch_size, sequence_length, vocab_size)
- Teacher forcing:
  - at every training step, the ground-truth target token is fed as the model's next input, instead of the model's own previous prediction.
  - like a teacher who corrects the student at every step of practice and gives the right answer, so the student continues learning from the correct answer.
  - learning the piano:
    - without teacher forcing: you play a note, then decide the next note based on the one you actually played. One wrong note can derail the whole melody, and it may take a long time to get back on track.
    - with teacher forcing: you play a note, and whether or not it was right, the teacher immediately tells you the correct next note from the score and has you continue from it. You learn the correct way to play the whole piece much faster.
- If sequence_length (prompt + response) is N, the so-called shift logits / shift labels (mentioned in section 16 above):
  - labels[1:] has length N-1; logits[:-1] has length N-1: lengths aligned, loss computed token by token
  - labels[1:]: a left shift, so each position predicts the next token
  - logits[:-1]: drop the last position, since the logits at eos have no next token to predict
- During trl/swift training, besides tracking the loss, token accuracy is also tracked
  - argmax over the logits, compared against the labels
- prompt: 你好吗?
- response: 我很好,谢谢你!EOS
- EOS: <|im_end|>, marking the end of a message: the end of the (user) prompt, and the end of the (assistant) response
| inputs | ? | 我 | 很 | 好 | , | 谢 | 谢 | 你 | ! | EOS |
|---|---|---|---|---|---|---|---|---|---|---|
| labels | 我 | 很 | 好 | , | 谢 | 谢 | 你 | ! | EOS | × |
- the EOS input position gets no label (×): logits[:-1] drops it, since no loss is computed for the prediction made at EOS
- teacher forcing here involves no auto-regressive decoding/generation at all
- the whole response, including the eos token, is supervised and counted in the loss
Oddly enough, some libraries also compute loss on the prompt tokens, while others skip the prompt loss.
import torch
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-0.5B-Instruct')
tokenizer.special_tokens_map
"""
{'eos_token': '<|im_end|>',
'pad_token': '<|endoftext|>',
'additional_special_tokens': ['<|im_start|>',
'<|im_end|>',
'<|object_ref_start|>',
'<|object_ref_end|>',
'<|box_start|>',
'<|box_end|>',
'<|quad_start|>',
'<|quad_end|>',
'<|vision_start|>',
'<|vision_end|>',
'<|vision_pad|>',
'<|image_pad|>',
'<|video_pad|>']}
"""
tokenizer.eos_token, tokenizer.pad_token_id, tokenizer.encode(['<|im_start|>', '<|im_end|>'])
"""
('<|im_end|>', 151643, [151644, 151645])
"""
Note that eos is <|im_end|>, not some other marker (e.g. <|endoftext|>, which is actually the padding token, id 151643).
prompt = '你好吗?'
response = '我很好,谢谢你!'
prompt_chat = [{"role": "user", "content": prompt}]
- previous episodes on chat templates:
  - https://www.bilibili.com/video/BV1LKXSYqE3T/
  - https://www.bilibili.com/video/BV1dsdWYuEXw/
  - https://www.bilibili.com/video/BV1JZLcz4EUC/
prompt_chat_str = tokenizer.apply_chat_template(prompt_chat, add_generation_prompt=True, tokenize=False)
response_chat_str = response + tokenizer.eos_token
prompt_chat_str
The output is:
<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
你好吗?<|im_end|>
<|im_start|>assistant
Compare with applying the chat template without add_generation_prompt:
print(tokenizer.apply_chat_template(prompt_chat, tokenize=False))
The output is:
<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
你好吗?<|im_end|>
1 tokenize
prompt_ids_output = tokenizer(prompt_chat_str, return_tensors="pt", add_special_tokens=False)
prompt_ids = prompt_ids_output["input_ids"][0]
prompt_attention_mask = prompt_ids_output["attention_mask"][0]
prompt_ids, prompt_attention_mask
"""
(tensor([151644, 8948, 198, 2610, 525, 1207, 16948, 11, 3465,
553, 54364, 14817, 13, 1446, 525, 264, 10950, 17847,
13, 151645, 198, 151644, 872, 198, 108386, 101037, 11319,
151645, 198, 151644, 77091, 198]),
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1]))
"""
tokenizer.decode([108386, 101037, 11319]) # 你好吗?
response_ids_output = tokenizer(response_chat_str, return_tensors="pt", add_special_tokens=False)
response_ids = response_ids_output["input_ids"][0]
response_attention_mask = response_ids_output["attention_mask"][0]
response_ids # tensor([ 35946, 101243, 3837, 116642, 6313, 151645])
tokenizer.decode([35946, 101243, 3837, 116642, 6313, 151645]) # 我很好,谢谢你!<|im_end|>
2 padding
prompt_length = prompt_ids.shape[0]
response_length = response_ids.shape[0]
prompt_length, response_length # (32, 6)
input_ids = torch.cat((prompt_ids, response_ids), dim=-1)
attention_mask = torch.cat((prompt_attention_mask, response_attention_mask), dim=-1)
input_ids.shape, attention_mask.shape # (torch.Size([38]), torch.Size([38]))
sequence_length = input_ids.shape[0]
max_length = 40
padded_input_ids = torch.ones(size=(max_length - sequence_length,), dtype=input_ids.dtype) * tokenizer.pad_token_id
padded_attention_mask = torch.zeros(size=(max_length - sequence_length,), dtype=attention_mask.dtype)
padded_input_ids, padded_attention_mask # (tensor([151643, 151643]), tensor([0, 0]))
input_ids = torch.cat((input_ids, padded_input_ids))
attention_mask = torch.cat((attention_mask, padded_attention_mask))
input_ids, attention_mask
"""
(tensor([151644, 8948, 198, 2610, 525, 1207, 16948, 11, 3465,
553, 54364, 14817, 13, 1446, 525, 264, 10950, 17847,
13, 151645, 198, 151644, 872, 198, 108386, 101037, 11319,
151645, 198, 151644, 77091, 198, 35946, 101243, 3837, 116642,
6313, 151645, 151643, 151643]),
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]))
"""
def compute_position_id_with_mask(mask):
return torch.clip(torch.cumsum(mask, dim=-1) - 1, min=0, max=None)
position_ids = compute_position_id_with_mask(attention_mask)
position_ids
"""
tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
        18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
        36, 37, 37, 37])
"""
loss_mask = attention_mask.clone()
if prompt_length > 1:
    # mask out prompt for SFT; the last prompt token stays unmasked, since it predicts the first response token
    loss_mask[: min(prompt_length, loss_mask.size(0)) - 1] = 0
# mask out the last token in response
loss_mask[min(prompt_length + response_length, loss_mask.size(0)) - 1] = 0
min(prompt_length + response_length, loss_mask.size(0))  # 38
# the whole response, including the eos token, is supervised
loss_mask
"""
tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0])
"""