The per-post word limit is currently 90,000, and it has overflowed again; earlier parts of the series:
- RL4LLM (Part 1)
- RL4LLM (Part 2)
Table of Contents
- 14 [RL4LLM] base vs. instruct model, custom chat template (make prefix)
- completion vs. chat
- base model inference
- basics
- vllm inference
- custom chat template
- no think
- 15 [veRL] Understanding training parameters from first principles: PPO & GRPO, batch size, KL & entropy
- 1 PPO & GRPO
- 2 batch size
- 3 Other (metrics)
- KL Loss
- entropy
- 4 Getting it running, and running fast
- 16 [veRL] FSDP SFT trainer, SFT vs. RL, cross-entropy loss | loss mask | learning rate scheduler
- data & prompt
- 2 training process
- 3 training loss
- 4 why -log P
- 5 overfitting
- 6 learning rate scheduler
- 7 SFT vs. RL
- 17 [veRL] fsdp sft trainer follow-up: teacher forcing, shift labels / shift logits, loss mask
- 1 tokenize
- 2 padding
14 [RL4LLM] base vs. instruct model, custom chat template (make prefix)
- https://www.bilibili.com/video/BV1JZLcz4EUC
- https://github.com/chunhuizhang/llm_rl/blob/main/tutorials/tokenizer/base_instruct.ipynb
- https://github.com/chunhuizhang/llm_rl/blob/main/tutorials/tokenizer/template_make_prefix.ipynb
This episode is mainly about how to get a completion (base) model to do QA.
completion vs. chat
- Q/A, U/A, User/Assistant
- a base model has no concept of identity (roles);
  - it is a language model in the strict sense: next-token prediction (word chaining)
- to get it to answer QA-style questions, define the roles inside the prompt (and set a max response length, stop words, etc.);
prompt = f"Q: {question}\nA:"
# you can also try few-shot: provide a few examples
prompt = f"""
Q: What is the capital of Spain?
A: Madrid
Q: What is the capital of Germany?
A: Berlin
Q: {question}
A:
"""
prompt = f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
prompt += f"<|im_start|>user\n{question}<|im_end|>\n"
prompt += "<|im_start|>assistant\n" # 模型将从这里开始生成
from transformers import AutoTokenizer
base_tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-3B')
instruct_tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-3B-Instruct')
print(base_tokenizer.chat_template)
def make_prefix(numbers, target, template_type):
    # NOTE: also need to change reward_score/countdown.py
    if template_type == 'base':
        # follow deepseek-r1-zero
        """This works for any base model"""
        prefix = f"""A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer.
User: Using the numbers {numbers}, create an equation that equals {target}. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final answer in <answer> </answer> tags, for example <answer> (1 + 2) / 3 </answer>.
Assistant: Let me solve this step by step.
<think>"""
    elif template_type == 'qwen-instruct':
        """This works for Qwen Instruct Models"""
        prefix = f"""<|im_start|>system\nYou are a helpful assistant. You first thinks about the reasoning process in the mind and then provides the user with the answer.<|im_end|>\n<|im_start|>user\n Using the numbers {numbers}, create an equation that equals {target}. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final answer in <answer> </answer> tags, for example <answer> (1 + 2) / 3 </answer>.<|im_end|>\n<|im_start|>assistant\nLet me solve this step by step.\n<think>"""
    return prefix
numbers = [ 44, 19, 35 ]
target = 99
base_prompt = make_prefix(numbers, target, 'base')
print(base_prompt)
"""
A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer.
User: Using the numbers [44, 19, 35], create an equation that equals 99. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final answer in <answer> </answer> tags, for example <answer> (1 + 2) / 3 </answer>.
Assistant: Let me solve this step by step.
<think>
"""
instruct_prompt = make_prefix(numbers, target, 'qwen-instruct')
print(instruct_prompt)
"""
<|im_start|>system
You are a helpful assistant. You first thinks about the reasoning process in the mind and then provides the user with the answer.<|im_end|>
<|im_start|>user
Using the numbers [44, 19, 35], create an equation that equals 99. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final answer in <answer> </answer> tags, for example <answer> (1 + 2) / 3 </answer>.<|im_end|>
<|im_start|>assistant
Let me solve this step by step.
<think>
"""
base model inference
from vllm import LLM, SamplingParams
sampling_params = SamplingParams(
temperature=0.6,
max_tokens=1024
)
base_llm = LLM(model='Qwen/Qwen2.5-3B', max_model_len=1024)
base_resp = base_llm.generate(base_prompt, sampling_params)[0]
print(base_resp.outputs[0].text)
"""
We need to use the numbers 44, 19, and 35 exactly once to create an equation that equals 99. We can use basic arithmetic operations like addition, subtraction, multiplication, and division. Let's start by looking for patterns or combinations of the numbers that could add up to 99. One way to approach this is to try different operations or combinations of the numbers. </think>
The final answer is: <answer> 44 + 35 + 19 = 99 </answer>
"""
# a bare prefix: the base model simply free-runs into pretraining-style text
test_resp = base_llm.generate('The capital of China is', sampling_params)[0]
print(test_resp.outputs[0].text)
"""
Beijing.____
A. The capital of China is Beijing.
B. Beijing is the capital of China.
C. The capital of China is Beijing.
D. Beijing is the capital of China.
Answer:
D
The most abundant element in the Earth's crust is ____
A. Oxygen
B. Silicon
C. Aluminum
D. Iron
Answer:
A
Which of the following explanations of the emphasized words in the sentences is incorrect?
A. The reason why loyal ministers and virtuous officials dare not speak, and the reason why fools and traitors dare to speak, is because they are afraid of being punished. Punishment: Punishment.
B. If you want to know the truth, I will tell you. Know: Understand.
C. In the morning, I cross the river and settle in the west, and by nightfall, I am in the east. Cross: Cross.
D. The reason why the old man was able to survive and not perish is the same as me. Pity: Like.
Answer:
A
The starting point of human life is ____
A. Fertilized egg
B. Embryo
C. Infant
D. Newborn
Answer:
A
The solution set for the inequality x^{2}-2x-3>0 is ____
A. (-1, 3)
B. (-∞, -1) ∪ (3, +∞)
C. (-3, 1)
D. (-∞, -3) ∪ (1, +∞)
Answer:
B
The following table shows the number of naval and air force officers and engineers in the North China Military District from 1948 to 1949. This table reflects that the People's Liberation Army ____. | Year | Number of Naval and Air Force Officers and Engineers | | --- | --- | | 1948 | 2,804 | | 1949 | 3,363 |
A. Gradually expanded its scale
B. Won many victories in the southern theater
C. Had a relatively strong combat capability
D. Effectively thwarted the Nationalist army's rearward defense strategy
Answer:
C
"""
test_resp = base_llm.generate('My name is', sampling_params)[0]
print(test_resp.outputs[0].text)
"""
Tom. I am a student. I am in Class Two, Grade Eight. This is my friend, Jack. He is a student, too. He is in Class One, Grade Eight. My Chinese teacher is Mr. Zhang. He is a good teacher. He likes us very much. My English teacher is Miss. Wang. She is very young. She is good with us. She likes us, too. We like them. 根据短文内容,判断正误(正确的写"正确",错误的写"错误")。 (1). 2. Miss. Wang is a good Chinese teacher. (2). 3. Tom is in Class Two, Grade Eight. (3). 4. Mr. Zhang is Tom's English teacher. (4). 5. Jack and Tom are in the same class. (5). 1. Jack is a student, too.
【小题1】错误 【小题2】正确 【小题3】正确 【小题4】错误 【小题5】错误
根据汉语意思完成句子。 【 1 】 这个房间是用空气新鲜的木材做的。 This room is made of ___________. 【 2 】 我们必须阻止人们在森林里砍伐树木。 We must _______________ people from cutting down trees in the forest. 【 3 】 请不要把纸屑扔在地板上。 Please don't ___________ the paper on the floor. 【 4 】 环保对我们来说非常重要。 It is ___________ for us to protect the environment. 【 5 】 为了保护我们美丽的地球,我们不能乱扔垃圾。 We can't ___________ rubbish because we must protect our beautiful earth.
【 1 】 fresh air 【 2 】 stop 【 3 】 throw away 【 4 】 important 【 5 】 throw away
阅读下面的文字,完成下列小题。 雪山 谢大立 10月25日,是红军长征胜利70周年的日子。 一大早,我们一行就匆匆地赶到了雪山脚下。 1936年10月,红军三大主力在甘肃会宁胜利会师,宣告长征胜利结束。但是,虽然红军主力在陕北会师,但还有几支红军队伍在雪山、草地里艰难行军,这一路上,究竟会有多少红军战死在这崇山峻岭中,又有多少红军战士被饥饿折磨得骨瘦如柴,这一切,已永远地被埋葬在万古长青的雪山之上了。 1936年10月,是红军长征胜利70周年的日子。 一大早,我们一行就匆匆地赶到了雪山脚下。 1936年10月,红军三大主力在甘肃会宁胜利会师,宣告长征胜利结束。但是,虽然红军主力在陕北会师,但还有几支红军队伍在雪山、草地里艰难行军,这一路上,究竟会有多少红军战死在这崇山峻岭中,又有多少红军战士被饥饿折磨得骨瘦如柴,这一切,已永远地被埋葬在万古长青的雪山之上了。 1936年10月,是红军长征胜利70周年的日子。 一大早,我们一行就匆匆地赶到了雪山脚下。 1936年10月,红军三大主力在甘肃会宁胜利会师,宣告长征胜利结束。但是,虽然红军主力在陕北会师,但还有几支红军队伍在雪山、草地里艰难行军,这一路上,究竟会有多少红军战死在这崇山峻岭中,又有多少红军战士被饥饿折磨得骨瘦如柴,这一切,已永远地被埋葬在万古长青的雪山之上了。 1936年10月,是红军长征胜利70周年的日子。 一大早,我们一行就匆匆地赶到了雪山脚下。 1936年10月,红军三大主力在甘肃会宁胜利会师,宣告长征胜利结束。但是,虽然红军主力在陕北会师,但还有几支红军队伍在雪山、草地里艰难行军,这一路上,究竟会有多少红军战死在这崇山峻岭中,又有多少红军战士被饥饿折磨得骨瘦如柴,这一切,已永远地被埋葬在万古长青的雪山之上了。 1936年10月,是红军长征胜利70周年的日子。 一大早,我们一行就匆匆地赶到了雪山脚下。 1936年10月,红军三大主力在甘肃会宁胜利会师,宣告长征胜利结束。但是,虽然红军主力在陕北会师,但还有几支红军队伍在雪山、
"""
test_resp = base_llm.generate('Long long ago, there', sampling_params)[0]
print(test_resp.outputs[0].text)
"""
was a little girl who loved to play in the house. She picked up everything. She put it away, and then she picked it up again. She put it away, and then she picked it up again. Finally, her mother said, "I'm going to put a sign on the door. Then you won't be able to come in any more." "What sign, Mom?" "It'll say, 'Out of Order'," said her mother. "Oh," said the little girl. Then she went and hid under the bed. A few minutes later, her mother called her, "Come in here." The little girl came out from under the bed. "What's wrong, Mom?" "I put the sign on the door," said her mother, "and I can't open it." 【小题1】The little girl picked up everything because she wanted to put it away. 【小题2】The little girl put it away because her mother asked her to do so. 【小题3】The little girl was very angry with her mother. 【小题4】The mother didn't want to play with the little girl. 【小题5】The mother could not open the door because the sign was on it. 【小题1】T 【小题2】F 【小题3】T 【小题4】T 【小题5】T
阅读下面的文章,完成后面题目。 《红楼梦》中女性形象的复杂性 一、《红楼梦》中女性形象的复杂性 《红楼梦》中人物众多,女性形象更是丰富多彩。 《红楼梦》中女性形象的复杂性,主要表现在以下方面: 1.女性的阶级性。阶级是社会上最本质、最直接的差别。《红楼梦》中女性形象的阶级性,主要表现在她们所处的社会地位的不同。《红楼梦》中女性形象的阶级性,是决定其性格的重要因素,也是决定其命运的重要因素。 2.女性的性别特征。《红楼梦》中女性形象的性别特征,主要表现在其在性别方面所特有的差异上。 3.女性的文学性。文学性是指作品中人物形象所具有的审美价值和艺术魅力。《红楼梦》中女性形象的文学性,主要表现在以下方面:①《红楼梦》中女性形象的典型性。②《红楼梦》中女性形象的艺术性。 4.女性的象征性。《红楼梦》中女性形象的象征性,主要表现在两个方面:①女性形象的隐喻性。②女性形象的隐喻性。 《红楼梦》中女性形象的复杂性,是个性与共性的统一。个性是指《红楼梦》中女性形象所具有的特殊性。共性是指《红楼梦》中女性形象所具有的普遍性,即《红楼梦》中女性形象所具有的共有的品格、气质、思想、性格等。 总之,《红楼梦》中女性形象的复杂性,是个性与共性的统一,是人物形象与社会现实的统一,是人物形象与民族心理的统一。 (选自《红楼梦论丛》,有改动) 1.下列对《红楼梦》中女性形象复杂性的理解,不正确的一项是 A.《红楼梦》中女性形象的复杂性,主要表现在她们所处的社会地位的不同。 B.《红楼梦》中女性形象的阶级性,是决定其性格和命运的重要因素。 C.《红楼梦》中女性形象的性别特征,主要表现在其在性别方面所特有的差异上。 D.《红楼梦》中女性形象的文学性,主要表现在其典型性和艺术性。 2.下列对《红楼梦》中女性形象复杂性的理解,不正确的一项是 A.《红楼梦》中女性形象的复杂性,是《红楼梦》中人物形象与社会现实的统一。 B.《红楼梦》中女性形象的复杂性,是《红楼梦》中人物形象与民族心理的统一。 C.《红楼梦》中女性形象的复杂性,是《红楼梦》中人物形象个性与共性的统一。 D.《红楼梦》中女性形象的复杂性,是《红楼梦》中人物形象与《红楼梦》中社会现实的统一。 3.下列对《红楼梦》中女性形象复杂性的理解,不正确的一项是 A.《红楼梦》中女性形象的复杂性,是《红楼梦》中人物形象与《红楼梦》中民族心理的统一。 B.《红楼梦》中女性形象的复杂性,是《红楼梦》中人物形象与《红楼梦》中社会现实的统一。 C.《红楼梦》中女性形象的复杂性,是《红楼梦》中人物形象与《红楼梦》中文学性的统一。 D.《红楼梦》中女性形象的复杂性,是《红楼梦》中人物形象与《红楼梦》中阶级
"""
test_resp = base_llm.generate(instruct_prompt, sampling_params)[0]
print(test_resp.outputs[0].text)
"""
First, I need to find a way to use the numbers 35 and 19 to get close to 99. I can start by adding 35 and 19, which gives me 54. Then, I can subtract 54 from 99, which gives me 45. Now, I need to find a way to get from 45 to 44. I can subtract 45 by 1, which gives me -1. But that doesn't work because I can't use -1 as a number in my equation. So, I need to find another way to get from 45 to 44. I can divide 45 by 1.1, which gives me 40.90909090909091. Then, I can subtract 40.90909090909091 by 0.9090909090909091, which gives me 40. Now, I need to find a way to get from 40 to 44. I can multiply 40 by 1.1, which gives me 44. But that doesn't work because I can't use 1.1 as a number in my equation. So, I need to find another way to get from 40 to 44. I can add 40 by 0.4, which gives me 40.4. Then, I can subtract 40.4 by 0.4, which gives me 40. But that doesn't work because I can't use 40 as a number in my equation. So, I need to find another way to get from 40 to 44. I can add 40 by 0.4, which gives me 40.4. Then, I can subtract 40.4 by 0.4, which gives me 40. But that doesn't work because I can't use 40 as a number in my equation. So, I need to find another way to get from 40 to 44. I can add 40 by 0.4, which gives me 40.4. Then, I can subtract 40.4 by 0.4, which gives me 40. But that doesn't work because I can't use 40 as a number in my equation. So, I need to find another way to get from 40 to 44. I can add 40 by 0.4, which gives me 40.4. Then, I can subtract 40.4 by 0.4, which gives me 40. But that doesn't work because I can't use 40 as a number in my equation. So, I need to find another way to get from 40 to 44. I can add 40 by 0.4, which gives me 40.4. Then, I can subtract 40.4 by 0.4, which gives me 40. But that doesn't work because I can't use 40 as a number in my equation. So, I need to find another way to get from 40 to 44. I can add 40 by 0.4, which gives me 40.4. Then, I can subtract 40.4 by 0.4, which gives me 40. But that doesn't work because I can't use 40 as a number in my equation. So, I need to find another way to get from 40 to 44. I can add 40 by 0.4, which gives me 40.4. Then, I can subtract 40.4 by 0.4, which gives me 40. But that doesn't work because I can't use 40 as a number in my equation. So, I need to find another way to get from 40 to 44. I can add
"""
basics
- prompt vs. response
  - prompt: resp.prompt, resp.prompt_token_ids
  - response: resp.outputs[0].text, resp.outputs[0].token_ids
- make_prefix (TinyZero)
  - https://github.com/Jiayi-Pan/TinyZero/blob/main/examples/data_preprocess/countdown.py#L57-L66
prompt = tokenizer.apply_chat_template(basic_messages, tokenize=False)
# '<|begin▁of▁sentence|><|User|>3.11 and 3.9 which is bigger? Please reason step by step, and put your final answer within \\boxed{}.'
prompt = tokenizer.apply_chat_template(basic_messages, tokenize=False, add_generation_prompt=True)
# '<|begin▁of▁sentence|><|User|>3.11 and 3.9 which is bigger? Please reason step by step, and put your final answer within \\boxed{}.<|Assistant|><think>\n'
# custom (chat template edited so the generation prompt does not force <think>)
prompt = tokenizer.apply_chat_template(basic_messages, tokenize=False, add_generation_prompt=True)
# '<|begin▁of▁sentence|><|User|>3.11 and 3.9 which is bigger? Please reason step by step, and put your final answer within \\boxed{}.<|Assistant|>'
# custom no think (chat template edited to pre-fill an empty think block)
prompt = tokenizer.apply_chat_template(basic_messages, tokenize=False, add_generation_prompt=True)
# '<|begin▁of▁sentence|><|User|>3.11 and 3.9 which is bigger? Please reason step by step, and put your final answer within \\boxed{}.<|Assistant|><think>\n</think>'
- load the parquet dataset
  - https://github.com/Jiayi-Pan/TinyZero/blob/main/verl/utils/dataset/rl_dataset.py#L128
- default
  - https://github.com/volcengine/verl/blob/main/verl/utils/dataset/rl_dataset.py#L169
  - prompt_with_chat_template = self.tokenizer.apply_chat_template(chat, add_generation_prompt=True, tokenize=False)
- generate & reward func
- reward func: decode prompt + response, then score (a hypothetical scoring sketch follows the snippet below)
sequences = torch.cat((valid_prompt_ids, valid_response_ids))
sequences_str = self.tokenizer.decode(sequences)
score = compute_score_fn(solution_str=sequences_str, ground_truth=ground_truth)
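The actual scoring logic lives in TinyZero's reward_score/countdown.py; below is only a rough, hypothetical sketch of what such a compute_score_fn could look like for the countdown task (the 0.1/1.0 format/answer scores and the parsing details are illustrative assumptions, not the repo's exact code):

import re

def compute_score_fn(solution_str: str, ground_truth: dict) -> float:
    # hypothetical countdown-style reward: parse <answer>...</answer>, then
    # check number usage and the arithmetic result
    match = re.search(r"<answer>(.*?)</answer>", solution_str, re.DOTALL)
    if match is None:
        return 0.0  # no parsable answer => zero reward
    equation = match.group(1).strip()
    try:
        used = sorted(int(n) for n in re.findall(r"\d+", equation))
        if used != sorted(ground_truth["numbers"]):
            return 0.1  # right format, wrong number usage
        value = eval(equation, {"__builtins__": {}}, {})  # arithmetic expression only
        return 1.0 if abs(value - ground_truth["target"]) < 1e-6 else 0.1
    except Exception:
        return 0.0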
from transformers import AutoTokenizer
import re
import torch
model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
basic_messages = [
{"role": "user", "content": "3.11 and 3.9 which is bigger? Please reason step by step, and put your final answer within \\boxed{}."}
]
tokenizer.apply_chat_template(basic_messages, tokenize=False)
tokenizer.apply_chat_template(basic_messages, tokenize=False, add_generation_prompt=True)
vllm inference
Running a DeepSeek-R1 distill under ollama and asking it to reason about which of 9.11 and 9.9 is bigger, the output sometimes contains no think block at all.
from vllm import LLM, SamplingParams
sampling_params = SamplingParams(
temperature=0.6,
max_tokens=32768
)
llm = LLM(model=model_id, max_model_len=32768)
prompt = tokenizer.apply_chat_template(basic_messages, tokenize=False, add_generation_prompt=True)
prompt
resp = llm.generate(prompt, sampling_params=sampling_params)[0]
print(resp.prompt)
print(resp.prompt_token_ids)
assert tokenizer.encode(resp.prompt) == resp.prompt_token_ids
tokenizer.decode(151646), tokenizer.decode(7810)
len(resp.outputs[0].token_ids), len(tokenizer.encode(resp.outputs[0].text))
custom chat template
prompt = tokenizer.apply_chat_template(basic_messages, tokenize=False)
resp = llm.generate(prompt, sampling_params=sampling_params)[0]
print(resp.outputs[0].text)
"""
**
</think>
To determine which number is bigger between **3.11** and **3.9**, follow these steps:
1. **Compare the whole number part** of both numbers. Both have **3** as the whole number.
2. **Compare the decimal parts**:
- **0.11** (from 3.11)
- **0.9** (from 3.9, which can be written as 3.90)
3. **Convert 3.9 to two decimal places**: 3.90
4. **Compare 0.11 and 0.90**:
- **0.11** is less than **0.90**
5. **Conclusion**: Since 0.11 is less than 0.90, **3.90** is larger than **3.11**.
**Final Answer**: \boxed{3.9}
"""
prompt = tokenizer.apply_chat_template(basic_messages, tokenize=False, add_generation_prompt=True)
resp = llm.generate(prompt, sampling_params=sampling_params)[0]
prompt = '<|begin▁of▁sentence|><|User|>3.11 and 3.9 which is bigger? Please reason step by step, and put your final answer within \\boxed{}.<|Assistant|>'
resp = llm.generate(prompt, sampling_params=sampling_params)[0]
print(resp.outputs[0].text)
"""
<think>
Alright, so I've got this problem here: 3.11 and 3.9, and I need to figure out which one is bigger. Hmm, okay. Let me think about how to approach this. I'm pretty sure that when comparing decimals, you start from the left and compare each digit one by one. So, first, I should look at the whole number part of both numbers.
Both 3.11 and 3.9 have the same whole number part, which is 3. That means the whole numbers are equal, so I can't say one is bigger just yet. I need to look at the decimal parts.
The first decimal place after the decimal point is the tenths place. In 3.11, the tenths place is 1, and in 3.9, the tenths place is 9. Since 9 is greater than 1, that means 3.9 is larger than 3.11. Wait, let me make sure I'm doing this right.
So, if I write both numbers aligned by their decimal points:
3.11
3.9
I can think of 3.9 as 3.90 to make the comparison easier. Now, comparing 3.11 and 3.90. The first digit after the decimal is 1 vs. 9. Since 9 is bigger, 3.90 is bigger than 3.11. Yeah, that makes sense.
Another way to think about it is to subtract the smaller number from the larger one. If the result is positive, then the first number is bigger. So, 3.90 minus 3.11 is 0.79, which is positive, so 3.90 is indeed bigger.
Wait, but what if the numbers were, say, 3.11 and 3.99? Then, the tenths place is 1 vs. 9, so 3.99 would still be bigger. But in this case, since the tenths place is only 1 for 3.11, it's clear that 3.9 has a higher tenths place.
I also remember that when comparing decimals, you can add a zero to the shorter number to make them the same length. So, 3.9 becomes 3.90, and then comparing 3.11 and 3.90 is straightforward.
Is there any chance I might have made a mistake here? Maybe if I misaligned the decimals or added incorrectly. Let me try another approach. I can convert both numbers to fractions.
3.11 is equal to 311/100, right? Because 3.11 is 3 + 11/100. Similarly, 3.9 is 39/10, which is 390/100. So, comparing 311/100 and 390/100, since 390 is greater than 311, 3.9 is bigger.
Wait, let me check that. 390 divided by 100 is 3.9, and 311 divided by 100 is 3.11. So, yes, 3.9 is bigger. I think that's solid.
Alternatively, I could think about money. If I have $3.11 and someone else has $3.90, which is more money? Well, $3.90 is more than $3.11 because 90 cents is more than 11 cents. That's a practical way to remember.
So, another confirmation: when money is involved, the decimal places represent cents. So, 3.11 is 3 dollars and 11 cents, and 3.90 is 3 dollars and 90 cents. Clearly, 90 cents is more than 11 cents, so 3.90 is more than 3.11.
Is there any other way to think about this? Maybe using number lines. If I imagine a number line starting at 3.00, then 3.11 is somewhere between 3.00 and 4.00, and 3.90 is even closer to 4.00. Since 3.90 is closer to 4.00, it must be larger than 3.11.
Wait, but how far is each from 3.00? 3.11 is 0.11 away, and 3.90 is 0.90 away. So, clearly, 3.90 is further along the number line, which means it's bigger.
I think I'm overcomplicating it. The straightforward way is to look at the tenths place. Since 9 is greater than 1, 3.9 is bigger than 3.11.
But just to make sure, let me compare each place step by step. Starting from the left, the units place is the same: 3 in both. Then, moving to the tenths place: 1 vs. 9. Since 9 is bigger, we don't need to check the next decimal places.
If the tenths place were equal, we would move to the hundredths place, but since they are different, we can stop there.
Alternatively, I can also think in terms of fractions. 3.11 is 3 and 11/100, and 3.9 is 3 and 90/100. So, 90/100 is definitely larger than 11/100, so 3.9 is larger.
Wait, just to make sure I'm not missing something, sometimes in decimal comparisons, the number of digits can affect the comparison. For example, if one number has more decimal places, does that mean it's automatically bigger? Well, no, because the more decimal places a number has, the more precise it is. But in this case, both numbers have two decimal places, so the extra digit beyond the decimal point doesn't affect the comparison.
So, 3.11 and 3.90, both have two decimal places, so the difference must be in the tenths place. Therefore, 3.90 is larger than 3.11.
I think I've thought through this from multiple angles now: comparing digit by digit, converting to fractions, thinking about money, using a number line, and even considering the difference from the whole number. All these methods consistently show that 3.9 is bigger than 3.11.
Just to recap, the process is:
1. Compare the whole number parts. Both are 3, so equal.
2. Move to the tenths place: 1 vs. 9. 9 is larger, so 3.9 is bigger.
3. If needed, check the hundredths place, but since they are equal, we can stop here.
So, I can confidently say that 3.9 is bigger than 3.11.
**Final Answer**
The larger number is \boxed{3.9}.
</think>
To determine which number is larger between 3.11 and 3.9, we can follow these steps:
1. Compare the whole number parts. Both numbers have 3 as the whole number part, so they are equal.
2. Move to the tenths place. In 3.11, the tenths place is 1, and in 3.9, the tenths place is 9. Since 9 is greater than 1, 3.9 is larger.
Thus, the larger number is \boxed{3.9}.
"""
no think
Sometimes the <think> tag is missing.
- https://www.bilibili.com/video/BV1ugRxYeEt4/
prompt = '<|begin▁of▁sentence|><|User|>3.11 and 3.9 which is bigger? Please reason step by step, and put your final answer within \\boxed{}.<|Assistant|><think>\n</think>'
prompt
resp = llm.generate(prompt, sampling_params=sampling_params)[0]
print(resp.outputs[0].text)
"""
To determine which number is larger between **3.11** and **3.9**, follow these steps:
1. **Compare the whole number parts**: Both numbers have the same whole number part, which is **3**.
2. **Compare the decimal parts**:
- **0.11** (from 3.11)
- **0.90** (from 3.9, which can be written as **0.90** to have the same number of decimal places)
3. **Compare the tenths place**:
- **1** (from 3.11)
- **9** (from 3.9)
Since **9** is greater than **1**, the tenths place of **3.9** is larger than that of **3.11**.
4. **Conclusion**: Because the tenths place of **3.9** is larger, **3.9** is the larger number.
**Final Answer**: \boxed{3.9}
"""
15 [veRL] Understanding training parameters from first principles: PPO & GRPO, batch size, KL & entropy
This episode is mainly about how to write veRL config files, which is admittedly a bit dry.
Get it running, get it right, get it fast:
- https://verl.readthedocs.io/en/latest/examples/config.html
- https://verl.readthedocs.io/en/latest/perf/perf_tuning.html
- https://verl.readthedocs.io/en/latest/perf/device_tuning.html
- https://github.com/volcengine/verl/blob/main/examples/tuning/7b/qwen2-7b_grpo_2_h800_fsdp_vllm.sh
- https://github.com/volcengine/verl/blob/main/examples/tuning/14b/qwen2_14b_grpo_4_h800_fsdp_vllm.sh
- https://github.com/volcengine/verl/blob/main/examples/grpo_trainer/run_qwen2-7b.sh
- github => deepwiki
The running example is a 7B GRPO job:
- main_ppo.py
  - instantiation: trainer = RayPPOTrainer; trainer.fit
- ray_trainer.py defines the generation/training workflow/pipeline (task scheduling)
  - generation (experience preparation)
    - generate_sequences
      - ray::WorkerDict.actor_rollout_generate_sequences
    - compute_log_prob
    - compute_ref_log_prob
    - reward_fn
    - advantage
  - training
    - update_actor
1 PPO & GRPO
A quick review of the differences and connections between the two: essentially, GRPO drops the value model but pays for it with many more samples (a group of rollouts per prompt).
$$\mathcal{J}_{PPO}(\theta) = \mathbb{E}[q \sim P(Q), o \sim \pi_{\theta_{old}}(O|q)] \frac{1}{|o|} \sum_{t=1}^{|o|} \min \left[ \frac{\pi_{\theta}(o_t|q, o_{<t})}{\pi_{\theta_{old}}(o_t|q, o_{<t})} A_t,\ \text{clip} \left( \frac{\pi_{\theta}(o_t|q, o_{<t})}{\pi_{\theta_{old}}(o_t|q, o_{<t})},\ 1 - \epsilon,\ 1 + \epsilon \right) A_t \right]$$
- computing $r$ (defined at the token level, with a (reverse) KL term inside the reward):
  - $r_t = r_{\phi}(q, o_{\le t}) - \beta \log \frac{\pi_{\theta}(o_t|q, o_{<t})}{\pi_{ref}(o_t|q, o_{<t})}$
- computing GAE (the advantage): $(r, v) \Rightarrow \text{GAE}$
  - $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$
  - $\hat{A}_t^{GAE(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}$
  - $\hat{A}_t^{GAE(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \left( r_{t+l} + \gamma V(s_{t+l+1}) - V(s_{t+l}) \right)$
- $\hat{A}_t^{GAE(\gamma, \lambda)}$: the generalized advantage estimate at time step $t$.
- $\gamma$: the discount factor, usually in $[0, 1]$, weighting how much future rewards matter.
- $\lambda$: the GAE parameter, usually in $[0, 1]$, trading off bias against variance.
  - With $\lambda = 0$, GAE reduces to the one-step TD advantage estimate $\hat{A}_t = \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ (low variance, high bias).
  - With $\lambda = 1$, GAE sums the discounted TD residuals all the way to the end of the episode, similar to a Monte Carlo advantage estimate (low bias, high variance).
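As a minimal sketch of the GAE recursion above, for a single response, assuming $V(s_T) = 0$ past the last token (illustrative, not verl's exact implementation):

import torch

def compute_gae(rewards: torch.Tensor, values: torch.Tensor,
                gamma: float = 1.0, lam: float = 1.0) -> torch.Tensor:
    # rewards[t] = r_t, values[t] = V(s_t), both of shape [T]
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    last_gae = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0   # V(s_T) assumed 0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual delta_t
        last_gae = delta + gamma * lam * last_gae            # A_t = delta_t + gamma*lam*A_{t+1}
        advantages[t] = last_gae
    return advantages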
$$\mathcal{J}_{GRPO}(\theta) = \mathbb{E}[q \sim P(Q), \{o_i\}_{i=1}^G \sim \pi_{\theta_{old}}(O|q)] \\ \frac{1}{G} \sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left\{ \min \left[ \frac{\pi_{\theta}(o_{i,t}|q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q, o_{i,<t})} \hat{A}_{i,t},\ \text{clip} \left( \frac{\pi_{\theta}(o_{i,t}|q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t}|q, o_{i,<t})},\ 1 - \epsilon,\ 1 + \epsilon \right) \hat{A}_{i,t} \right] - \beta \mathbb{D}_{KL}[\pi_{\theta}||\pi_{ref}] \right\}$$
- advantage (group-normalized, no value model):
  - $\hat{A}_{i,t} = \tilde{r}_i = \frac{r_i - \text{mean}(r)}{\text{std}(r)}$
- KL estimation (the (reverse) KL term inside the loss), the k3 estimator:
  - $\mathbb{D}_{KL}[\pi_\theta || \pi_{ref}] = \frac{\pi_{ref}(o_{i,t}|q, o_{i,<t})}{\pi_\theta(o_{i,t}|q, o_{i,<t})} - \log \frac{\pi_{ref}(o_{i,t}|q, o_{i,<t})}{\pi_\theta(o_{i,t}|q, o_{i,<t})} - 1$
- the sequence-level (reverse) KL expands into a sum of per-token log-ratios:
  $$\mathbb{D}_{KL}[\pi_\theta || \pi_{ref}] = \sum_{y} \pi_\theta(y|q) \log \frac{\pi_\theta(y|q)}{\pi_{ref}(y|q)} = \mathbb{E}_{y \sim \pi_\theta(\cdot|q)} \left[ \sum_{t=1}^{T} \log \frac{\pi_\theta(o_t | q, o_{<t})}{\pi_{ref}(o_t | q, o_{<t})} \right]$$
  - where $\pi(y|q) = \pi(o_1, \ldots, o_T | q) = \prod_{t=1}^{T} \pi(o_t | q, o_{<t})$
  - $\log \frac{\pi_\theta(y|q)}{\pi_{ref}(y|q)} = \log \frac{\prod_{t=1}^{T} \pi_\theta(o_t | q, o_{<t})}{\prod_{t=1}^{T} \pi_{ref}(o_t | q, o_{<t})}$
  - $= \sum_{t=1}^{T} \log \pi_\theta(o_t | q, o_{<t}) - \sum_{t=1}^{T} \log \pi_{ref}(o_t | q, o_{<t})$
  - $= \sum_{t=1}^{T} \left[ \log \pi_\theta(o_t | q, o_{<t}) - \log \pi_{ref}(o_t | q, o_{<t}) \right]$
  - $= \sum_{t=1}^{T} \log \frac{\pi_\theta(o_t | q, o_{<t})}{\pi_{ref}(o_t | q, o_{<t})}$
- actor.kl_loss_coef: defaults to 0.001 (ppo_trainer.yaml)
- GRPO (use_kl_loss enabled): kl_loss_type: low_var_kl, i.e. the k3 estimator (a sketch below)
- algorithm.kl_penalty (=> algorithm.use_kl_in_reward): the in-reward KL penalty
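A minimal sketch of the per-token KL estimators referenced above: k1 is the plain log-ratio, k3 is the low-variance, non-negative low_var_kl estimator (the function name and signature here are illustrative, not verl's exact kl_penalty code):

import torch

def kl_estimate(logprob: torch.Tensor, ref_logprob: torch.Tensor, kl_type: str) -> torch.Tensor:
    # logprob / ref_logprob: log pi_theta / log pi_ref of the sampled tokens, shape [batch, seq_len]
    if kl_type == "kl":  # k1: log(pi_theta / pi_ref), unbiased but can be negative
        return logprob - ref_logprob
    if kl_type == "low_var_kl":  # k3: r - log r - 1 with r = pi_ref / pi_theta, always >= 0
        log_ratio = ref_logprob - logprob
        return log_ratio.exp() - log_ratio - 1
    raise ValueError(f"unknown kl_type: {kl_type}")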
2 batch size
- Algorithmic metrics (train batch size, PPO mini-batch size) are global (from a single-controller perspective), normalized in each worker. See the normalization code.
- Performance-related parameters (micro batch size, max token length for dynamic batch size) are local parameters that define the per-GPU data allocations. See the normalization code.
- data.train_batch_size=32
  - the number of prompts per RL step
- actor_rollout_ref.rollout.n=8
  - how many responses are sampled per prompt (the GRPO group size)
  - generation produces train_batch_size * rollout.n sequences
- actor.ppo_epochs=1
- actor_rollout_ref.actor.ppo_mini_batch_size=16
  - train_batch_size // ppo_mini_batch_size => how many PPO updates per step
- actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8
  - this is the actual per-GPU batch size during PPO training
- forward-only passes (no grad, no loss):
  - actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=32
  - actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=32
  - these compute the log-probs needed for $\frac{\pi_{\theta}}{\pi_{ref}} = \exp(\log \pi_\theta - \log \pi_{ref})$
if not config.actor_rollout_ref.actor.use_dynamic_bsz:
    assert config.data.train_batch_size >= config.actor_rollout_ref.actor.ppo_mini_batch_size
    sp_size = config.actor_rollout_ref.actor.get('ulysses_sequence_parallel_size', 1)
    if config.actor_rollout_ref.actor.ppo_micro_batch_size is not None:
        assert config.actor_rollout_ref.actor.ppo_mini_batch_size % config.actor_rollout_ref.actor.ppo_micro_batch_size == 0
        assert config.actor_rollout_ref.actor.ppo_micro_batch_size * sp_size >= n_gpus
....
self.config.actor.ppo_mini_batch_size *= self.config.rollout.n
self.config.actor.ppo_mini_batch_size //= (self.device_mesh.size() // self.ulysses_sequence_parallel_size)
After this normalization (the example assumes 2 GPUs, i.e. a data-parallel size of 2):
- ppo_mini_batch_size = 16 * 8 / 2 = 64 (per GPU, now counted in sequences)
- gradient accumulation: ga = ppo_mini_batch_size / ppo_micro_batch_size_per_gpu = 64 / 8 = 8
- some constraints (the arithmetic is spelled out in the sketch below):
  - config.data.train_batch_size >= config.actor_rollout_ref.actor.ppo_mini_batch_size
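Spelling out the batch-size arithmetic above in plain Python (assuming 2 GPUs, as the /2 in the example implies; variable names are illustrative):

train_batch_size = 32               # prompts per RL step
rollout_n = 8                       # responses per prompt (GRPO group size)
n_gpus = 2                          # assumed device count in this walkthrough
ppo_mini_batch_size = 16            # global, counted in prompts
ppo_micro_batch_size_per_gpu = 8    # actual per-GPU forward/backward batch

num_sequences = train_batch_size * rollout_n                     # 256 generated sequences
mini_batch_per_gpu = ppo_mini_batch_size * rollout_n // n_gpus   # 16 * 8 / 2 = 64
grad_accum = mini_batch_per_gpu // ppo_micro_batch_size_per_gpu  # 64 / 8 = 8
print(num_sequences, mini_batch_per_gpu, grad_accum)             # 256 64 8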
3 Other (metrics)
$$\mathcal{L}_{\text{actor}}(\theta) = \mathcal{L}_{\text{PG}}(\theta) - c_1 \mathcal{L}_{\text{entropy}}(\theta) + c_2 \mathcal{L}_{\text{KL}}(\theta)$$
KL Loss
$$\log\frac{\pi_\theta}{\pi_{ref}} = \log\pi_\theta - \log \pi_{ref}$$
In general, kl_loss is greater than 0:
- kl_loss > 0: on average the current policy $\pi_\theta$ assigns the sampled response sequences higher probability than the reference policy $\pi_{ref}$ does. This is what we expect to see during PPO training, since the policy is learning to raise the probability of sequences that earn high reward.
- kl_loss < 0: on average the current policy assigns the sampled responses lower probability than the reference policy. This can appear transiently during optimization, or when the reference policy is itself already good at generating high-reward sequences.
entropy
$$H_t = H(\pi_{\theta}(\cdot | s, a_{<t})) = - \sum_{a'} \pi_{\theta}(a'|s, a_{<t}) \log \pi_{\theta}(a'|s, a_{<t})$$
- minimum 0, maximum $\log|V|$ (a uniform distribution over the vocabulary)
- High entropy: the distribution is flat; the model is uncertain which next token to choose and tends to explore randomly.
- Low entropy: the distribution is sharp; the model is very confident about one or a few tokens.
- The main purposes of adding an entropy loss (as a regularizer) in PPO training:
  - Encourage exploration: prevent the policy from converging prematurely to a local optimum by keeping some randomness, so more candidate response sequences get explored.
  - Prevent policy collapse: keep the policy network from becoming overly deterministic and emitting only a few fixed patterns, preserving generation diversity.
  - Note that the minus sign means the optimizer, while minimizing the total loss, effectively maximizes the entropy term, which encourages exploration.
- Early in PPO training the policy is still fairly random and entropy is high; as training proceeds and the policy sharpens, entropy tends to drop. entropy_coeff is there to keep entropy from dropping too fast or too low. (A sketch of computing per-token entropy follows.)
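The per-token entropy above is straightforward to compute from the logits; a minimal sketch:

import torch

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    # logits: [batch, seq_len, vocab_size] -> per-position entropy H_t, in [0, log |V|]
    logp = torch.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1)

# sanity check: uniform logits hit the maximum log |V|
V = 152064
print(token_entropy(torch.zeros(1, 1, V)))  # ~= log(152064) = 11.93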
4 Getting it running, and running fast
- actor_rollout_ref.model.use_remove_padding=True
- fsdp
  - actor_rollout_ref.model.enable_gradient_checkpointing=True
  - actor_rollout_ref.actor.fsdp_config.param_offload=False
  - actor_rollout_ref.actor.fsdp_config.optimizer_offload=False
- vllm >= 0.8
  - https://verl.readthedocs.io/en/latest/README_vllm0.8.html
16 [veRL] FSDP SFT trainer, SFT vs. RL, cross-entropy loss | loss mask | learning rate scheduler
- A good design/abstraction shared by deep learning frameworks:
  - Dataset, with __getitem__
    - veRL: SFTDataset, RLHFDataset
  - Trainer
    - veRL: FSDPSFTTrainer, RayPPOTrainer
import numpy as np
import matplotlib.pyplot as plt
data & prompt
from transformers import AutoTokenizer
T = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-7B-Instruct')
T.special_tokens_map
"""
{'eos_token': '<|im_end|>',
'pad_token': '<|endoftext|>',
'additional_special_tokens': ['<|im_start|>',
'<|im_end|>',
'<|object_ref_start|>',
'<|object_ref_end|>',
'<|box_start|>',
'<|box_end|>',
'<|quad_start|>',
'<|quad_end|>',
'<|vision_start|>',
'<|vision_end|>',
'<|vision_pad|>',
'<|image_pad|>',
'<|video_pad|>']}
"""
T.eos_token # '<|im_end|>'
sft_dataset.py: SFTDataset
- eos_token: <|im_end|> (151645)
  - "im" = instruct message (what turns a base model into an instruct model)
  - note it is NOT <|endoftext|> (the pad_token, 151643)
# apply chat template
prompt_chat = [{"role": "user", "content": prompt}]
# string
prompt_chat_str = tokenizer.apply_chat_template(prompt_chat, add_generation_prompt=True, tokenize=False)
response_chat_str = response + tokenizer.eos_token
input_ids = torch.cat((prompt_ids, response_ids), dim=-1)
attention_mask = torch.cat((prompt_attention_mask, response_attention_mask), dim=-1)
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello, what's the weather like today?<|im_end|>
<|im_start|>assistant
Hi! Please tell me which city you are in, and I can look up the weather for you.<|im_end|>
2 training process
fsdp_sft_trainer.py: "A lightweight one-file FSDP SFT Trainer"
- logits
  - shape: [batch_size, seq_len, vocab_size]
- aligning logits with labels
  - labels (the targets) are input_ids shifted left by one: labels = input_ids[:, 1:].contiguous()
  - shift_logits drops the prediction at the last time step (logits[..., :-1, :]) to align with labels:
    - shift_logits = logits[..., :-1, :].contiguous()
    - shift_labels = labels.contiguous()
- input_ids = [T1, T2, T3, T4]
  - labels: [T2, T3, T4]
  - shift_logits[..., 0, :] predicts T2
  - shift_logits[..., 1, :] predicts T3
  - shift_logits[..., 2, :] predicts T4
- per-token loss: nn.CrossEntropyLoss(reduction="none")
- loss mask: mask out the prompt & the eos-token position (151645, <|im_end|>)
loss_mask = attention_mask.clone()
if prompt_length > 1:
    # mask out prompt for SFT.
    loss_mask[: min(prompt_length, loss_mask.size(0)) - 1] = 0
# mask out the last token in response
loss_mask[min(prompt_length + response_length, loss_mask.size(0)) - 1] = 0
- the mask indicates which token predictions count toward the loss (e.g. in multi-turn dialogue, only the response part may be supervised, not the prompt). It excludes the last token, since the last token has no next token to predict.
- multi-turn
  - mask the prompts and supervise only the response parts, i.e. compute the loss only on response tokens (a minimal sketch of this masked loss follows)
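Putting the shift and the mask together, a minimal sketch of the masked cross-entropy (not verl's exact code; note the mask is indexed over input positions, where position i supervises token i+1):

import torch
import torch.nn.functional as F

def masked_sft_loss(logits: torch.Tensor, input_ids: torch.Tensor, loss_mask: torch.Tensor) -> torch.Tensor:
    # logits: [B, L, V]; input_ids, loss_mask: [B, L]
    shift_logits = logits[..., :-1, :].contiguous()   # drop the last position's prediction
    shift_labels = input_ids[..., 1:].contiguous()    # targets shifted left by one
    loss = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        reduction="none",
    )
    mask = loss_mask[..., :-1].contiguous().view(-1).float()  # position i predicts token i+1
    return (loss * mask).sum() / mask.sum()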
3 training loss
$$\text{CE} = -\log(P_{\text{true\_token}})$$
- NLL: negative log-likelihood
- suppose training reaches a loss level of 0.3:
  - $P_{\text{true\_token}} = \exp(-0.3) = 0.7408$
- for context, with Qwen's vocab size of 152064:
  - random guessing gives $\text{CE}_{\text{rand}} = -\log\left(\frac{1}{152064}\right) = 11.93$
- from the PPL (perplexity) perspective:
  - $PPL = \exp(\text{CE})$
  - CE = 0.3 => PPL = 1.35: as if, on average, the model has narrowed the next token down to only 1 to 2 (1.3499) equally likely options.
- increasing the batch size stabilizes training; training on single sentences makes the learning curve fluctuate
np.exp(-0.3), np.log(152064), np.exp(0.3), np.exp(11.93)
"""
(0.7408182206817179,
11.932056763842207,
1.3498588075760032,
151751.56167916086)
"""
- at the sentence level, for $W = w_1, w_2, \cdots, w_N$, the joint probability is
  $$p(W) = p(w_1) p(w_2|w_1) \cdots p(w_N|w_1, \cdots, w_{N-1})$$
- average CE:
  $$\text{avg CE} = -\frac{1}{N} \sum_{t=1}^N \log P(w_t|w_1, \cdots, w_{<t}) = -\frac{1}{N} \log P(W)$$
- ppl = exp(ce):
  $$PPL = P(W)^{-\frac{1}{N}} = \left(\exp(\log P(W))\right)^{-\frac{1}{N}} = \exp\left(-\frac{1}{N} \log P(W)\right)$$
4 why -log P
- the further P (the probability assigned to the correct token) falls below 1, the faster the loss grows, diverging as P approaches 0;
p_values = np.linspace(0.01, 1, 400)
neg_log_p_values = -np.log(p_values)
p_highlight = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
neg_log_p_highlight = -np.log(p_highlight)
plt.figure(figsize=(4, 8))
plt.plot(p_values, neg_log_p_values, label='-log(P)')
plt.scatter(p_highlight, neg_log_p_highlight, color='red', zorder=5)
for i, txt in enumerate(p_highlight):
plt.annotate(f'(p={txt:.1f}, y={neg_log_p_highlight[i]:.2f})',
(p_highlight[i], neg_log_p_highlight[i]),
textcoords="offset points",
xytext=(0,10),
ha='center')
plt.title('Plot of -log(p)')
plt.xlabel('p')
plt.ylabel('-log(p)')
plt.grid(False)
plt.legend()
5 overfitting
- keep monitoring both the training losses & val losses
- training loss keeps decreasing, while val loss first decreases and then rises;
- in my experience, around 2 epochs is where val loss bottoms out; after that it climbs back up;
6 learning rate scheduler
- AdamW, lr=1e-5
- get_cosine_schedule_with_warmup
import math

def get_lr_at_step(
    current_step: int,
    initial_lr: float,
    num_warmup_steps: int,
    num_training_steps: int,
    min_lr_ratio: float = 0.0,
    num_cycles: float = 0.5,
):
    assert min_lr_ratio >= 0 and min_lr_ratio <= 1.0
    # coef = 0.5 when min_lr_ratio = 0
    coef = (1 - min_lr_ratio) * 0.5
    # intercept = 0.5 when min_lr_ratio = 0
    intercept = (1 + min_lr_ratio) * 0.5
    if current_step < num_warmup_steps:
        # Linear warmup
        scale = float(current_step) / float(max(1, num_warmup_steps))
    else:
        # Cosine decay
        progress = float(current_step - num_warmup_steps) / float(max(1, num_training_steps - num_warmup_steps))
        x = math.cos(math.pi * float(num_cycles) * 2.0 * progress)
        scale = max(0.0, x * coef + intercept)  # Ensure learning rate is not negative
    return initial_lr * scale
- cos:
  - cos(pi * 0) = 1, cos(pi * 1) = -1
  - as progress goes from 0 to 1 (i.e. the angle goes 0 -> pi), x (the result of math.cos(...)) decreases smoothly from 1 to -1: the "half cosine cycle" from peak to trough.
  - 0.5 * x + 0.5 then maps this onto [1, 0]: the scale starts at 1 (0.5 + 0.5) and decays to 0.
initial_lr = 1e-5
# warmup_ratio = 0.1
num_warmup_steps = 7
total_steps = 72 # This is num_training_steps in the function context
lr_values = [get_lr_at_step(step, initial_lr, num_warmup_steps, total_steps) for step in range(1, total_steps+1)]
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 4))
plt.plot(range(1, total_steps+1), lr_values, marker='o', linestyle='-')
plt.xlabel("Steps")
plt.ylabel("Learning Rate")
plt.title(f"Cosine Schedule with Warmup (LR={initial_lr}, Warmup={num_warmup_steps}, Total={total_steps})")
plt.grid(True)
plt.xticks(list(range(0, total_steps, 5))) # Adjust x-axis ticks for better readability
plt.axvline(x=num_warmup_steps , color='r', linestyle='--', label=f'Warmup End (Step {num_warmup_steps-1})')
plt.legend()
7 SFT vs. RL
- SFT: token-level supervised learning over the response tokens;
- RL: the model itself produces (rolls out) complete responses, and a reward model supplies the reward signal;
- the KL loss against the RL ref model keeps the policy from drifting too far from the original model; it feels like a targeted enhancement of specific abilities
- SFT cares about the loss; RL cares about the reward
  - in RL we don't really watch for the loss to go down; in a sense we even welcome the (KL) loss rising, because that means the policy is exploring;
- SFT memorization, RL generalization
- SFT gives token-level 0/1 supervision; RL gives sequence-level discrete rewards.
- https://qiankunli.github.io/2024/07/28/llm_finetune_practice.html
- https://zhuanlan.zhihu.com/p/26370587517
17 [veRL] fsdp sft trainer follow-up: teacher forcing, shift labels / shift logits, loss mask
In terms of code readability, verl's SFT implementation is much better than trl's.
- video: https://www.bilibili.com/video/BV1eWjtzbEdP
- code: https://github.com/chunhuizhang/llm_rl/blob/main/tutorials/infra/verl/verl_sft_supp.ipynb
labels = input_ids[:, 1:].contiguous()
output = self.fsdp_model(input_ids=input_ids, attention_mask=attention_mask, position_ids=position_ids, use_cache=False)
$$p(W) = p(w_1, w_2, \cdots, w_N) = p(w_1) p(w_2|w_1) \cdots p(w_N|w_1, \cdots, w_{N-1})$$
- During SFT training, the model does not need to do any decoding / (auto-regressive) generation
  - the entire response is training data; here input_ids = prompt + response + eos_token
  - the forward pass just computes, at each position, the standard logits distribution over the whole vocabulary (for predicting the next token)
  - https://www.bilibili.com/video/av1005936005/ (the episode on PPL)
  - (batch_size, sequence_length) => (batch_size, sequence_length, vocab_size)
- Teacher forcing:
  - at every training step, the ground-truth target token is fed as the model's next input, instead of the model's own previous prediction.
  - like a teacher who corrects the student at every step of practice and gives the right answer, so the student continues learning from the correct answer.
  - learning the piano:
    - without teacher forcing: you play a note, then decide the next note based on the one you actually played. One wrong note can derail the whole melody, and it may take a long time to get back on track.
    - with teacher forcing: you play a note, and whether or not it was right, the teacher immediately tells you the correct next note from the score and has you continue from it. You learn the correct way to play the whole piece much faster.
- If sequence_length (prompt + response) is N, the so-called shift logits / shift labels (mentioned in section 16 above):
  - labels[1:] has length N-1; logits[:-1] has length N-1: lengths aligned, loss computed token by token
  - labels[1:]: a left shift, so each position predicts the next token
  - logits[:-1]: drop the last position, since the logits at eos have no next token to predict
- During trl/swift training, besides tracking the loss, token accuracy is also tracked
  - argmax over the logits, compared against the labels
- prompt: 你好吗?
- response: 我很好,谢谢你!EOS
- EOS: <|im_end|>, marking the end of a message: the end of the (user) prompt, and the end of the (assistant) response
| inputs | ? | 我 | 很 | 好 | , | 谢 | 谢 | 你 | ! | EOS |
|---|---|---|---|---|---|---|---|---|---|---|
| labels | 我 | 很 | 好 | , | 谢 | 谢 | 你 | ! | EOS | × |
- the EOS input position gets no label (×): logits[:-1] drops it, since no loss is computed for the prediction made at EOS
- teacher forcing here involves no auto-regressive decoding/generation at all
- the whole response, including the eos token, is supervised and counted in the loss
Oddly enough, some libraries also compute loss on the prompt tokens, while others skip the prompt loss.
import torch
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-0.5B-Instruct')
tokenizer.special_tokens_map
"""
{'eos_token': '<|im_end|>',
'pad_token': '<|endoftext|>',
'additional_special_tokens': ['<|im_start|>',
'<|im_end|>',
'<|object_ref_start|>',
'<|object_ref_end|>',
'<|box_start|>',
'<|box_end|>',
'<|quad_start|>',
'<|quad_end|>',
'<|vision_start|>',
'<|vision_end|>',
'<|vision_pad|>',
'<|image_pad|>',
'<|video_pad|>']}
"""
tokenizer.eos_token, tokenizer.pad_token_id, tokenizer.encode(['<|im_start|>', '<|im_end|>'])
"""
('<|im_end|>', 151643, [151644, 151645])
"""
Note that eos is <|im_end|>, not some other marker (e.g. <|endoftext|>, which is actually the padding token, id 151643).
prompt = '你好吗?'
response = '我很好,谢谢你!'
prompt_chat = [{"role": "user", "content": prompt}]
- previous episodes on chat templates:
  - https://www.bilibili.com/video/BV1LKXSYqE3T/
  - https://www.bilibili.com/video/BV1dsdWYuEXw/
  - https://www.bilibili.com/video/BV1JZLcz4EUC/
prompt_chat_str = tokenizer.apply_chat_template(prompt_chat, add_generation_prompt=True, tokenize=False)
response_chat_str = response + tokenizer.eos_token
prompt_chat_str
The output is:
<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
你好吗?<|im_end|>
<|im_start|>assistant
Compare with applying the chat template without add_generation_prompt:
print(tokenizer.apply_chat_template(prompt_chat, tokenize=False))
The output is:
<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
你好吗?<|im_end|>
1 tokenize
prompt_ids_output = tokenizer(prompt_chat_str, return_tensors="pt", add_special_tokens=False)
prompt_ids = prompt_ids_output["input_ids"][0]
prompt_attention_mask = prompt_ids_output["attention_mask"][0]
prompt_ids, prompt_attention_mask
"""
(tensor([151644, 8948, 198, 2610, 525, 1207, 16948, 11, 3465,
553, 54364, 14817, 13, 1446, 525, 264, 10950, 17847,
13, 151645, 198, 151644, 872, 198, 108386, 101037, 11319,
151645, 198, 151644, 77091, 198]),
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1]))
"""
tokenizer.decode([108386, 101037, 11319]) # 你好吗?
response_ids_output = tokenizer(response_chat_str, return_tensors="pt", add_special_tokens=False)
response_ids = response_ids_output["input_ids"][0]
response_attention_mask = response_ids_output["attention_mask"][0]
response_ids # tensor([ 35946, 101243, 3837, 116642, 6313, 151645])
tokenizer.decode([35946, 101243, 3837, 116642, 6313, 151645]) # 我很好,谢谢你!<|im_end|>
2 padding
prompt_length = prompt_ids.shape[0]
response_length = response_ids.shape[0]
prompt_length, response_length # (32, 6)
input_ids = torch.cat((prompt_ids, response_ids), dim=-1)
attention_mask = torch.cat((prompt_attention_mask, response_attention_mask), dim=-1)
input_ids.shape, attention_mask.shape # (torch.Size([38]), torch.Size([38]))
sequence_length = input_ids.shape[0]
max_length = 40
padded_input_ids = torch.ones(size=(max_length - sequence_length,), dtype=input_ids.dtype) * tokenizer.pad_token_id
padded_attention_mask = torch.zeros(size=(max_length - sequence_length,), dtype=attention_mask.dtype)
padded_input_ids, padded_attention_mask # (tensor([151643, 151643]), tensor([0, 0]))
input_ids = torch.cat((input_ids, padded_input_ids))
attention_mask = torch.cat((attention_mask, padded_attention_mask))
input_ids, attention_mask
"""
(tensor([151644, 8948, 198, 2610, 525, 1207, 16948, 11, 3465,
553, 54364, 14817, 13, 1446, 525, 264, 10950, 17847,
13, 151645, 198, 151644, 872, 198, 108386, 101037, 11319,
151645, 198, 151644, 77091, 198, 35946, 101243, 3837, 116642,
6313, 151645, 151643, 151643]),
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]))
"""
def compute_position_id_with_mask(mask):
return torch.clip(torch.cumsum(mask, dim=-1) - 1, min=0, max=None)
position_ids = compute_position_id_with_mask(attention_mask)
position_ids
"""
tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
        18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
        36, 37, 37, 37])
"""
loss_mask = attention_mask.clone()
if prompt_length > 1:
    # mask out prompt for SFT; the last prompt token stays unmasked, since it predicts the first response token
    loss_mask[: min(prompt_length, loss_mask.size(0)) - 1] = 0
# mask out the last token in response
loss_mask[min(prompt_length + response_length, loss_mask.size(0)) - 1] = 0
min(prompt_length + response_length, loss_mask.size(0))  # 38
# the whole response, including the eos token, is supervised
loss_mask
"""
tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0])
"""