
[BUG] Cannot reproduce Qwen1.5-7B base model's reported score 62.5 on gsm8k #1321

Open
2 tasks done
StevenLau6 opened this issue Oct 9, 2024 · 1 comment

@StevenLau6
Is there an existing issue / discussion for this?

  • I have searched the existing issues / discussions

Is there an existing answer for this in the FAQ?

  • I have searched the FAQ

Current Behavior

I used the script eval/evaluate_gsm8k.py to evaluate the Qwen1.5-7B base model downloaded from Hugging Face.
The results show that Qwen1.5-7B base gets Acc: 0.4457922668 on gsm8k, which is much lower than the reported score of 62.5 (https://huggingface.co/Qwen/Qwen2-7B).
However, Qwen1.5-1.8B base gets Acc: 0.382865807, which is close to the reported score of 38.4 (https://huggingface.co/Qwen/Qwen2-1.5B).
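For context, gsm8k accuracy in this kind of evaluation is the exact-match rate between the final number extracted from the model's completion and the number after "####" in the reference answer. A minimal sketch of that scoring logic (the function name and fallback behavior are illustrative assumptions, not the actual code of evaluate_gsm8k.py):

```python
import re

def extract_answer(text):
    """Return the final number in a gsm8k-style answer string, or None."""
    # gsm8k references end with '#### <number>'; model completions may not,
    # so fall back to the last number that appears anywhere in the text.
    match = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", text)
    if match is not None:
        number = match.group(1)
    else:
        numbers = re.findall(r"-?[\d,]+(?:\.\d+)?", text)
        number = numbers[-1] if numbers else None
    return number.replace(",", "") if number else None

# Accuracy is the fraction of items whose extracted numbers match exactly
preds = ["So the answer is 72", "#### 10"]
refs = ["#### 72", "#### 11"]
acc = sum(extract_answer(p) == extract_answer(r) for p, r in zip(preds, refs)) / len(refs)
print(acc)  # 0.5
```

If the extraction regex differs from the one the reported numbers were produced with (e.g. stricter "####"-only matching), the measured accuracy can shift by several points, which is one mundane explanation worth ruling out before suspecting the checkpoint itself.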

Another strange thing is that the Qwen1.5-7B-Chat model gets 60.3 on gsm8k (https://huggingface.co/Qwen/Qwen2-7B-Instruct), which is lower than the base model's reported score.

I would like to know whether there is a typo in the reported score, or whether the base model was fine-tuned on the gsm8k training set before being evaluated on the test set.

Expected Behavior

Reproduce the Qwen1.5-7B base model's reported score of 62.5 on gsm8k.

Steps To Reproduce

I downloaded the gsm8k test set from https://github.com/openai/grade-school-math/tree/master/grade_school_math/data and verified that its content is identical to the Hugging Face parquet files at https://huggingface.co/datasets/openai/gsm8k/tree/main/main.

The few-shot prompt (from https://github.com/QwenLM/Qwen/blob/main/eval/gsm8k_prompt.txt) is correctly added.

I only modified these three lines:

1. sent = tokenizer.tokenizer.decode(tokens[raw_text_len:]) -> sent = tokenizer.decode(tokens[raw_text_len:])
2. input_ids = tokenizer.tokenizer.encode(input_txt) -> input_ids = tokenizer.encode(input_txt)
3. dataset = load_from_disk(args.sample_input_file) ->
   data_files = {'train': args.sample_input_file + 'train.json', 'test': args.sample_input_file + 'test.json'}
   dataset = load_dataset('json', data_files=data_files)
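The data-loading change can be sketched with the standard library alone, which makes it easy to sanity-check the split files independently of the evaluation script. The function name, directory layout, and JSONL assumption below are illustrative, not code from evaluate_gsm8k.py:

```python
import json
import os
import tempfile

def load_gsm8k_jsonl(sample_input_dir):
    """Load gsm8k-style train/test JSONL files into plain Python lists."""
    # Mirrors the patched data_files mapping: one file per split.
    data_files = {
        "train": os.path.join(sample_input_dir, "train.json"),
        "test": os.path.join(sample_input_dir, "test.json"),
    }
    dataset = {}
    for split, path in data_files.items():
        with open(path, encoding="utf-8") as f:
            # The upstream gsm8k files store one JSON object per line (JSONL)
            dataset[split] = [json.loads(line) for line in f if line.strip()]
    return dataset

# Minimal usage example with a throwaway directory and one fake record per split
with tempfile.TemporaryDirectory() as d:
    for split in ("train", "test"):
        with open(os.path.join(d, f"{split}.json"), "w", encoding="utf-8") as f:
            f.write(json.dumps({"question": "1+1?", "answer": "#### 2"}) + "\n")
    ds = load_gsm8k_jsonl(d)
    print(len(ds["test"]))  # 1
```

One thing worth checking with either loader is that each record still has the expected "question" and "answer" fields after loading, since a silently wrong split (e.g. train swapped for test) would also change the measured accuracy.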

Environment

My environment: ubuntu 18.04
Tesla V100-SXM2-32GB * 8

python                    3.10.14
torch                     2.2.0
transformers              4.41.2
CUDA: 12.1

Anything else?

No response

@TanateT commented Oct 10, 2024

I also encountered the same issue. What do I need to modify?
