
[BUG] Cannot reproduce Qwen1.5-7B base model's reported score 62.5 on gsm8k #1321

Open
2 tasks done
StevenLau6 opened this issue Oct 9, 2024 · 1 comment

@StevenLau6
Is there an existing issue / discussion for this?

  • I have searched the existing issues / discussions

Is there an existing answer for this in the FAQ?

  • I have searched the FAQ

Current Behavior

I used the script eval/evaluate_gsm8k.py to evaluate the Qwen1.5-7B base model downloaded from Hugging Face.
The results show that Qwen1.5-7B base gets Acc: 0.4457922668 on gsm8k, which is much lower than the reported score of 62.5 (https://huggingface.co/Qwen/Qwen2-7B).
However, Qwen1.5-1.8B base gets Acc: 0.382865807, which is close to the reported score of 38.4 (https://huggingface.co/Qwen/Qwen2-1.5B).
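For context, gsm8k accuracy in this kind of evaluation is the exact-match rate between the final number extracted from the model's completion and the number after "####" in the reference answer. A minimal sketch of that scoring logic (the function name and fallback behavior are illustrative assumptions, not the actual code of evaluate_gsm8k.py):

```python
import re

def extract_answer(text):
    """Return the final number in a gsm8k-style answer string, or None."""
    # gsm8k references end with '#### <number>'; model completions may not,
    # so fall back to the last number that appears anywhere in the text.
    match = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", text)
    if match is not None:
        number = match.group(1)
    else:
        numbers = re.findall(r"-?[\d,]+(?:\.\d+)?", text)
        number = numbers[-1] if numbers else None
    return number.replace(",", "") if number else None

# Accuracy is the fraction of items whose extracted numbers match exactly
preds = ["So the answer is 72", "#### 10"]
refs = ["#### 72", "#### 11"]
acc = sum(extract_answer(p) == extract_answer(r) for p, r in zip(preds, refs)) / len(refs)
print(acc)  # 0.5
```

If the extraction regex differs from the one the reported numbers were produced with (e.g. stricter "####"-only matching), the measured accuracy can shift by several points, which is one mundane explanation worth ruling out before suspecting the checkpoint itself.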

Another strange thing is that the Qwen1.5-7B-Chat model gets 60.3 on gsm8k (https://huggingface.co/Qwen/Qwen2-7B-Instruct), which is lower than the base model's reported score.

I would like to know whether there is a typo in the reported score, or whether the base model was fine-tuned on the gsm8k training set before being evaluated on the test set.

Expected Behavior

Reproduce the Qwen1.5-7B base model's reported score of 62.5 on gsm8k.

Steps To Reproduce

I downloaded the gsm8k test set from https://github.com/openai/grade-school-math/tree/master/grade_school_math/data and verified that its content is identical to the Hugging Face parquet files at https://huggingface.co/datasets/openai/gsm8k/tree/main/main.

The few-shot prompt (from https://github.com/QwenLM/Qwen/blob/main/eval/gsm8k_prompt.txt) is correctly added.

I only modified these three lines:

1. sent = tokenizer.tokenizer.decode(tokens[raw_text_len:]) -> sent = tokenizer.decode(tokens[raw_text_len:])
2. input_ids = tokenizer.tokenizer.encode(input_txt) -> input_ids = tokenizer.encode(input_txt)
3. dataset = load_from_disk(args.sample_input_file) ->
   data_files = {'train': args.sample_input_file + 'train.json', 'test': args.sample_input_file + 'test.json'}
   dataset = load_dataset('json', data_files=data_files)
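The data-loading change can be sketched with the standard library alone, which makes it easy to sanity-check the split files independently of the evaluation script. The function name, directory layout, and JSONL assumption below are illustrative, not code from evaluate_gsm8k.py:

```python
import json
import os
import tempfile

def load_gsm8k_jsonl(sample_input_dir):
    """Load gsm8k-style train/test JSONL files into plain Python lists."""
    # Mirrors the patched data_files mapping: one file per split.
    data_files = {
        "train": os.path.join(sample_input_dir, "train.json"),
        "test": os.path.join(sample_input_dir, "test.json"),
    }
    dataset = {}
    for split, path in data_files.items():
        with open(path, encoding="utf-8") as f:
            # The upstream gsm8k files store one JSON object per line (JSONL)
            dataset[split] = [json.loads(line) for line in f if line.strip()]
    return dataset

# Minimal usage example with a throwaway directory and one fake record per split
with tempfile.TemporaryDirectory() as d:
    for split in ("train", "test"):
        with open(os.path.join(d, f"{split}.json"), "w", encoding="utf-8") as f:
            f.write(json.dumps({"question": "1+1?", "answer": "#### 2"}) + "\n")
    ds = load_gsm8k_jsonl(d)
    print(len(ds["test"]))  # 1
```

One thing worth checking with either loader is that each record still has the expected "question" and "answer" fields after loading, since a silently wrong split (e.g. train swapped for test) would also change the measured accuracy.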

Environment

My environment: ubuntu 18.04
Tesla V100-SXM2-32GB * 8

python                    3.10.14
torch                     2.2.0
transformers              4.41.2
CUDA: 12.1

Anything else?

No response

@TanateT commented Oct 10, 2024

I also encountered the same issue. What do I need to modify?
