
One-time evaluation? Sounds like making final exams for students when we are at school #1

Open
yhyu13 opened this issue Nov 15, 2023 · 3 comments

Comments

@yhyu13

yhyu13 commented Nov 15, 2023

We propose to build fresh one-time questions to evaluate LLMs instead of relying on static benchmarks.

This is one of the proposals in your paper. It might be easy for coding/math problems, since they can be generated from an almost infinite number of combinations.
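For example, here is a minimal sketch (just an illustration, not something from the paper) of how fresh arithmetic questions could be generated per evaluation run, so each run sees unseen items:

```python
# Sketch: procedurally generate fresh arithmetic questions for a one-time eval.
import random

def make_arithmetic_question(seed: int) -> tuple[str, int]:
    """Return a (question, answer) pair drawn from a huge combination space."""
    rng = random.Random(seed)
    a, b = rng.randint(100, 9999), rng.randint(100, 9999)
    op = rng.choice(["+", "-", "*"])
    answer = {"+": a + b, "-": a - b, "*": a * b}[op]
    return f"What is {a} {op} {b}?", answer

def build_one_time_eval(n: int, run_id: int) -> list[tuple[str, int]]:
    """Each run_id yields a different, never-reused question set."""
    return [make_arithmetic_question(run_id * 1_000_000 + i) for i in range(n)]

if __name__ == "__main__":
    for question, answer in build_one_time_eval(n=3, run_id=42):
        print(question, "->", answer)
```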

Is there an active community working on such one-time evaluations for other domains?

@vishaal27

This ICCV paper explores topics along similar lines: https://arxiv.org/abs/2212.02774

@andy-yang-1
Collaborator

Is there an active community working on such one-time evaluations for other domains?

@yhyu13 @vishaal27 I believe chat arena is one of the best dynamic benchmarks: it reflects people's preferences through real votes.
Besides, Dynabench is a good approach; it can also help prevent overfitting to static benchmarks.
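As a rough illustration of how pairwise votes can be turned into a model ranking, here is a simplified Elo-style update (just a sketch; the arena's actual rating methodology may differ):

```python
# Sketch: online Elo updates from pairwise human preference votes.
from collections import defaultdict

K = 32                                   # update step size
ratings = defaultdict(lambda: 1000.0)    # every model starts at 1000

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def record_vote(model_a: str, model_b: str, a_wins: bool) -> None:
    """Update both models' ratings from one human preference vote."""
    e_a = expected_score(ratings[model_a], ratings[model_b])
    s_a = 1.0 if a_wins else 0.0
    ratings[model_a] += K * (s_a - e_a)
    ratings[model_b] += K * ((1.0 - s_a) - (1.0 - e_a))

# Example: three votes between two hypothetical models.
for winner_is_a in (True, True, False):
    record_vote("model-a", "model-b", winner_is_a)
print(dict(ratings))
```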

@yhyu13
Author

yhyu13 commented Nov 17, 2023

Is there an active community working on such one-time evaluations for other domains?

@yhyu13 @vishaal27 I believe chat arena is one of the best dynamic benchmarks: it reflects people's preferences through real votes. Besides, Dynabench is a good approach; it can also help prevent overfitting to static benchmarks.

The only question is: would these benchmarks be considered authoritative standards?
