One-time evaluation? Sounds like setting final exams for students at school #1

yhyu13 · 2023-11-15T13:16:43Z

We propose to build fresh one-time questions to evaluate LLMs instead of relying on static benchmarks.

This is one of the proposals in your paper. It might be easy for coding/math problems, since they can be generated from an almost infinite number of combinations.
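To illustrate the point about combinations, here is a minimal sketch of how fresh math questions could be generated per evaluation round. The function names `make_question` and `make_exam` and the question template are hypothetical, not taken from the paper. Even this single template yields 900 × 900 × 8 = 6,480,000 distinct questions.

```python
import random


def make_question(rng):
    """Build one arithmetic question and its answer from random operands.

    Hypothetical template for illustration only; not the paper's method.
    """
    a = rng.randint(100, 999)  # 900 possible values
    b = rng.randint(100, 999)  # 900 possible values
    c = rng.randint(2, 9)      # 8 possible values
    question = f"What is ({a} + {b}) * {c}?"
    answer = (a + b) * c
    return question, answer


def make_exam(seed, n=5):
    """Generate n question/answer pairs from a seed.

    The same seed reproduces the same exam, so a round can be audited later;
    a new seed gives a fresh exam.
    """
    rng = random.Random(seed)
    return [make_question(rng) for _ in range(n)]


if __name__ == "__main__":
    for question, answer in make_exam(seed=2023):
        print(question, "->", answer)
```

Using a new seed for each round gives questions a model is unlikely to have seen verbatim, while keeping the seed lets others regenerate and check that round.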

Is there an active community producing such one-time evaluations for other domains?

vishaal27 · 2023-11-15T19:25:27Z

This ICCV paper explores topics on similar lines: https://arxiv.org/abs/2212.02774

andy-yang-1 · 2023-11-16T18:23:34Z

Is there an active community producing such one-time evaluations for other domains?

@yhyu13 @vishaal27 I believe Chatbot Arena is one of the best dynamic benchmarks. It reflects people's preferences through real voting. Dynabench is also a good approach, and it can help prevent overfitting to benchmarks.

yhyu13 · 2023-11-17T04:18:33Z

Is there an active community producing such one-time evaluations for other domains?

@yhyu13 @vishaal27 I believe Chatbot Arena is one of the best dynamic benchmarks. It reflects people's preferences through real voting. Dynabench is also a good approach, and it can help prevent overfitting to benchmarks.

The only question is whether these benchmarks would be considered authoritative standards.
