Is there an active community putting out such one-time evaluations for other domains?
@yhyu13 @vishaal27 I believe chat arena is one of the best dynamic benchmarks. It reflects people's preferences through real votes.
Besides, Dynabench is also a good approach. It can help prevent overfitting to benchmarks.
The only question is: would these benchmarks be considered authoritative standards?
This is one of the proposals in your paper. It might be easy for coding/math problems, as they can be generated from an almost infinite number of combinations.
Is there an active community putting out such one-time evaluations for other domains?