benchmark consistency check #479
I have a hacked-together consistency checker that produces something like this from the journal.
The code is terrible and it doesn't check everything, but it does eliminate some of the tedium of making sure a run is good.
@wpietri do you want me to work on this and add some checking?
The journal format is merged. The documentation will be merged soon. Data flow: prompt -> SUT -> annotator -> scoring -> JSON result file. Approach: identify where things can go wrong or fail, and verify each of those stages.
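As a starting point, here is a minimal sketch of loading the journal and tallying entries per stage. It assumes the journal is a JSONL file in which each entry carries a "message" field naming its stage; the file name and field name are assumptions, not the documented schema.

```python
# Minimal sketch: load the run journal and count entries per pipeline stage.
# Assumes one JSON object per line with a "message" field naming the stage;
# these are placeholders for whatever the real journal schema uses.
import json
from collections import Counter
from pathlib import Path


def load_journal(path: Path) -> list[dict]:
    """Parse one journal entry per non-empty line."""
    with path.open() as f:
        return [json.loads(line) for line in f if line.strip()]


def stage_counts(entries: list[dict]) -> Counter:
    """Tally how many journal entries each stage produced."""
    return Counter(entry["message"] for entry in entries)


entries = load_journal(Path("journal.jsonl"))
print(stage_counts(entries))
```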
TBD: where the consistency check job runs. Currently it runs on William's machine. |
Consistency checks run at the SUT, annotator, and item level (a code sketch follows this list).

SUT
- test items == (sut cached + sut fetched)
- (frac_safe x test_items) should == measured safe
- Report how many test items, how many cached, how many measured, etc.
- Compare the calculated frac_safe with the actual measured safe count; they should match.

ANNOTATOR
- Same approach.
- Cached should be a small number, ideally 0; a cached response means the prompt was a dupe.
- cached + fetched == sum(raw columns) == ann_translated == SUT test items
- William's code parses the annotation response and evaluates safety itself, to compare with the production safety. E.g., depending on the annotator, it looks for the strings "safe" or "unsafe" in the response, or "true"/"false", etc.

ITEM (prompt, annotator, scoring)
- William's code replicates the voting logic and checks whether it matches the annotator response.

TBD: what features are missing from William's first pass.
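Below is a minimal sketch of the SUT- and annotator-level count checks above, assuming the per-stage tallies come from a Counter over journal entries. The message names ("cached sut response", "fetched sut response", etc.) and argument names are assumptions about the schema, not the real identifiers.

```python
# Sketch of the SUT- and annotator-level count checks. Journal message
# names and argument names are assumed placeholders, not the real schema.
from collections import Counter


def check_sut(counts: Counter, test_items: int, frac_safe: float,
              measured_safe: int) -> list[str]:
    """Return human-readable inconsistencies at the SUT level (empty == OK)."""
    problems = []
    responses = counts["cached sut response"] + counts["fetched sut response"]
    if responses != test_items:
        problems.append(f"SUT responses ({responses}) != test items ({test_items})")
    expected_safe = round(frac_safe * test_items)
    if expected_safe != measured_safe:
        problems.append(
            f"frac_safe implies {expected_safe} safe items, "
            f"but {measured_safe} were measured")
    return problems


def check_annotator(counts: Counter, test_items: int) -> list[str]:
    """Annotator-level checks: cached should be ~0, and totals should line up."""
    problems = []
    cached = counts["cached annotator response"]
    fetched = counts["fetched annotator response"]
    if cached > 0:
        problems.append(f"{cached} cached annotations (duplicate prompts?)")
    if cached + fetched != counts["translated annotation"]:
        problems.append("annotations != translated annotations")
    if cached + fetched != test_items:
        problems.append("annotations != SUT test items")
    return problems
```

In practice each check would also print its counts, so a human can eyeball the numbers even when everything matches.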
Build something that properly and clearly checks the consistency between the benchmark JSON files and the entries in the run journal, making sure the numbers at each stage also make sense. Basically, try to emulate a human who is making sure a benchmark run is good.
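A hedged sketch of that end-to-end comparison might look like the following; the JSON keys ("num_items", "frac_safe") and the tally names are hypothetical, standing in for whatever the benchmark result files actually contain.

```python
# Sketch: compare the benchmark JSON result file against the journal
# tallies. JSON keys and the tally dict are hypothetical placeholders.
import json
from pathlib import Path


def check_result_file(result_path: Path, tallies: dict) -> list[str]:
    """Flag disagreements between the published results and the journal."""
    result = json.loads(result_path.read_text())
    problems = []
    if result["num_items"] != tallies["test_items"]:
        problems.append("result file item count disagrees with the journal")
    expected_safe = round(result["frac_safe"] * result["num_items"])
    if expected_safe != tallies["measured_safe"]:
        problems.append("result file frac_safe disagrees with measured safe count")
    return problems
```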