
Evaluation metrics #4

Open
mehdidc opened this issue Apr 26, 2023 · 6 comments

mehdidc (Contributor) commented Apr 26, 2023

Would be great to have (optional) model evaluation.
Possibilities:

  • CLIP score (e.g. on a reference set of captions like the ones from Parti); a rough sketch is included below
  • FID, or inception distance in general, where we could use other models such as CLIP to extract features, since Inception is ImageNet-specific
  • Possibly the recent ImageReward https://arxiv.org/abs/2304.05977, which relies on a model trained on human rankings and is quite easy to use; they are also planning to grow the ranking dataset
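
A minimal sketch of what the CLIP-score piece could look like, assuming open_clip is used for scoring (the model name and pretrained tag below are just examples, not something this repo prescribes):

```python
# Hypothetical CLIP-score helper: cosine similarity between image and caption
# embeddings, scaled by 100 as is conventional for CLIPScore.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

@torch.no_grad()
def clip_score(image_path: str, caption: str) -> float:
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    text = tokenizer([caption])
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return 100.0 * (img_feat * txt_feat).sum(dim=-1).item()
```
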
vkramanuj (Contributor) commented:

Good points. The default captions used by validate_and_save_model are from https://github.com/j-min/DallEval, with the intent of eventually adding automatic validation to this repo. There are some options for merging this into the repo:

  1. Introduce these metrics in the validate_and_save function. I am partially against this because CLIP score and FID both involve loading other models/datasets, which would complicate the config, increase GPU memory consumption, and add complexity to the main train script.
  2. Set up an asynchronous function that "watches" the output examples folder that validate_and_save writes, then computes FID/CLIP score when there's an update (sketched below). This would run on a separate node/set of GPUs from the train script and would be invoked by the user separately.

Which do you think is a better option? Also thanks for the pointer to ImageReward, I will look into it!
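
A rough sketch of the option-2 watcher, just to make the idea concrete; the directory layout and the compute_metrics hook are hypothetical placeholders, not the repo's actual interface:

```python
# Hypothetical watcher loop: poll the samples directory that validate_and_save
# writes to and run the metric pipeline on any step folder we haven't seen yet.
import time
from pathlib import Path

def watch(samples_dir: str, compute_metrics, poll_seconds: int = 60):
    seen = set()
    root = Path(samples_dir)
    while True:
        for step_dir in sorted(p for p in root.iterdir() if p.is_dir()):
            if step_dir not in seen:
                compute_metrics(step_dir)  # e.g. FID / CLIP score, logged to wandb
                seen.add(step_dir)
        time.sleep(poll_seconds)
```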

mehdidc (Contributor, Author) commented Apr 28, 2023

I would also go for option 2, at least for now, both for the reasons you mention and because we would need to distribute the metric computation across GPUs as well; otherwise only rank zero would be used while the other GPUs wait.
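
To illustrate the rank-zero point, here is one way the metric computation could be sharded across ranks instead of running only on rank 0; score_one is a placeholder for any per-sample metric (CLIP score, a reward model, ...):

```python
# Hypothetical sketch: each rank scores its own shard of the samples, then the
# partial sums are all-reduced so every rank ends up with the global mean.
import torch
import torch.distributed as dist

def distributed_mean_score(samples, score_one):
    rank, world_size = dist.get_rank(), dist.get_world_size()
    shard = samples[rank::world_size]
    partial = torch.tensor(
        [sum(score_one(s) for s in shard), float(len(shard))], device="cuda"
    )
    dist.all_reduce(partial, op=dist.ReduceOp.SUM)
    total_score, total_count = partial.tolist()
    return total_score / total_count
```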

vkramanuj (Contributor) commented Apr 28, 2023

Sounds good to me; it will take me some time to implement this. Let me know if you'd like to take some part of the PR. I see 3 direct parts:

  1. Integration of FID score evaluation (with https://github.com/j-min/DallEval).
  2. CLIPScore evaluation + possibly ImageReward.
  3. A watcher that runs a given evaluation pipeline, which would need to sync to the same wandb run as the training job.

I have partial implementations on all of these (except ImageReward) which I will push to a working branch soon that we could use as a starting point.
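
For the FID piece (part 1 above), one possible shape, using torchmetrics rather than the DallEval code, so this is only an assumption about the eventual implementation:

```python
# Hypothetical FID computation with torchmetrics; expects uint8 image batches
# of shape (N, 3, H, W). Random tensors stand in for real/generated images here.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)

real_images = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)

fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print(fid.compute())  # FID estimate (meaningless here with random data)
```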

mehdidc (Contributor, Author) commented May 5, 2023

I can take care of ImageReward and help with the others, so please go ahead and push the working branch so that I can extend it. Maybe you can do FID and I do CLIPScore, or the other way around.
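
For reference, wiring in ImageReward might look roughly like this, assuming the RM.load / model.score interface its README describes (worth double-checking against the actual package):

```python
# Hypothetical ImageReward scoring helper; API assumed from the project README.
import ImageReward as RM

reward_model = RM.load("ImageReward-v1.0")

def image_rewards(prompt: str, image_paths):
    # One human-preference reward per generated image for this prompt.
    return reward_model.score(prompt, image_paths)
```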

mehdidc (Contributor, Author) commented May 8, 2023

Another work to consider: https://arxiv.org/abs/2305.01569, similar to ImageReward (they also compare themselves with ImageReward). Code: https://github.com/yuvalkirstain/PickScore

vkramanuj (Contributor) commented:

Hi Mehdi, I have added some starting code in the evaluation branch. It's rough, but it has an implementation of computing CLIP score directly from a tar file without extracting it, as well as an example of how it would be used in evaluation/quality_metrics_watcher.py. It also has a starting point for FID score. Thanks for the pointer to that paper :-) Heads up, I will be a bit slow to reply due to the NeurIPS deadline.
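
For anyone skimming this later, the "compute directly from a tar file without extracting" trick can be done with just the standard library plus PIL; this is only an illustration, not the code in the evaluation branch:

```python
# Iterate over images inside a tar archive without unpacking it to disk.
import io
import tarfile
from PIL import Image

def iter_tar_images(tar_path: str):
    with tarfile.open(tar_path, "r") as tar:
        for member in tar:
            if member.isfile() and member.name.lower().endswith((".png", ".jpg", ".jpeg")):
                data = tar.extractfile(member).read()
                yield member.name, Image.open(io.BytesIO(data)).convert("RGB")
```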
