
Evaluation metrics #4

Open
mehdidc opened this issue Apr 26, 2023 · 6 comments

mehdidc (Contributor) commented Apr 26, 2023

Would be great to have (optional) model evaluation.
Possibilities:

  • CLIP score (e.g. on a reference set of captions like the ones from Parti); a rough sketch is included below
  • FID, or inception distance in general, where we could use other models such as CLIP to extract features, since Inception is ImageNet-specific
  • Possibly the recent ImageReward https://arxiv.org/abs/2304.05977, which relies on a model trained on human rankings and is quite easy to use; they are also planning to grow the ranking dataset
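
A minimal sketch of what the CLIP-score piece could look like, assuming open_clip is used for scoring (the model name and pretrained tag below are just examples, not something this repo prescribes):

```python
# Hypothetical CLIP-score helper: cosine similarity between image and caption
# embeddings, scaled by 100 as is conventional for CLIPScore.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

@torch.no_grad()
def clip_score(image_path: str, caption: str) -> float:
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    text = tokenizer([caption])
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return 100.0 * (img_feat * txt_feat).sum(dim=-1).item()
```
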
vkramanuj (Contributor) commented:

Good points. The default captions used by validate_and_save_model are from https://github.com/j-min/DallEval, with the intent of eventually adding automatic validation to this repo. There are some options for merging this into the repo:

  1. Introduce these metrics in the validate_and_save function. I am partially against this because CLIP score and FID both involve loading other models/datasets, which would complicate the config, increase GPU memory consumption, and add complexity to the main train script.
  2. Set up an asynchronous function that "watches" the output examples folder that validate_and_save writes, then computes FID/CLIP score when there's an update (sketched below). This would run on a separate node/set of GPUs from the train script and would be invoked by the user separately.

Which do you think is a better option? Also thanks for the pointer to ImageReward, I will look into it!
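
A rough sketch of the option-2 watcher, just to make the idea concrete; the directory layout and the compute_metrics hook are hypothetical placeholders, not the repo's actual interface:

```python
# Hypothetical watcher loop: poll the samples directory that validate_and_save
# writes to and run the metric pipeline on any step folder we haven't seen yet.
import time
from pathlib import Path

def watch(samples_dir: str, compute_metrics, poll_seconds: int = 60):
    seen = set()
    root = Path(samples_dir)
    while True:
        for step_dir in sorted(p for p in root.iterdir() if p.is_dir()):
            if step_dir not in seen:
                compute_metrics(step_dir)  # e.g. FID / CLIP score, logged to wandb
                seen.add(step_dir)
        time.sleep(poll_seconds)
```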

mehdidc (Contributor, Author) commented Apr 28, 2023

I would also go for option 2, at least for now, both for the reasons you mention and because we would need to distribute the metric computation across GPUs as well; otherwise only rank zero would be used while the other GPUs wait.
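
To illustrate the rank-zero point, here is one way the metric computation could be sharded across ranks instead of running only on rank 0; score_one is a placeholder for any per-sample metric (CLIP score, a reward model, ...):

```python
# Hypothetical sketch: each rank scores its own shard of the samples, then the
# partial sums are all-reduced so every rank ends up with the global mean.
import torch
import torch.distributed as dist

def distributed_mean_score(samples, score_one):
    rank, world_size = dist.get_rank(), dist.get_world_size()
    shard = samples[rank::world_size]
    partial = torch.tensor(
        [sum(score_one(s) for s in shard), float(len(shard))], device="cuda"
    )
    dist.all_reduce(partial, op=dist.ReduceOp.SUM)
    total_score, total_count = partial.tolist()
    return total_score / total_count
```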

vkramanuj (Contributor) commented Apr 28, 2023

Sounds good to me; it will take me some time to implement this. Let me know if you'd like to take some part of the PR. I see 3 direct parts:

  1. Integration of FID score evaluation (with https://github.com/j-min/DallEval).
  2. CLIPScore evaluation + possibly ImageReward.
  3. A watcher that runs a given evaluation pipeline, which would need to sync to the same wandb run as the training job.

I have partial implementations on all of these (except ImageReward) which I will push to a working branch soon that we could use as a starting point.
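
For the FID piece (part 1 above), one possible shape, using torchmetrics rather than the DallEval code, so this is only an assumption about the eventual implementation:

```python
# Hypothetical FID computation with torchmetrics; expects uint8 image batches
# of shape (N, 3, H, W). Random tensors stand in for real/generated images here.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)

real_images = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (64, 3, 299, 299), dtype=torch.uint8)

fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print(fid.compute())  # FID estimate (meaningless here with random data)
```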

mehdidc (Contributor, Author) commented May 5, 2023

I can take care of ImageReward and help with the others, so please go ahead and push the working branch so that I can extend it. Maybe you can do FID and I do CLIPScore, or the other way around.
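
For reference, wiring in ImageReward might look roughly like this, assuming the RM.load / model.score interface its README describes (worth double-checking against the actual package):

```python
# Hypothetical ImageReward scoring helper; API assumed from the project README.
import ImageReward as RM

reward_model = RM.load("ImageReward-v1.0")

def image_rewards(prompt: str, image_paths):
    # One human-preference reward per generated image for this prompt.
    return reward_model.score(prompt, image_paths)
```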

mehdidc (Contributor, Author) commented May 8, 2023

Another work to consider: https://arxiv.org/abs/2305.01569, similar to ImageReward (they also compare themselves with ImageReward). Code: https://github.com/yuvalkirstain/PickScore

vkramanuj (Contributor) commented:

Hi Mehdi, I have added some starting code in the evaluation branch. It's rough, but it has an implementation of computing CLIP score directly from a tar file without extracting it, as well as an example of how it would be used in evaluation/quality_metrics_watcher.py. It also has a starting point for FID score. Thanks for the pointer to that paper :-) Heads up, I will be a bit slow to reply due to the NeurIPS deadline.
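
For anyone skimming this later, the "compute directly from a tar file without extracting" trick can be done with just the standard library plus PIL; this is only an illustration, not the code in the evaluation branch:

```python
# Iterate over images inside a tar archive without unpacking it to disk.
import io
import tarfile
from PIL import Image

def iter_tar_images(tar_path: str):
    with tarfile.open(tar_path, "r") as tar:
        for member in tar:
            if member.isfile() and member.name.lower().endswith((".png", ".jpg", ".jpeg")):
                data = tar.extractfile(member).read()
                yield member.name, Image.open(io.BytesIO(data)).convert("RGB")
```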
