update: readme + docs
soumik12345 committed Oct 3, 2024
1 parent 03d0041 commit f5f0bfe
Showing 3 changed files with 63 additions and 67 deletions.
63 changes: 31 additions & 32 deletions README.md
@@ -1,5 +1,7 @@
# Hemm: Holistic Evaluation of Multi-modal Generative Models

[![](https://img.shields.io/badge/Hemm-docs-blue)](https://wandb.github.io/Hemm/)

Hemm is a library for performing comprehensive benchmarks of text-to-image diffusion models on image quality and prompt comprehension, integrated with [Weights & Biases](https://wandb.ai/site) and [Weave](https://wandb.github.io/weave/).

Hemm is highly inspired by the following projects:
@@ -8,78 +10,75 @@ Hemm is highly inspired by the following projects:
- [T2I-CompBench++: An Enhanced and Comprehensive Benchmark for Compositional Text-to-image Generation](https://karine-h.github.io/T2I-CompBench-new/)
- [GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment](https://arxiv.org/abs/2310.11513)

> [!WARNING]
> Hemm is still in early development; the API is subject to change and things may break. If you are interested in contributing, please feel free to open an issue and/or raise a pull request.
| ![](./docs/assets/evals.gif) |
|:--:|
| The evaluation pipeline takes each example, passes it through your application, and scores the output with multiple custom scoring functions using [Weave Evaluation](https://wandb.github.io/weave/guides/core-types/evaluations). This gives you a view of your model's performance, and a rich UI to drill into individual outputs and scores. |
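
For intuition on what "custom scoring functions" means here, below is a minimal, hypothetical sketch of a plain Weave evaluation with one toy scorer and a stand-in model; the scorer, the dummy model, and the inline dataset rows are all illustrative, and Hemm's built-in metrics wrap this pattern for you. Note that the name of the scorer argument that receives the model output may differ across Weave versions.

```python
import asyncio

import weave


@weave.op()
def prompt_echo_score(prompt: str, model_output: str) -> dict:
    # Toy scorer: checks whether the "generated" output mentions the prompt at all.
    return {"prompt_in_output": prompt.lower() in model_output.lower()}


class EchoModel(weave.Model):
    @weave.op()
    def predict(self, prompt: str) -> str:
        # Stand-in "model" that just echoes the prompt back as a caption.
        return f"an image of {prompt}"


weave.init(project_name="t2i_eval")
evaluation = weave.Evaluation(
    dataset=[{"prompt": "a red bicycle"}, {"prompt": "two cats on a sofa"}],
    scorers=[prompt_echo_score],
)
asyncio.run(evaluation.evaluate(EchoModel()))
```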

## Leaderboards

| Leaderboard | Weave Evals |
|---|---|
| [Rendering prompts with Complex Actions](https://wandb.ai/hemm-eval/mllm-eval-action/reports/Leaderboard-Rendering-prompts-with-Complex-Actions--Vmlldzo5Mjg2Nzky) | [Weave Evals](https://wandb.ai/hemm-eval/mllm-eval-action/weave/evaluations) |

## Installation

First, we recommend you install PyTorch by following the instructions at [pytorch.org/get-started/locally](https://pytorch.org/get-started/locally/).

```shell
git clone https://github.com/soumik12345/Hemm
git clone https://github.com/wandb/Hemm
cd Hemm
pip install -e ".[core]"
```

## Quickstart

First, let's publish a small subset of the MSCOCO validation set as a [Weave Dataset](https://wandb.github.io/weave/guides/core-types/datasets/).

```python
import weave
from hemm.utils import publish_dataset_to_weave

weave.init(project_name="t2i_eval")

dataset_reference = publish_dataset_to_weave(
dataset_path="HuggingFaceM4/COCO",
prompt_column="sentences",
ground_truth_image_column="image",
split="validation",
dataset_transforms=[
lambda item: {**item, "sentences": item["sentences"]["raw"]}
],
data_limit=5,
)
```
First, you need to publish your evaluation dataset to Weave. Check out [this tutorial](https://weave-docs.wandb.ai/guides/core-types/datasets), which shows how to publish a dataset to your project.
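
If you prefer a self-contained example, here is a minimal sketch of publishing a toy dataset with the plain Weave API; the project name, dataset name, and rows are purely illustrative, and the linked tutorial remains the authoritative reference.

```python
import weave

weave.init(project_name="t2i_eval")

# A couple of illustrative rows; replace them with your own prompts and reference images.
dataset = weave.Dataset(
    name="mscoco-validation-subset",
    rows=[
        {"prompt": "a cat sitting on a windowsill", "ground_truth_image": "images/cat.jpg"},
        {"prompt": "two dogs playing in the snow", "ground_truth_image": "images/dogs.jpg"},
    ],
)
weave.publish(dataset)
```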

| ![](./docs/assets/weave_dataset.gif) |
|:--:|
| [Weave Datasets](https://wandb.github.io/weave/guides/core-types/datasets/) enable you to collect examples for evaluation and automatically track versions for accurate comparisons. Easily update datasets with the UI and download the latest version locally with a simple API. |

Next, you can evaluate Stable Diffusion 1.4 on image quality metrics as shown in the following code snippet:
Once you have a dataset in your Weave project, you can evaluate a text-to-image generation model on Hemm's image-quality metrics, as shown in the following code snippet.

```python
import wandb
import weave


from hemm.eval_pipelines import BaseDiffusionModel, EvaluationPipeline
from hemm.metrics.image_quality import LPIPSMetric, PSNRMetric, SSIMMetric
from hemm.metrics.prompt_alignment import CLIPImageQualityScoreMetric, CLIPScoreMetric


# Initialize Weave and WandB
wandb.init(project="image-quality-leaderboard", job_type="evaluation")
weave.init(project_name="image-quality-leaderboard")


# Initialize the diffusion model to be evaluated as a `weave.Model` using `BaseDiffusionModel`.
# The `BaseDiffusionModel` class uses a `diffusers.DiffusionPipeline` under the hood.
# You can write your own `weave.Model` if your model is not diffusers-compatible (see the sketch after this snippet).
model = BaseDiffusionModel(diffusion_model_name_or_path="CompVis/stable-diffusion-v1-4")


# Add the model to the evaluation pipeline
evaluation_pipeline = EvaluationPipeline(model=model)


# Add PSNR Metric to the evaluation pipeline
psnr_metric = PSNRMetric(image_size=evaluation_pipeline.image_size)
evaluation_pipeline.add_metric(psnr_metric)


# Add SSIM Metric to the evaluation pipeline
ssim_metric = SSIMMetric(image_size=evaluation_pipeline.image_size)
evaluation_pipeline.add_metric(ssim_metric)


# Add LPIPS Metric to the evaluation pipeline
lpips_metric = LPIPSMetric(image_size=evaluation_pipeline.image_size)
evaluation_pipeline.add_metric(lpips_metric)


# Get the Weave dataset reference
dataset = weave.ref("COCO:v0").get()


# Evaluate!
evaluation_pipeline(dataset="COCO:v0")
evaluation_pipeline(dataset=dataset)
```
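
As the comment in the snippet above notes, a model that is not diffusers-compatible can be wrapped in your own `weave.Model`. The sketch below is a hypothetical illustration of that pattern; the class name, the fake generation backend, and the exact method signature and return format expected by Hemm's `EvaluationPipeline` are assumptions, so check the Hemm source for the actual interface.

```python
import weave
from PIL import Image


class MyCustomText2ImageModel(weave.Model):
    """Hypothetical wrapper around a non-diffusers text-to-image backend."""

    model_name: str
    image_size: int = 512

    @weave.op()
    def predict(self, prompt: str) -> dict:
        # Call your own generation backend here; a blank image stands in for real output.
        image = Image.new("RGB", (self.image_size, self.image_size), color="gray")
        return {"image": image}


# Hypothetical usage with the evaluation pipeline shown above:
# model = MyCustomText2ImageModel(model_name="my-backend")
# evaluation_pipeline = EvaluationPipeline(model=model)
```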

| ![](./docs/assets/weave_leaderboard.gif) |
|:--:|
| The evaluation pipeline takes each example, passes it through your application, and scores the output with multiple custom scoring functions using [Weave Evaluation](https://wandb.github.io/weave/guides/core-types/evaluations). This gives you a view of your model's performance, and a rich UI to drill into individual outputs and scores. |
Binary file added docs/assets/evals.gif
67 changes: 32 additions & 35 deletions docs/index.md
@@ -12,78 +12,75 @@ Hemm is highly inspired by the following projects:

- [GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment](https://arxiv.org/abs/2310.11513)

!!! warning
    Hemm is still in early development; the API is subject to change and things may break. If you are interested in contributing, please feel free to open an issue and/or raise a pull request.
| ![](./assets/evals.gif) |
|:--:|
| The evaluation pipeline takes each example, passes it through your application, and scores the output with multiple custom scoring functions using [Weave Evaluation](https://wandb.github.io/weave/guides/core-types/evaluations). This gives you a view of your model's performance, and a rich UI to drill into individual outputs and scores. |

## Leaderboards

| Leaderboard | Weave Evals |
|---|---|
| [Rendering prompts with Complex Actions](https://wandb.ai/hemm-eval/mllm-eval-action/reports/Leaderboard-Rendering-prompts-with-Complex-Actions--Vmlldzo5Mjg2Nzky) | [Weave Evals](https://wandb.ai/hemm-eval/mllm-eval-action/weave/evaluations) |

## Installation

First, we recommend you install PyTorch by following the instructions at [pytorch.org/get-started/locally](https://pytorch.org/get-started/locally/).

```shell
git clone https://github.com/soumik12345/Hemm
git clone https://github.com/wandb/Hemm
cd Hemm
pip install -e ".[core]"
```

## Quickstart

First, let's publish a small subset of the MSCOCO validation set as a [Weave Dataset](https://wandb.github.io/weave/guides/core-types/datasets/).
First, you need to publish your evaluation dataset to Weave. Check out [this tutorial](https://weave-docs.wandb.ai/guides/core-types/datasets), which shows how to publish a dataset to your project.

```python
import weave
from hemm.utils import publish_dataset_to_weave

weave.init(project_name="t2i_eval")

dataset_reference = publish_dataset_to_weave(
dataset_path="HuggingFaceM4/COCO",
prompt_column="sentences",
ground_truth_image_column="image",
split="validation",
dataset_transforms=[
lambda item: {**item, "sentences": item["sentences"]["raw"]}
],
data_limit=5,
)
```

| ![](./assets/weave_dataset.gif) |
|:--:|
| [Weave Datasets](https://wandb.github.io/weave/guides/core-types/datasets/) enable you to collect examples for evaluation and automatically track versions for accurate comparisons. Easily update datasets with the UI and download the latest version locally with a simple API. |
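
For reference, pulling the latest version of a previously published dataset back down is a one-liner; the dataset name and the `prompt` column below are illustrative.

```python
import weave

weave.init(project_name="image-quality-leaderboard")

# Fetch the newest version of a published dataset (the name is illustrative).
dataset = weave.ref("mscoco-validation-subset:latest").get()
for row in dataset.rows:
    print(row["prompt"])  # assumes the dataset has a "prompt" column
```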

Next, you can evaluate Stable Diffusion 1.4 on image quality metrics as shown in the following code snippet:
Once you have a dataset in your Weave project, you can evaluate a text-to-image generation model on Hemm's image-quality metrics, as shown in the following code snippet.

```python
import wandb
import weave

from hemm.eval_pipelines import BaseWeaveModel, EvaluationPipeline
from hemm.metrics.image_quality import LPIPSMetric, PSNRMetric, SSIMMetric

from hemm.eval_pipelines import BaseDiffusionModel, EvaluationPipeline
from hemm.metrics.prompt_alignment import CLIPImageQualityScoreMetric, CLIPScoreMetric


# Initialize Weave and WandB
wandb.init(project="image-quality-leaderboard", job_type="evaluation")
weave.init(project_name="image-quality-leaderboard")


# Initialize the diffusion model to be evaluated as a `weave.Model` using `BaseDiffusionModel`.
model = BaseWeaveModel(diffusion_model_name_or_path="CompVis/stable-diffusion-v1-4")
# The `BaseDiffusionModel` class uses a `diffusers.DiffusionPipeline` under the hood.
# You can write your own `weave.Model` if your model is not diffusers-compatible.
model = BaseDiffusionModel(diffusion_model_name_or_path="CompVis/stable-diffusion-v1-4")


# Add the model to the evaluation pipeline
evaluation_pipeline = EvaluationPipeline(model=model)


# Add PSNR Metric to the evaluation pipeline
psnr_metric = PSNRMetric(image_size=evaluation_pipeline.image_size)
evaluation_pipeline.add_metric(psnr_metric)


# Add SSIM Metric to the evaluation pipeline
ssim_metric = SSIMMetric(image_size=evaluation_pipeline.image_size)
evaluation_pipeline.add_metric(ssim_metric)


# Add LPIPS Metric to the evaluation pipeline
lpips_metric = LPIPSMetric(image_size=evaluation_pipeline.image_size)
evaluation_pipeline.add_metric(lpips_metric)


# Get the Weave dataset reference
dataset = weave.ref("COCO:v0").get()


# Evaluate!
evaluation_pipeline(dataset="COCO:v0")
evaluation_pipeline(dataset=dataset)
```

| ![](./assets/weave_leaderboard.gif) |
|:--:|
| The evaluation pipeline takes each example, passes it through your application, and scores the output with multiple custom scoring functions using [Weave Evaluation](https://wandb.github.io/weave/guides/core-types/evaluations). This gives you a view of your model's performance, and a rich UI to drill into individual outputs and scores. |
