update: readme + docs
soumik12345 committed Oct 3, 2024
1 parent 03d0041 commit f5f0bfe
Showing 3 changed files with 63 additions and 67 deletions.
63 changes: 31 additions & 32 deletions README.md
@@ -1,5 +1,7 @@
# Hemm: Holistic Evaluation of Multi-modal Generative Models

[![](https://img.shields.io/badge/Hemm-docs-blue)](https://wandb.github.io/Hemm/)

Hemm is a library for performing comprehensive benchmarks of text-to-image diffusion models on image quality and prompt comprehension, integrated with [Weights & Biases](https://wandb.ai/site) and [Weave](https://wandb.github.io/weave/).

Hemm is highly inspired by the following projects:
@@ -8,78 +10,75 @@ Hemm is highly inspired by the following projects:
- [T2I-CompBench++: An Enhanced and Comprehensive Benchmark for Compositional Text-to-image Generation](https://karine-h.github.io/T2I-CompBench-new/)
- [GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment](https://arxiv.org/abs/2310.11513)

> [!WARNING]
> Hemm is still in early development; the API is subject to change and things may break. If you are interested in contributing, please feel free to open an issue and/or raise a pull request.
| ![](./docs/assets/evals.gif) |
|:--:|
| The evaluation pipeline takes each example, passes it through your application, and scores the output with multiple custom scoring functions using [Weave Evaluation](https://wandb.github.io/weave/guides/core-types/evaluations). This gives you a view of your model's performance, and a rich UI to drill into individual outputs and scores. |
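
For intuition on what "custom scoring functions" means here, below is a minimal, hypothetical sketch of a plain Weave evaluation with one toy scorer and a stand-in model; the scorer, the dummy model, and the inline dataset rows are all illustrative, and Hemm's built-in metrics wrap this pattern for you. Note that the name of the scorer argument that receives the model output may differ across Weave versions.

```python
import asyncio

import weave


@weave.op()
def prompt_echo_score(prompt: str, model_output: str) -> dict:
    # Toy scorer: checks whether the "generated" output mentions the prompt at all.
    return {"prompt_in_output": prompt.lower() in model_output.lower()}


class EchoModel(weave.Model):
    @weave.op()
    def predict(self, prompt: str) -> str:
        # Stand-in "model" that just echoes the prompt back as a caption.
        return f"an image of {prompt}"


weave.init(project_name="t2i_eval")
evaluation = weave.Evaluation(
    dataset=[{"prompt": "a red bicycle"}, {"prompt": "two cats on a sofa"}],
    scorers=[prompt_echo_score],
)
asyncio.run(evaluation.evaluate(EchoModel()))
```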

## Leaderboards

| Leaderboard | Weave Evals |
|---|---|
| [Rendering prompts with Complex Actions](https://wandb.ai/hemm-eval/mllm-eval-action/reports/Leaderboard-Rendering-prompts-with-Complex-Actions--Vmlldzo5Mjg2Nzky) | [Weave Evals](https://wandb.ai/hemm-eval/mllm-eval-action/weave/evaluations) |

## Installation

First, we recommend you install PyTorch by following the instructions at [pytorch.org/get-started/locally](https://pytorch.org/get-started/locally/).

```shell
git clone https://github.com/soumik12345/Hemm
git clone https://github.com/wandb/Hemm
cd Hemm
pip install -e ".[core]"
```

## Quickstart

First, let's publish a small subset of the MSCOCO validation set as a [Weave Dataset](https://wandb.github.io/weave/guides/core-types/datasets/).

```python
import weave
from hemm.utils import publish_dataset_to_weave

weave.init(project_name="t2i_eval")

dataset_reference = publish_dataset_to_weave(
dataset_path="HuggingFaceM4/COCO",
prompt_column="sentences",
ground_truth_image_column="image",
split="validation",
dataset_transforms=[
lambda item: {**item, "sentences": item["sentences"]["raw"]}
],
data_limit=5,
)
```
First, you need to publish your evaluation dataset to Weave. Check out [this tutorial](https://weave-docs.wandb.ai/guides/core-types/datasets), which shows how to publish a dataset to your project.
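
If you prefer a self-contained example, here is a minimal sketch of publishing a toy dataset with the plain Weave API; the project name, dataset name, and rows are purely illustrative, and the linked tutorial remains the authoritative reference.

```python
import weave

weave.init(project_name="t2i_eval")

# A couple of illustrative rows; replace them with your own prompts and reference images.
dataset = weave.Dataset(
    name="mscoco-validation-subset",
    rows=[
        {"prompt": "a cat sitting on a windowsill", "ground_truth_image": "images/cat.jpg"},
        {"prompt": "two dogs playing in the snow", "ground_truth_image": "images/dogs.jpg"},
    ],
)
weave.publish(dataset)
```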

| ![](./docs/assets/weave_dataset.gif) |
|:--:|
| [Weave Datasets](https://wandb.github.io/weave/guides/core-types/datasets/) enable you to collect examples for evaluation and automatically track versions for accurate comparisons. Easily update datasets with the UI and download the latest version locally with a simple API. |

Next, you can evaluate Stable Diffusion 1.4 on image quality metrics as shown in the following code snippet:
Once you have a dataset in your Weave project, you can evaluate a text-to-image generation model on Hemm's image-quality metrics, as shown in the following code snippet.

```python
import wandb
import weave


from hemm.eval_pipelines import BaseDiffusionModel, EvaluationPipeline
from hemm.metrics.image_quality import LPIPSMetric, PSNRMetric, SSIMMetric
from hemm.metrics.prompt_alignment import CLIPImageQualityScoreMetric, CLIPScoreMetric


# Initialize Weave and WandB
wandb.init(project="image-quality-leaderboard", job_type="evaluation")
weave.init(project_name="image-quality-leaderboard")


# Initialize the diffusion model to be evaluated as a `weave.Model` using `BaseDiffusionModel`.
# The `BaseDiffusionModel` class uses a `diffusers.DiffusionPipeline` under the hood.
# You can write your own `weave.Model` if your model is not diffusers-compatible (see the sketch after this snippet).
model = BaseDiffusionModel(diffusion_model_name_or_path="CompVis/stable-diffusion-v1-4")


# Add the model to the evaluation pipeline
evaluation_pipeline = EvaluationPipeline(model=model)


# Add PSNR Metric to the evaluation pipeline
psnr_metric = PSNRMetric(image_size=evaluation_pipeline.image_size)
evaluation_pipeline.add_metric(psnr_metric)


# Add SSIM Metric to the evaluation pipeline
ssim_metric = SSIMMetric(image_size=evaluation_pipeline.image_size)
evaluation_pipeline.add_metric(ssim_metric)


# Add LPIPS Metric to the evaluation pipeline
lpips_metric = LPIPSMetric(image_size=evaluation_pipeline.image_size)
evaluation_pipeline.add_metric(lpips_metric)


# Get the Weave dataset reference
dataset = weave.ref("COCO:v0").get()


# Evaluate!
evaluation_pipeline(dataset="COCO:v0")
evaluation_pipeline(dataset=dataset)
```
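
As the comment in the snippet above notes, a model that is not diffusers-compatible can be wrapped in your own `weave.Model`. The sketch below is a hypothetical illustration of that pattern; the class name, the fake generation backend, and the exact method signature and return format expected by Hemm's `EvaluationPipeline` are assumptions, so check the Hemm source for the actual interface.

```python
import weave
from PIL import Image


class MyCustomText2ImageModel(weave.Model):
    """Hypothetical wrapper around a non-diffusers text-to-image backend."""

    model_name: str
    image_size: int = 512

    @weave.op()
    def predict(self, prompt: str) -> dict:
        # Call your own generation backend here; a blank image stands in for real output.
        image = Image.new("RGB", (self.image_size, self.image_size), color="gray")
        return {"image": image}


# Hypothetical usage with the evaluation pipeline shown above:
# model = MyCustomText2ImageModel(model_name="my-backend")
# evaluation_pipeline = EvaluationPipeline(model=model)
```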

| ![](./docs/assets/weave_leaderboard.gif) |
|:--:|
| The evaluation pipeline takes each example, passes it through your application, and scores the output with multiple custom scoring functions using [Weave Evaluation](https://wandb.github.io/weave/guides/core-types/evaluations). This gives you a view of your model's performance, and a rich UI to drill into individual outputs and scores. |
Binary file added docs/assets/evals.gif
67 changes: 32 additions & 35 deletions docs/index.md
@@ -12,78 +12,75 @@ Hemm is highly inspired by the following projects:

- [GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment](https://arxiv.org/abs/2310.11513)

!!! warning
    Hemm is still in early development; the API is subject to change and things may break. If you are interested in contributing, please feel free to open an issue and/or raise a pull request.
| ![](./assets/evals.gif) |
|:--:|
| The evaluation pipeline takes each example, passes it through your application, and scores the output with multiple custom scoring functions using [Weave Evaluation](https://wandb.github.io/weave/guides/core-types/evaluations). This gives you a view of your model's performance, and a rich UI to drill into individual outputs and scores. |

## Leaderboards

| Leaderboard | Weave Evals |
|---|---|
| [Rendering prompts with Complex Actions](https://wandb.ai/hemm-eval/mllm-eval-action/reports/Leaderboard-Rendering-prompts-with-Complex-Actions--Vmlldzo5Mjg2Nzky) | [Weave Evals](https://wandb.ai/hemm-eval/mllm-eval-action/weave/evaluations) |

## Installation

First, we recommend you install PyTorch by following the instructions at [pytorch.org/get-started/locally](https://pytorch.org/get-started/locally/).

```shell
git clone https://github.com/soumik12345/Hemm
git clone https://github.com/wandb/Hemm
cd Hemm
pip install -e ".[core]"
```

## Quickstart

First, let's publish a small subset of the MSCOCO validation set as a [Weave Dataset](https://wandb.github.io/weave/guides/core-types/datasets/).
First, you need to publish your evaluation dataset to Weave. Check out [this tutorial](https://weave-docs.wandb.ai/guides/core-types/datasets), which shows how to publish a dataset to your project.

```python
import weave
from hemm.utils import publish_dataset_to_weave

weave.init(project_name="t2i_eval")

dataset_reference = publish_dataset_to_weave(
dataset_path="HuggingFaceM4/COCO",
prompt_column="sentences",
ground_truth_image_column="image",
split="validation",
dataset_transforms=[
lambda item: {**item, "sentences": item["sentences"]["raw"]}
],
data_limit=5,
)
```

| ![](./assets/weave_dataset.gif) |
|:--:|
| [Weave Datasets](https://wandb.github.io/weave/guides/core-types/datasets/) enable you to collect examples for evaluation and automatically track versions for accurate comparisons. Easily update datasets with the UI and download the latest version locally with a simple API. |
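
For reference, pulling the latest version of a previously published dataset back down is a one-liner; the dataset name and the `prompt` column below are illustrative.

```python
import weave

weave.init(project_name="image-quality-leaderboard")

# Fetch the newest version of a published dataset (the name is illustrative).
dataset = weave.ref("mscoco-validation-subset:latest").get()
for row in dataset.rows:
    print(row["prompt"])  # assumes the dataset has a "prompt" column
```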

Next, you can evaluate Stable Diffusion 1.4 on image quality metrics as shown in the following code snippet:
Once you have a dataset in your Weave project, you can evaluate a text-to-image generation model on Hemm's image-quality metrics, as shown in the following code snippet.

```python
import wandb
import weave

from hemm.eval_pipelines import BaseWeaveModel, EvaluationPipeline
from hemm.metrics.image_quality import LPIPSMetric, PSNRMetric, SSIMMetric

from hemm.eval_pipelines import BaseDiffusionModel, EvaluationPipeline
from hemm.metrics.prompt_alignment import CLIPImageQualityScoreMetric, CLIPScoreMetric


# Initialize Weave and WandB
wandb.init(project="image-quality-leaderboard", job_type="evaluation")
weave.init(project_name="image-quality-leaderboard")


# Initialize the diffusion model to be evaluated as a `weave.Model` using `BaseDiffusionModel`.
model = BaseWeaveModel(diffusion_model_name_or_path="CompVis/stable-diffusion-v1-4")
# The `BaseDiffusionModel` class uses a `diffusers.DiffusionPipeline` under the hood.
# You can write your own `weave.Model` if your model is not diffusers-compatible.
model = BaseDiffusionModel(diffusion_model_name_or_path="CompVis/stable-diffusion-v1-4")


# Add the model to the evaluation pipeline
evaluation_pipeline = EvaluationPipeline(model=model)


# Add PSNR Metric to the evaluation pipeline
psnr_metric = PSNRMetric(image_size=evaluation_pipeline.image_size)
evaluation_pipeline.add_metric(psnr_metric)


# Add SSIM Metric to the evaluation pipeline
ssim_metric = SSIMMetric(image_size=evaluation_pipeline.image_size)
evaluation_pipeline.add_metric(ssim_metric)


# Add LPIPS Metric to the evaluation pipeline
lpips_metric = LPIPSMetric(image_size=evaluation_pipeline.image_size)
evaluation_pipeline.add_metric(lpips_metric)


# Get the Weave dataset reference
dataset = weave.ref("COCO:v0").get()


# Evaluate!
evaluation_pipeline(dataset="COCO:v0")
evaluation_pipeline(dataset=dataset)
```

| ![](./assets/weave_leaderboard.gif) |
|:--:|
| The evaluation pipeline takes each example, passes it through your application, and scores the output with multiple custom scoring functions using [Weave Evaluation](https://wandb.github.io/weave/guides/core-types/evaluations). This gives you a view of your model's performance, and a rich UI to drill into individual outputs and scores. |
