Leaderboard: Rendering prompts with Complex Actions
The Evaluation Dataset
This leaderboard demonstrates the capability of text-to-image generation models to render prompts with complex actions. Each model is evaluated on a set of 716 prompts describing complex actions and interactions between objects, such as "The rectangular mirror was hung above the marble sink" or "The brown cat was lying on the blue blanket". The dataset is compiled from the complex_train_action and complex_val_action subsets of the T2I-CompBench dataset. You can find the evaluation dataset here, published as a Weave dataset.
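As a minimal sketch of how a published Weave dataset can be pulled for inspection, the snippet below uses the Weave client; the project name, dataset name, and row field are placeholders, not the actual identifiers used by Hemm.

```python
import weave

# Initialize the Weave client against a W&B project.
# The project name below is a placeholder, not the actual Hemm project.
weave.init("my-entity/hemm-eval")

# Fetch the published dataset by reference; the dataset name and version
# ("t2i_compbench_complex_action:latest") are assumed for illustration.
dataset = weave.ref("t2i_compbench_complex_action:latest").get()

# Each row carries a text-to-image prompt describing a complex action.
for row in list(dataset.rows)[:3]:
    print(row["prompt"])
```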
The Metric for Evaluation
We use a multi-modal LLM-based evaluation metric inspired by Section IV.D of T2I-CompBench++. The metric uses a two-stage prompting strategy with a powerful multi-modal LLM (GPT-4-Turbo).
In the first stage, the MLLM is prompted to describe the generated image with the following system prompt:
You are a helpful assistant meant to describe images in detail. You should pay special attention to the actions, events, objects and their relationships in the image.
In the second stage, the MLLM is prompted to judge the image concerning the prompt with the following system prompt:
You are a helpful assistant meant to identify the actions, events, objects and their relationships in the image. You have to extract the question, the score, and the explanation from the user's response.
In the user prompt for the second stage, we ask the MLLM to evaluate the image using a comprehensive scoring strategy and include the description generated in the first stage:
Looking at the image and given a detailed description of the image, evaluate if the text "<IMAGE-GENERATION-PROMPT>" is correctly portrayed in the image.
Give a score from 1 to 5, according to the following criteria:
5: the image accurately portrayed the actions, events and relationships between objects described in the text.
4: the image portrayed most of the actions, events and relationships but with minor discrepancies.
3: the image depicted some elements, but the actions and relationships between objects are not correct.
2: the image failed to convey the full scope of the text.
1: the image did not depict any actions or events that match the text.
Here are some more rules for scoring that you should follow:
1. The shapes, layouts, orientations, and placements of the objects in the image should be realistic and adhere to physical constraints.
You should deduct 1 point from the score if there are any deformations with respect to the shapes, layouts, orientations, and
placements of the objects in the image.
2. The anatomy of characters, humans, and animals should also be realistic and adhere to realistic constraints, shapes, and proportions.
You should deduct 1 point from the score if there are any deformations with respect to the anatomy of characters, humans, and animals
in the image.
3. The spatial layout of the objects in the image should be consistent with the text prompt. You should deduct 1 point from the score if the
spatial layout of the objects in the image is not consistent with the text prompt.
Here is a detailed explanation of the image:
---
<IMAGE-DESCRIPTION-FROM-STAGE-1>
---
Provide your analysis and explanation to justify the score.
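To make the two-stage strategy above concrete, here is a minimal sketch that wires the stage-1 and stage-2 prompts to the OpenAI chat completions API. The helper names, the abridged stage-2 user prompt, the regex-based score extraction, and the normalization of the 1-5 score to the 0-1 range reported in the leaderboard below are assumptions for illustration, not the exact Hemm implementation.

```python
import base64
import re

from openai import OpenAI

client = OpenAI()

STAGE_1_SYSTEM = (
    "You are a helpful assistant meant to describe images in detail. "
    "You should pay special attention to the actions, events, objects "
    "and their relationships in the image."
)
STAGE_2_SYSTEM = (
    "You are a helpful assistant meant to identify the actions, events, objects "
    "and their relationships in the image. You have to extract the question, "
    "the score, and the explanation from the user's response."
)


def encode_image(path: str) -> str:
    """Base64-encode a generated image so it can be sent to the MLLM."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()


def judge_image(image_path: str, generation_prompt: str) -> float:
    image_content = {
        "type": "image_url",
        "image_url": {"url": f"data:image/png;base64,{encode_image(image_path)}"},
    }

    # Stage 1: ask the MLLM to describe the generated image.
    description = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system", "content": STAGE_1_SYSTEM},
            {"role": "user", "content": [image_content]},
        ],
    ).choices[0].message.content

    # Stage 2: ask the MLLM to score the image against the generation prompt,
    # re-using the stage-1 description inside the user prompt. The full scoring
    # rubric from the prompt above is abridged here for brevity.
    user_prompt = (
        f"Looking at the image and given a detailed description of the image, "
        f'evaluate if the text "{generation_prompt}" is correctly portrayed in the image. '
        f"Give a score from 1 to 5. Here is a detailed explanation of the image:\n"
        f"---\n{description}\n---\n"
        f"Provide your analysis and explanation to justify the score."
    )
    judgement = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system", "content": STAGE_2_SYSTEM},
            {
                "role": "user",
                "content": [image_content, {"type": "text", "text": user_prompt}],
            },
        ],
    ).choices[0].message.content

    # Naive score extraction and 0-1 normalization; both are assumptions made
    # for this sketch, and the actual pipeline may parse structured output instead.
    match = re.search(r"[1-5]", judgement)
    score = int(match.group()) if match else 1
    return score / 5.0
```

The per-model leaderboard score can then be read as the mean of this normalized judgement over all 716 prompts, under the normalization assumption stated above.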
The Leaderboard
| Model | Score |
|---|---|
| FLUX.1-schnell | 0.8193 |
| FLUX.1-dev | 0.8061 |
| Stable Diffusion 3 Medium | 0.8061 |
| PixArt Sigma | 0.8011 |
| PixArt Alpha | 0.7606 |
| SDXL-1.0 | 0.748 |
| SDXL-Turbo | 0.7453 |
| Stable Diffusion 2.1 | 0.6961 |