Confused with image_features minus text_features #48

Open
CharlesGong12 opened this issue Jun 23, 2024 · 2 comments

@CharlesGong12

Hi, thanks for your amazing work!
I am confused by the subtraction of text_features from image_features. The image features are encoded by CLIPVisionModelWithProjection, but the text features are encoded by CLIPTextModel, which has no projection layer. So why can text_features be subtracted directly from image_features? It seems that the image features and text features are not in the same space.

@CharlesGong12
Author

The SD pipeline's encode_prompt is here: https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py#L302

@haofanwang
Member

Good eye! Thanks for your feedback! @CharlesGong12

To clarify: for the SDXL model, the second text encoder is CLIPTextModelWithProjection, and text_features is the pooled output of that second encoder only. For the SD1.5 model, the text encoder is indeed a CLIPTextModel, so in our inference code the text_features are extracted manually, as here.
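To illustrate why the projection matters, here is a minimal pure-Python sketch, not the repository's code. The widths (6, 8, 4), the random matrices standing in for learned projection layers, and the helper names `make_projection`, `project`, and `subtract` are all hypothetical. In Hugging Face transformers, the CLIP classes whose names end in `WithProjection` apply a learned projection of this kind and return `text_embeds` / `image_embeds` in the shared space.

```python
import random


def make_projection(in_dim, out_dim, seed):
    """Random (out_dim x in_dim) matrix standing in for a learned projection layer."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def project(matrix, vector):
    """Matrix-vector product; maps an encoder output into the shared space."""
    if len(matrix[0]) != len(vector):
        raise ValueError("projection input width does not match the vector")
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def subtract(a, b):
    """Element-wise a - b; only defined when both vectors have the same width."""
    if len(a) != len(b):
        raise ValueError("features have different widths and cannot be subtracted")
    return [x - y for x, y in zip(a, b)]


# Hypothetical widths, kept tiny for readability (real CLIP models are far wider).
TEXT_HIDDEN, IMAGE_HIDDEN, SHARED = 6, 8, 4

rng = random.Random(0)
text_hidden = [rng.uniform(-1.0, 1.0) for _ in range(TEXT_HIDDEN)]    # unprojected text output
image_hidden = [rng.uniform(-1.0, 1.0) for _ in range(IMAGE_HIDDEN)]  # unprojected image output

# Each encoder has its own projection into the shared embedding space.
text_proj = make_projection(TEXT_HIDDEN, SHARED, seed=1)
image_proj = make_projection(IMAGE_HIDDEN, SHARED, seed=2)

text_features = project(text_proj, text_hidden)
image_features = project(image_proj, image_hidden)

# After projection both vectors have width SHARED, so the subtraction is well defined.
content_removed = subtract(image_features, text_features)
```

Calling `subtract(image_hidden, text_hidden)` on the unprojected outputs raises `ValueError` in this sketch because the widths differ. Even when two encoders happen to share a hidden width, their unprojected outputs are not trained to be aligned, which is why both sides should come from projection-equipped encoders.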

Hope this helps. Please let me know if you have further questions.
