
Normalization during retrieval scores computation #34

Open
violenil opened this issue Aug 6, 2024 · 2 comments
violenil commented Aug 6, 2024

Hi! Loving the Arena for quick inspection of models :)

I noticed that the scores for the retrieval are computed as dot products, as opposed to cosine similarity, even though the embeddings are not normalized. I manually added normalization during a local deployment and got significantly different results, at least for the jinaai/jina-embeddings-v2-base-en model. Do you think we can add an optional parameter to the model_meta.yml to normalize embeddings during the model.encode call? I'm happy to make a PR.
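The difference the issue describes can be shown with a small standalone sketch (plain Python, toy vectors invented for illustration): on unnormalized embeddings, dot-product scores and cosine similarity can rank documents differently, while after L2 normalization the dot product equals the cosine similarity.

```python
import math

def dot(a, b):
    # Plain dot product of two equal-length vectors.
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    # Scale a vector to unit L2 norm.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    # Cosine similarity == dot product of the L2-normalized vectors.
    return dot(l2_normalize(a), l2_normalize(b))

# Toy embeddings: doc2 has a much larger norm than doc1.
query = [1.0, 0.0]
doc1 = [0.9, 0.1]   # nearly parallel to the query, small norm
doc2 = [3.0, 3.0]   # 45 degrees off the query, but large norm

# Dot product favors the long vector...
assert dot(query, doc2) > dot(query, doc1)
# ...while cosine similarity favors the better-aligned one.
assert cosine(query, doc1) > cosine(query, doc2)

# After L2-normalizing, dot product and cosine similarity agree exactly.
assert abs(dot(l2_normalize(query), l2_normalize(doc1))
           - cosine(query, doc1)) < 1e-12
```

This is why normalizing inside (or right after) `model.encode` changes the retrieval rankings for models whose embeddings do not come out unit-length.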

@Muennighoff
Contributor

Thanks! For the arena that's live, we're actually using the GCP index, which normalizes first and then takes the dot product, i.e. cosine similarity:

feature_norm_type="UNIT_L2_NORM",

We should definitely adapt it for the local index, though. I think it should be done in the models folder, i.e. here: https://github.com/embeddings-benchmark/mteb/tree/main/mteb/models

I think we should probably add Jina in this file: https://github.com/embeddings-benchmark/mteb/blob/main/mteb/models/sentence_transformers_models.py and then activate normalization there, so normalization is always applied when the model is loaded via `mteb.get_model(...)` (cc @KennethEnevoldsen)
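One way "activate normalization there" could look in practice is a thin wrapper that L2-normalizes whatever the underlying model's `encode` returns. This is a hedged sketch, not mteb's actual registration API: `DummyEncoder` is a stand-in for a real model, and the wrapper name is hypothetical. (For sentence-transformers models specifically, `encode(..., normalize_embeddings=True)` achieves the same thing without a wrapper.)

```python
import math

class DummyEncoder:
    """Stand-in for a real embedding model (e.g. a SentenceTransformer)."""
    def encode(self, sentences):
        # Hypothetical fixed-size embeddings; a real model returns learned vectors.
        return [[float(len(s)), 1.0, 2.0] for s in sentences]

class NormalizingEncoder:
    """Wraps any encoder so every embedding comes out L2-normalized,
    which makes downstream dot-product scores equal cosine similarity."""
    def __init__(self, model):
        self.model = model

    def encode(self, sentences):
        embeddings = self.model.encode(sentences)
        return [self._l2_normalize(v) for v in embeddings]

    @staticmethod
    def _l2_normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0  # guard zero vectors
        return [x / norm for x in v]

model = NormalizingEncoder(DummyEncoder())
for vec in model.encode(["hello", "embedding arena"]):
    norm = math.sqrt(sum(x * x for x in vec))
    assert abs(norm - 1.0) < 1e-12  # every vector now has unit norm
```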

@KennethEnevoldsen

Yep, that is totally correct: the https://github.com/embeddings-benchmark/mteb/blob/main/mteb/models/ folder is the gold-standard reference for evaluated models.
