
Normalization during retrieval scores computation #34

Open
violenil opened this issue Aug 6, 2024 · 2 comments
violenil commented Aug 6, 2024

Hi! Loving the Arena for quick inspection of models :)

I noticed that the scores for the retrieval are computed as dot products, as opposed to cosine similarity, even though the embeddings are not normalized. I manually added normalization during a local deployment and got significantly different results, at least for the jinaai/jina-embeddings-v2-base-en model. Do you think we can add an optional parameter to the model_meta.yml to normalize embeddings during the model.encode call? I'm happy to make a PR.
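The difference the issue describes can be shown with a small standalone sketch (plain Python, toy vectors invented for illustration): on unnormalized embeddings, dot-product scores and cosine similarity can rank documents differently, while after L2 normalization the dot product equals the cosine similarity.

```python
import math

def dot(a, b):
    # Plain dot product of two equal-length vectors.
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    # Scale a vector to unit L2 norm.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    # Cosine similarity == dot product of the L2-normalized vectors.
    return dot(l2_normalize(a), l2_normalize(b))

# Toy embeddings: doc2 has a much larger norm than doc1.
query = [1.0, 0.0]
doc1 = [0.9, 0.1]   # nearly parallel to the query, small norm
doc2 = [3.0, 3.0]   # 45 degrees off the query, but large norm

# Dot product favors the long vector...
assert dot(query, doc2) > dot(query, doc1)
# ...while cosine similarity favors the better-aligned one.
assert cosine(query, doc1) > cosine(query, doc2)

# After L2-normalizing, dot product and cosine similarity agree exactly.
assert abs(dot(l2_normalize(query), l2_normalize(doc1))
           - cosine(query, doc1)) < 1e-12
```

This is why normalizing inside (or right after) `model.encode` changes the retrieval rankings for models whose embeddings do not come out unit-length.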

@Muennighoff
Contributor

Thanks! For the arena that's live, we're actually using the GCP index, which normalizes first and then takes the dot product, i.e. cosine similarity:

feature_norm_type="UNIT_L2_NORM",

We should definitely adapt it for the local index, though. I think it should be done in the models folder, i.e. here: https://github.com/embeddings-benchmark/mteb/tree/main/mteb/models

I think we should probably add Jina in this file: https://github.com/embeddings-benchmark/mteb/blob/main/mteb/models/sentence_transformers_models.py and then activate normalization there, so normalization is always applied when the model is loaded via `mteb.get_model(...)` (cc @KennethEnevoldsen)
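One way "activate normalization there" could look in practice is a thin wrapper that L2-normalizes whatever the underlying model's `encode` returns. This is a hedged sketch, not mteb's actual registration API: `DummyEncoder` is a stand-in for a real model, and the wrapper name is hypothetical. (For sentence-transformers models specifically, `encode(..., normalize_embeddings=True)` achieves the same thing without a wrapper.)

```python
import math

class DummyEncoder:
    """Stand-in for a real embedding model (e.g. a SentenceTransformer)."""
    def encode(self, sentences):
        # Hypothetical fixed-size embeddings; a real model returns learned vectors.
        return [[float(len(s)), 1.0, 2.0] for s in sentences]

class NormalizingEncoder:
    """Wraps any encoder so every embedding comes out L2-normalized,
    which makes downstream dot-product scores equal cosine similarity."""
    def __init__(self, model):
        self.model = model

    def encode(self, sentences):
        embeddings = self.model.encode(sentences)
        return [self._l2_normalize(v) for v in embeddings]

    @staticmethod
    def _l2_normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0  # guard zero vectors
        return [x / norm for x in v]

model = NormalizingEncoder(DummyEncoder())
for vec in model.encode(["hello", "embedding arena"]):
    norm = math.sqrt(sum(x * x for x in vec))
    assert abs(norm - 1.0) < 1e-12  # every vector now has unit norm
```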

@KennethEnevoldsen

Yep, that is totally correct: the https://github.com/embeddings-benchmark/mteb/blob/main/mteb/models/ folder is the gold-standard reference for evaluated models.
