Domain adaptation of an embedding model using unsupervised and supervised finetuning on scientific texts for the SciFact retrieval task.

sn2727/finetuning-embedding-models

Finetuning of embedding models

Embedding models are crucial components in modern natural language processing, serving as the foundation for numerous applications, particularly Retrieval-Augmented Generation (RAG) systems. These models transform text into dense vector representations, enabling efficient semantic search and comparison. However, despite the strong generalization abilities of state-of-the-art embedding models, their performance can still be improved when they are applied to specific tasks or domains.
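The core idea above, scoring text by similarity between dense vectors, can be illustrated with a minimal sketch. The 4-dimensional vectors are hypothetical stand-ins for real model output:

```python
import numpy as np

# Toy illustration of embedding-based retrieval: documents and a query are
# represented as dense vectors, and relevance is scored by cosine similarity.
doc_embeddings = np.array([
    [0.9, 0.1, 0.0, 0.1],   # doc 0
    [0.1, 0.8, 0.2, 0.0],   # doc 1
    [0.0, 0.1, 0.9, 0.3],   # doc 2
])
query_embedding = np.array([0.85, 0.15, 0.05, 0.1])

def cosine_scores(query, docs):
    """Cosine similarity between a query vector and each document vector."""
    query = query / np.linalg.norm(query)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return docs @ query

scores = cosine_scores(query_embedding, doc_embeddings)
best = int(np.argmax(scores))  # index of the most similar document
```

In a real system the vectors come from an embedding model and are typically stored in a vector index, but the ranking step reduces to exactly this similarity computation.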

This repository provides short examples of, and explanations for, fine-tuning methods for embedding models. The dataset used is SciFact, which consists of scientific texts and corresponding queries and is also part of the MTEB benchmark's retrieval task. The notebook main.ipynb explains the methods, hyperparameters, and training data used.
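Supervised finetuning on (query, relevant passage) pairs is commonly done with an in-batch-negatives contrastive loss (known as multiple-negatives ranking loss). The notebook details the methods actually used here; the following is only a generic numpy sketch of that loss, with the `scale` temperature chosen as an illustrative value:

```python
import numpy as np

def multiple_negatives_ranking_loss(query_emb, passage_emb, scale=20.0):
    """Contrastive loss over a batch of (query, relevant passage) pairs:
    each query is trained to score its own passage (the diagonal of the
    similarity matrix) above every other passage in the batch."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    p = passage_emb / np.linalg.norm(passage_emb, axis=1, keepdims=True)
    sim = scale * (q @ p.T)                     # (batch, batch) similarities
    sim = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    # Cross-entropy with the matching pair (the diagonal) as the target.
    return -np.mean(np.diag(log_softmax))

# Perfectly aligned pairs give a near-zero loss.
aligned = np.eye(3)
loss_aligned = multiple_negatives_ranking_loss(aligned, aligned)
```

Because every other passage in the batch serves as a free negative, this loss makes good use of small paired datasets like the SciFact training queries.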

Results

[Figure: finetuning results on the retrieval task. The gte model was finetuned with different methods and evaluated on the SciFact retrieval task.]

Embedding models are often used off-the-shelf in their general-purpose form, even for specialized domains. However, even this very simple finetuning on a small dataset (10k sentences) already improves retrieval performance slightly.
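The improvement is measured with standard retrieval metrics; MTEB/BEIR-style retrieval tasks such as SciFact typically report nDCG@10 as the main score. A minimal sketch with binary relevance, assuming hypothetical document IDs:

```python
import math

def ndcg_at_k(ranked_doc_ids, relevant_ids, k=10):
    """nDCG@k with binary relevance: discounted gain of the relevant
    documents in the top-k ranking, normalized by the ideal ranking."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, d in enumerate(ranked_doc_ids[:k]) if d in relevant_ids)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(k, len(relevant_ids))))
    return dcg / ideal if ideal > 0 else 0.0

# Relevant doc ranked first vs. second for a single-relevant-doc query.
perfect = ndcg_at_k(["d1", "d2", "d3"], {"d1"})
worse = ndcg_at_k(["d2", "d1", "d3"], {"d1"})
```

Averaging this score over all SciFact test queries, before and after finetuning, quantifies the gains shown in the results figure.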
