A simple semantic-search application: given a text query and a set of images, it passes them through a Visual Language Model to generate text and image embeddings, then returns the most similar image via k-nearest-neighbour search over the image embeddings.
Cool things about this project:
- Switchable Visual Language Model encoders via hf transformers. Currently supporting:
  - All CLIP versions. Tested: "openai/clip-vit-base-patch32"
  - All BLIP versions. Tested: "Salesforce/blip2-opt-2.7b"
  - Quantized CLIP versions for resource-constrained systems via clip.cpp (work in progress)
- Fast vector search on pre-computed embeddings with FAISS (see the sketch below)
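Under the hood, retrieval amounts to embedding everything with the chosen encoder and running k-NN over the image embeddings. Here is a minimal sketch using a CLIP encoder and a flat FAISS index; the function names and image paths are illustrative, not this repo's actual API:

```python
import faiss
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# One of the tested checkpoints from the list above.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(paths):
    # Encode images into L2-normalised CLIP embeddings.
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1).numpy()

def embed_text(query):
    # Encode the text query into the same embedding space.
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1).numpy()

# Illustrative image paths; in practice these come from the user's image set.
paths = ["cat.jpg", "dog.jpg", "car.jpg"]

# Flat inner-product index over the pre-computed image embeddings.
index = faiss.IndexFlatIP(model.config.projection_dim)
index.add(embed_images(paths))

# k-nearest-neighbour search: the top hit is the most similar image.
scores, ids = index.search(embed_text("a photo of a dog"), 1)
print(paths[ids[0][0]], scores[0][0])
```

Note that with L2-normalised embeddings, inner-product search is equivalent to cosine similarity, which is why `IndexFlatIP` is used here.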
You will need a conda environment to install the dependencies. Miniconda is not packaged for apt; install it via the official installer:

```bash
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
```
Create a new conda environment and install the dependencies:

```bash
conda env create --name vlss --file=environment.yml
conda activate vlss
```
Run the demo:

```bash
streamlit run demo.py
```
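For a feel of how the pieces fit together, a Streamlit front end along the lines of demo.py might look roughly like this; the `encoder` module and its helper names are hypothetical, borrowed from the sketch above, not the project's actual code:

```python
import faiss
import streamlit as st

# Hypothetical module exposing the embed_images/embed_text helpers sketched above.
from encoder import embed_images, embed_text

st.title("Visual Language Semantic Search")

# Collect a query string and a set of candidate images from the user.
query = st.text_input("Describe the image you are looking for")
uploads = st.file_uploader(
    "Candidate images", type=["jpg", "png"], accept_multiple_files=True
)

if query and uploads:
    # Embed the uploaded images and index them for k-NN search.
    vectors = embed_images(uploads)
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(vectors)

    # Embed the query and retrieve the single nearest image.
    scores, ids = index.search(embed_text(query), 1)
    best = uploads[ids[0][0]]
    st.image(best, caption=f"Best match (score {scores[0][0]:.3f})")
```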
This project was inspired by my older project on visual place recognition. It wouldn't have been possible without the following open-source libraries: