
Support Alibaba-NLP/gte-large-en-v1.5 on CPU/MPS #375

Open

tmostak opened this issue Aug 8, 2024 · 0 comments

Feature request

We'd like to run the Alibaba-NLP/gte-large-en-v1.5 model on a CPU text-embeddings-router server, but startup fails with the error below.
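For reference, an invocation along these lines reproduces it (`--model-id` and `--port` are standard text-embeddings-router options; any other flags we pass are omitted here):

```shell
text-embeddings-router --model-id Alibaba-NLP/gte-large-en-v1.5 --port 8080
```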

```
Caused by:
Could not start backend: GTE is only supported on Cuda devices in fp16 with flash attention enabled
```

Is there any way to implement/allow this model to run on CPU?
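For context, the model does appear to load and run outside text-embeddings-inference via the plain transformers code path, which could be an interim workaround. A minimal sketch, assuming the CLS-pooling usage from the model card; the example texts and the MPS fallback are illustrative:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_id = "Alibaba-NLP/gte-large-en-v1.5"
# Prefer MPS where available (Apple Silicon), otherwise fall back to CPU.
device = "mps" if torch.backends.mps.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# The checkpoint ships custom modeling code, so trust_remote_code is required.
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).to(device).eval()

texts = ["what is the capital of China?", "how to implement quick sort in python?"]
batch = tokenizer(
    texts, max_length=8192, padding=True, truncation=True, return_tensors="pt"
).to(device)

with torch.no_grad():
    outputs = model(**batch)

# CLS-token pooling followed by L2 normalization, per the model card.
embeddings = F.normalize(outputs.last_hidden_state[:, 0], p=2, dim=1)
print(embeddings.shape)  # torch.Size([2, 1024])
```

This is of course much slower than a fused-attention CUDA backend, but it suggests nothing in the model itself is CUDA-only.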

Motivation

For some of our clients we need to support a CPU embedding server, and we would like to use the Alibaba-NLP/gte-large-en-v1.5 model to take advantage of its long 8192-token context length.

Your contribution

We'd be happy to test and run performance benchmarks if needed.
