v1.0.0 (#168)
OlivierDehaene authored Feb 23, 2024
1 parent f1e50df commit 41b692d
Showing 11 changed files with 44 additions and 44 deletions.
14 changes: 7 additions & 7 deletions Cargo.lock

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion Cargo.toml
@@ -11,7 +11,7 @@ members = [
resolver = "2"

[workspace.package]
version = "0.6.0"
version = "1.0.0"
edition = "2021"
authors = ["Olivier Dehaene"]
homepage = "https://github.com/huggingface/text-embeddings-inference"
28 changes: 14 additions & 14 deletions README.md
@@ -67,10 +67,10 @@ with absolute positions in `text-embeddings-inference`.

Examples of supported models:

-| MTEB Rank | Model Type | Example Model ID |
+| MTEB Rank | Model Type | Model ID |
|-----------|-------------|--------------------------------------------------------------------------------------------------|
| 6 | Bert | [WhereIsAI/UAE-Large-V1](https://hf.co/WhereIsAI/UAE-Large-V1) |
-| 1O | XLM-RoBERTa | [intfloat/multilingual-e5-large-instruct](https://hf.co/intfloat/multilingual-e5-large-instruct) |
+| 10 | XLM-RoBERTa | [intfloat/multilingual-e5-large-instruct](https://hf.co/intfloat/multilingual-e5-large-instruct) |
| N/A | NomicBert | [nomic-ai/nomic-embed-text-v1](https://hf.co/nomic-ai/nomic-embed-text-v1) |
| N/A | NomicBert | [nomic-ai/nomic-embed-text-v1.5](https://hf.co/nomic-ai/nomic-embed-text-v1.5) |
| N/A | JinaBERT | [jinaai/jina-embeddings-v2-base-en](https://hf.co/jinaai/jina-embeddings-v2-base-en) |
@@ -80,7 +80,7 @@ models [here](https://huggingface.co/spaces/mteb/leaderboard).

#### Sequence Classification and Re-Ranking

-`text-embeddings-inference` v0.4.0 added support for CamemBERT, RoBERTa and XLM-RoBERTa Sequence Classification models.
+`text-embeddings-inference` v0.4.0 added support for Bert, CamemBERT, RoBERTa and XLM-RoBERTa Sequence Classification models.

Example of supported sequence classification models:

@@ -97,7 +97,7 @@ model=BAAI/bge-large-en-v1.5
revision=refs/pr/5
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

-docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:0.6 --model-id $model --revision $revision
+docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.0 --model-id $model --revision $revision
```

And then you can make requests like
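The request itself is collapsed in this view; a minimal sketch of such a call, assuming the `/embed` route from `docs/openapi.json` and the `8080:80` port mapping above (the input string is illustrative):

```shell
# POST a single input to the embed route; should return a JSON array of embeddings.
curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs":"What is Deep Learning?"}' \
    -H 'Content-Type: application/json'
```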
@@ -242,13 +242,13 @@ Text Embeddings Inference ships with multiple Docker images that you can use to

| Architecture | Image |
|-------------------------------------|-------------------------------------------------------------------------|
-| CPU                                 | ghcr.io/huggingface/text-embeddings-inference:cpu-0.6                    |
+| CPU                                 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.0                    |
| Volta | NOT SUPPORTED |
-| Turing (T4, RTX 2000 series, ...)   | ghcr.io/huggingface/text-embeddings-inference:turing-0.6 (experimental)  |
-| Ampere 80 (A100, A30)               | ghcr.io/huggingface/text-embeddings-inference:0.6                        |
-| Ampere 86 (A10, A40, ...)           | ghcr.io/huggingface/text-embeddings-inference:86-0.6                     |
-| Ada Lovelace (RTX 4000 series, ...) | ghcr.io/huggingface/text-embeddings-inference:89-0.6                     |
-| Hopper (H100)                       | ghcr.io/huggingface/text-embeddings-inference:hopper-0.6 (experimental)  |
+| Turing (T4, RTX 2000 series, ...)   | ghcr.io/huggingface/text-embeddings-inference:turing-1.0 (experimental)  |
+| Ampere 80 (A100, A30)               | ghcr.io/huggingface/text-embeddings-inference:1.0                        |
+| Ampere 86 (A10, A40, ...)           | ghcr.io/huggingface/text-embeddings-inference:86-1.0                     |
+| Ada Lovelace (RTX 4000 series, ...) | ghcr.io/huggingface/text-embeddings-inference:89-1.0                     |
+| Hopper (H100)                       | ghcr.io/huggingface/text-embeddings-inference:hopper-1.0 (experimental)  |

**Warning**: Flash Attention is turned off by default for the Turing image as it suffers from precision issues.
You can turn Flash Attention v1 ON by using the `USE_FLASH_ATTENTION=True` environment variable.
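A minimal sketch of what that looks like, reusing the `$model` and `$volume` variables from the run command above (the flag simply rides along as a container environment variable):

```shell
# Enable Flash Attention v1 on the Turing image via the documented env var.
docker run --gpus all -e USE_FLASH_ATTENTION=True -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:turing-1.0 --model-id $model
```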
@@ -277,7 +277,7 @@ model=<your private model>
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
token=<your cli READ token>

-docker run --gpus all -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:0.6 --model-id $model
+docker run --gpus all -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.0 --model-id $model
```

### Using Re-rankers models
@@ -295,7 +295,7 @@ model=BAAI/bge-reranker-large
revision=refs/pr/4
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

-docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:0.6 --model-id $model --revision $revision
+docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.0 --model-id $model --revision $revision
```

And then you can rank the similarity between a query and a list of texts with:
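The rerank request is collapsed in this view; a minimal sketch, assuming the `rerank` endpoint referenced in the quick tour and an illustrative query/texts payload:

```shell
# Rank a list of candidate texts against a query; higher scores mean greater similarity.
curl 127.0.0.1:8080/rerank \
    -X POST \
    -d '{"query": "What is Deep Learning?", "texts": ["Deep Learning is not...", "Deep learning is..."]}' \
    -H 'Content-Type: application/json'
```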
@@ -315,7 +315,7 @@ You can also use classic Sequence Classification models like `SamLowe/roberta-base-go_emotions`
model=SamLowe/roberta-base-go_emotions
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

-docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:0.6 --model-id $model
+docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.0 --model-id $model
```

Once you have deployed the model you can use the `predict` endpoint to get the emotions most associated with an input:
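That request is also collapsed here; a minimal sketch against the `predict` endpoint named above, with an illustrative input:

```shell
# Classify a single input; should return the predicted labels with their scores.
curl 127.0.0.1:8080/predict \
    -X POST \
    -d '{"inputs": "I like you."}' \
    -H 'Content-Type: application/json'
```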
@@ -344,7 +344,7 @@ model=BAAI/bge-large-en-v1.5
revision=refs/pr/5
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

-docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:0.6-grpc --model-id $model --revision $revision
+docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.0-grpc --model-id $model --revision $revision
```

```shell
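# The gRPC client call is collapsed in this view. A minimal sketch using
# grpcurl; the tei.v1.Embed/Embed method name is an assumption based on the
# TEI gRPC proto, not something shown on this page:
grpcurl -d '{"inputs": "What is Deep Learning"}' -plaintext 0.0.0.0:8080 tei.v1.Embed/Embed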
```
2 changes: 1 addition & 1 deletion docs/openapi.json
@@ -9,7 +9,7 @@
"license": {
"name": "HFOIL"
},
"version": "0.6.0"
"version": "1.0.0"
},
"paths": {
"/embed": {
6 changes: 3 additions & 3 deletions docs/source/en/index.md
@@ -23,12 +23,12 @@ TEI offers multiple features tailored to optimize the deployment process and enh

**Key Features:**

-* **Streamlined Deployment:** TEI eliminates the need for a model graph compilation step for a more efficient deployment process.
+* **Streamlined Deployment:** TEI eliminates the need for a model graph compilation step for an easier deployment process.
* **Efficient Resource Utilization:** Benefit from small Docker images and rapid boot times, allowing for true serverless capabilities.
* **Dynamic Batching:** TEI incorporates token-based dynamic batching thus optimizing resource utilization during inference.
* **Optimized Inference:** TEI leverages [Flash Attention](https://github.com/HazyResearch/flash-attention), [Candle](https://github.com/huggingface/candle), and [cuBLASLt](https://docs.nvidia.com/cuda/cublas/#using-the-cublaslt-api) by using optimized transformers code for inference.
-* **Safetensors weight loading:** TEI loads [Safetensors](https://github.com/huggingface/safetensors) weights to enable tensor parallelism.
-* **Production-Ready:** TEI supports distributed tracing through Open Telemetry and Prometheus metrics.
+* **Safetensors weight loading:** TEI loads [Safetensors](https://github.com/huggingface/safetensors) weights for faster boot times.
+* **Production-Ready:** TEI supports distributed tracing through Open Telemetry and exports Prometheus metrics.

**Benchmarks**

2 changes: 1 addition & 1 deletion docs/source/en/local_cpu.md
@@ -20,7 +20,7 @@ You can install `text-embeddings-inference` locally to run it on your own machine

## Step 1: Install Rust

-[Install Rust]((https://rustup.rs/) on your machine by run the following in your terminal, then following the instructions:
+[Install Rust](https://rustup.rs/) on your machine by run the following in your terminal, then following the instructions:

```shell
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
2 changes: 1 addition & 1 deletion docs/source/en/local_gpu.md
@@ -31,7 +31,7 @@ export PATH=$PATH:/usr/local/cuda/bin

## Step 2: Install Rust

-[Install Rust]((https://rustup.rs/) on your machine by run the following in your terminal, then following the instructions:
+[Install Rust](https://rustup.rs/) on your machine by run the following in your terminal, then following the instructions:

```shell
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
2 changes: 1 addition & 1 deletion docs/source/en/local_metal.md
@@ -21,7 +21,7 @@ Here are the step-by-step instructions for installation:

## Step 1: Install Rust

-[Install Rust]((https://rustup.rs/) on your machine by run the following in your terminal, then following the instructions:
+[Install Rust](https://rustup.rs/) on your machine by run the following in your terminal, then following the instructions:

```shell
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
2 changes: 1 addition & 1 deletion docs/source/en/private_models.md
@@ -37,5 +37,5 @@ model=<your private model>
volume=$PWD/data
token=<your cli Hugging Face Hub token>

-docker run --gpus all -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:0.6 --model-id $model
+docker run --gpus all -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.0 --model-id $model
```
6 changes: 3 additions & 3 deletions docs/source/en/quick_tour.md
@@ -34,7 +34,7 @@ model=BAAI/bge-large-en-v1.5
revision=refs/pr/5
volume=$PWD/data

-docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:0.6 --model-id $model --revision $revision
+docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.0 --model-id $model --revision $revision
```

<Tip>
@@ -69,7 +69,7 @@ model=BAAI/bge-reranker-large
revision=refs/pr/4
volume=$PWD/data

-docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:0.6 --model-id $model --revision $revision
+docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.0 --model-id $model --revision $revision
```

Once you have deployed a model you can use the `rerank` endpoint to rank the similarity between a query and a list
@@ -90,7 +90,7 @@ You can also use classic Sequence Classification models like `SamLowe/roberta-base-go_emotions`
model=SamLowe/roberta-base-go_emotions
volume=$PWD/data

-docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:0.6 --model-id $model
+docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.0 --model-id $model
```

Once you have deployed the model you can use the `predict` endpoint to get the emotions most associated with an input:
22 changes: 11 additions & 11 deletions docs/source/en/supported_models.md
@@ -25,10 +25,10 @@ model with Alibi positions.

Below are some examples of the currently supported models:

-| MTEB Rank | Model Type | Example Model ID |
+| MTEB Rank | Model Type | Model ID |
|-----------|-------------|--------------------------------------------------------------------------------------------------|
| 6 | Bert | [WhereIsAI/UAE-Large-V1](https://hf.co/WhereIsAI/UAE-Large-V1) |
-| 1O | XLM-RoBERTa | [intfloat/multilingual-e5-large-instruct](https://hf.co/intfloat/multilingual-e5-large-instruct) |
+| 10 | XLM-RoBERTa | [intfloat/multilingual-e5-large-instruct](https://hf.co/intfloat/multilingual-e5-large-instruct) |
| N/A | NomicBert | [nomic-ai/nomic-embed-text-v1](https://hf.co/nomic-ai/nomic-embed-text-v1) |
| N/A | NomicBert | [nomic-ai/nomic-embed-text-v1.5](https://hf.co/nomic-ai/nomic-embed-text-v1.5) |
| N/A | JinaBERT | [jinaai/jina-embeddings-v2-base-en](https://hf.co/jinaai/jina-embeddings-v2-base-en) |
@@ -61,15 +61,15 @@ NVIDIA drivers with CUDA version 12.2 or higher.

Find the appropriate Docker image for your hardware in the following table:

-| Architecture | Image |
-|-------------------------------------|---------------------------------------------------------------------------|
-| CPU | ghcr.io/huggingface/text-embeddings-inference:cpu-0.6 |
-| Volta | NOT SUPPORTED |
-| Turing (T4, RTX 2000 series, ...) | ghcr.io/huggingface/text-embeddings-inference:turing-0.6 (experimental) |
-| Ampere 80 (A100, A30) | ghcr.io/huggingface/text-embeddings-inference:0.6 |
-| Ampere 86 (A10, A40, ...) | ghcr.io/huggingface/text-embeddings-inference:86-0.6 |
-| Ada Lovelace (RTX 4000 series, ...) | ghcr.io/huggingface/text-embeddings-inference:89-0.6 |
-| Hopper (H100) | ghcr.io/huggingface/text-embeddings-inference:hopper-0.4.0 (experimental) |
+| Architecture                        | Image                                                                      |
+|-------------------------------------|----------------------------------------------------------------------------|
+| CPU                                 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.0                     |
+| Volta                               | NOT SUPPORTED                                                              |
+| Turing (T4, RTX 2000 series, ...)   | ghcr.io/huggingface/text-embeddings-inference:turing-1.0 (experimental)   |
+| Ampere 80 (A100, A30)               | ghcr.io/huggingface/text-embeddings-inference:1.0                         |
+| Ampere 86 (A10, A40, ...)           | ghcr.io/huggingface/text-embeddings-inference:86-1.0                      |
+| Ada Lovelace (RTX 4000 series, ...) | ghcr.io/huggingface/text-embeddings-inference:89-1.0                      |
+| Hopper (H100)                       | ghcr.io/huggingface/text-embeddings-inference:hopper-1.0 (experimental)   |

**Warning**: Flash Attention is turned off by default for the Turing image as it suffers from precision issues.
You can turn Flash Attention v1 ON by using the `USE_FLASH_ATTENTION=True` environment variable.
