Fix README.md typo #1124

Open · wants to merge 9 commits into master
20 changes: 10 additions & 10 deletions FlagEmbedding/llm_dense_retriever/README.md
@@ -11,10 +11,10 @@
- [x] Checkpoint
- [x] Training Data
- [x] Training Code
- [x] Technical Report
- [ ] Evaluation Pipeline
- [ ] Technical Report

We will release the technical report for **BGE-EN-ICL** in the future.
The technical report for **BGE-EN-ICL** can be found in [Making Text Embedders Few-Shot Learners](https://arxiv.org/abs/2409.15700).

## Environment
```bash
@@ -39,7 +39,7 @@ pip install flash-attn --no-build-isolation

| Data | Introduction |
| ------------------------------------------------------------ | ------------------------------------------------------------ |
| [e5-data](https://huggingface.co/datasets/cfli/bge-e5data) | Public data identical to [e5-mistral](https://huggingface.co/intfloat/e5-mistral-7b-instruct) |
| [public-data](https://huggingface.co/datasets/cfli/bge-e5data) | Public data identical to [e5-mistral](https://huggingface.co/intfloat/e5-mistral-7b-instruct) |
| [full-data](https://huggingface.co/datasets/cfli/bge-full-data) | The full dataset we used for training |

## Usage
@@ -219,13 +219,13 @@ If you find this repository useful, please give us a star ⭐.
To cite our work:

```
@misc{li2023makinglargelanguagemodels,
title={Making Large Language Models A Better Foundation For Dense Retrieval},
author={Chaofan Li and Zheng Liu and Shitao Xiao and Yingxia Shao},
year={2023},
eprint={2312.15503},
@misc{li2024makingtextembeddersfewshot,
title={Making Text Embedders Few-Shot Learners},
author={Chaofan Li and MingHao Qin and Shitao Xiao and Jianlyu Chen and Kun Luo and Yingxia Shao and Defu Lian and Zheng Liu},
year={2024},
eprint={2409.15700},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2312.15503},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2409.15700},
}
```
5 changes: 3 additions & 2 deletions FlagEmbedding/visual/modeling.py
@@ -100,6 +100,7 @@ def __init__(self,
self.to(self.device)
else:
self.device = torch.device('cpu')
self.dtype = next(bge.parameters()).dtype

def load_model(self, model_weight):
self.load_state_dict(torch.load(model_weight, map_location='cpu'))
@@ -191,7 +192,7 @@ def encode_text(self, texts):
token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)

head_mask = [None] * self.depth
extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape)
extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape).to(self.dtype)

embedding_output = self.bge_embeddings(
input_ids=input_ids,
@@ -270,7 +271,7 @@ def encode_mm(self, images:torch.Tensor, texts):
prom_img_input_shape = prompt_img_embedding.size()

head_mask = [None] * self.depth
extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(prom_img_attention_mask, prom_img_input_shape).to(prompt_img_embedding.dtype)
extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(prom_img_attention_mask, prom_img_input_shape).to(self.dtype)


encoder_outputs = self.bge_encoder(
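For context on this change: the fix caches the wrapped model's parameter dtype once at construction, then casts the extended attention mask to it, so an fp32 mask never meets fp16/bf16 activations inside the encoder. A minimal sketch of the pattern (the wrapper class and mask helper below are illustrative, not the repo's actual code):

```python
import torch
import torch.nn as nn

class VisualWrapper(nn.Module):
    """Illustrative stand-in for the patched class, not the repo's code."""
    def __init__(self, bge: nn.Module):
        super().__init__()
        self.bge = bge
        # Cache the checkpoint's dtype (e.g., torch.float16) once at init.
        self.dtype = next(bge.parameters()).dtype

    def extended_mask(self, attention_mask: torch.Tensor) -> torch.Tensor:
        # Broadcast [batch, seq] -> [batch, 1, 1, seq], cast to the model
        # dtype, and map padded positions to a large negative value.
        mask = attention_mask[:, None, None, :].to(self.dtype)
        return (1.0 - mask) * torch.finfo(self.dtype).min
```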
2 changes: 1 addition & 1 deletion README.md
@@ -43,7 +43,7 @@ FlagEmbedding focuses on retrieval-augmented LLMs, consisting of the following p
- **Benchmark**: [C-MTEB](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB), [AIR-Bench](https://github.com/AIR-Bench/AIR-Bench), [MLVU](https://github.com/JUNJIE99/MLVU)

## News
- 9/10/2014: Introducing **MemoRAG**, a step forward towards RAG 2.0 on top of memory-inspired knowledge discovery (repo: https://github.com/qhjqhj00/MemoRAG, paper: https://arxiv.org/pdf/2409.05591v1) :fire:
- 9/10/2024: Introducing **MemoRAG**, a step forward towards RAG 2.0 on top of memory-inspired knowledge discovery (repo: https://github.com/qhjqhj00/MemoRAG, paper: https://arxiv.org/pdf/2409.05591v1) :fire:
- 9/2/2024: Started maintaining the [tutorials](./Tutorials/). The contents within will be actively updated and enriched, stay tuned! :books:
- 7/26/2024: Release a new embedding model [bge-en-icl](https://huggingface.co/BAAI/bge-en-icl), an embedding model that incorporates in-context learning capabilities, which, by providing task-relevant query-response examples, can encode semantically richer queries, further enhancing the semantic representation ability of the embeddings. :fire:
- 7/26/2024: Release a new embedding model [bge-multilingual-gemma2](https://huggingface.co/BAAI/bge-multilingual-gemma2), a multilingual embedding model based on gemma-2-9b, which supports multiple languages and diverse downstream tasks, achieving new SOTA on multilingual benchmarks (MIRACL, MTEB-fr, and MTEB-pl). :fire:
44 changes: 22 additions & 22 deletions Tutorials/2_Similarity/similarity.ipynb
@@ -187,7 +187,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
"tensor([[3., 2., 3., 6.]]) tensor([[2., 4., 4., 1.]])\n"
"tensor([[5., 2., 2., 6.]]) tensor([[4., 6., 6., 4.]])\n"
]
}
],
@@ -239,7 +239,7 @@
{
"data": {
"text/plain": [
"5.5677642822265625"
"6.082762718200684"
]
},
"execution_count": 6,
@@ -273,7 +273,7 @@
{
"data": {
"text/plain": [
"5.5677642822265625"
"6.082762718200684"
]
},
"execution_count": 7,
@@ -382,7 +382,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
"0.6907725930213928\n"
"0.802726686000824\n"
]
}
],
@@ -439,7 +439,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
"0.6907725930213928\n"
"0.802726686000824\n"
]
}
],
@@ -487,7 +487,7 @@
{
"data": {
"text/plain": [
"0.6907725930213928"
"0.802726686000824"
]
},
"execution_count": 11,
@@ -530,7 +530,7 @@
{
"data": {
"text/plain": [
"32.0"
"68.0"
]
},
"execution_count": 12,
@@ -563,31 +563,31 @@
},
{
"cell_type": "code",
"execution_count": 20,
"execution_count": 14,
"id": "e0f40534",
"metadata": {},
"outputs": [],
"source": [
"from FlagEmbedding import FlagModel\n",
"\n",
"model = FlagModel('BAAI/bge-base-en-v1.5',\n",
"model = FlagModel('BAAI/bge-large-en-v1.5',\n",
" query_instruction_for_retrieval=\"Represent this sentence for searching relevant passages:\",\n",
" use_fp16=True)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"execution_count": 15,
"id": "78445a86",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.9999999403953552"
"1.0"
]
},
"execution_count": 21,
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
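The cells above switch the checkpoint from bge-base-en-v1.5 to bge-large-en-v1.5, whose embeddings are 1024-dimensional (hence the `torch.Size([1, 1024])` shown further down). A usage sketch with the same constructor arguments (the example sentences are made up, not from the notebook):

```python
from FlagEmbedding import FlagModel

model = FlagModel('BAAI/bge-large-en-v1.5',
                  query_instruction_for_retrieval="Represent this sentence for searching relevant passages:",
                  use_fp16=True)

# bge embeddings come back L2-normalized, so a dot product is cosine similarity;
# a sentence compared against itself therefore scores 1.0, as in the cell above.
emb_1 = model.encode("The weather is lovely today.")
emb_2 = model.encode("It's sunny and pleasant outside.")
print(emb_1 @ emb_2)
```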
@@ -616,7 +616,7 @@
},
{
"cell_type": "code",
"execution_count": 22,
"execution_count": 16,
"id": "73012cbb",
"metadata": {},
"outputs": [],
@@ -648,7 +648,7 @@
},
{
"cell_type": "code",
"execution_count": 23,
"execution_count": 17,
"id": "98bfcc6d",
"metadata": {},
"outputs": [
@@ -684,15 +684,15 @@
},
{
"cell_type": "code",
"execution_count": 24,
"execution_count": 18,
"id": "426c0b42",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"torch.Size([1, 768])\n"
"torch.Size([1, 1024])\n"
]
}
],
@@ -715,16 +715,16 @@
},
{
"cell_type": "code",
"execution_count": 25,
"execution_count": 19,
"id": "d9bb35cf",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.49946388602256775\n",
"0.6032702922821045\n"
"0.714613139629364\n",
"0.5931472182273865\n"
]
}
],
@@ -745,16 +745,16 @@
},
{
"cell_type": "code",
"execution_count": 26,
"execution_count": 20,
"id": "29e70bbc",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.8752679824829102\n",
"0.8180326223373413\n"
"0.7446640729904175\n",
"0.8240882158279419\n"
]
}
],
@@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Indexing"
"# Indexing Using Faiss"
]
},
{
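The retitled notebook covers indexing with Faiss. As a rough sketch of what the new title implies (the dimension and the random stand-in data are assumptions, not taken from the notebook): build a flat inner-product index over L2-normalized embeddings, where inner product coincides with cosine similarity.

```python
import faiss
import numpy as np

dim = 1024  # e.g., the bge-large-en-v1.5 embedding size used above
corpus_embeddings = np.random.rand(100, dim).astype('float32')  # stand-in data
faiss.normalize_L2(corpus_embeddings)

index = faiss.IndexFlatIP(dim)  # exact search over inner products
index.add(corpus_embeddings)

query = np.random.rand(1, dim).astype('float32')
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)  # top-5 neighbors by cosine similarity
```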