Commit
* LLM QA Bot with Langchain and Chroma
* Add LLM finetuning
* Added commentary, eliminated tracer, made appropriate edits for wandb.login, environment variables, etc.
* Made suggested edits for wandb.login and commentary for HF integration
* Fixed import and added WandbTracer from Langchain
* Delete W&B_LLM_QA_Bot.ipynb - wrong path
* Fixed import path and added correct WandbTracer
* Add tracker
* Rename and delete duplicate
* Clean up
---------
Co-authored-by: Thomas Capelle <tcapelle@pm.me>
Showing 2 changed files with 1,000 additions and 0 deletions.
@@ -0,0 +1,379 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"<a href=\"https://colab.research.google.com/github/wandb/examples/blob/master/colabs/huggingface/LLM_Finetuning_Notebook.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n", | ||
"<!--- @wandbcode{llm-finetune-hf} -->" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# LLM Finetuning with HuggingFace and Weights and Biases\n", | ||
"<!--- @wandbcode{llm-finetune-hf} -->\n", | ||
"- Fine-tune a lightweight LLM (OPT-125M) with LoRA and 8-bit quantization using Launch\n", | ||
"- Checkpoint the LoRA adapter weights as artifacts\n", | ||
"- Link the best checkpoint in Model Registry\n", | ||
"- Run inference on a quantized model\n", | ||
"\n", | ||
"The same workflow and principles from this notebook can be applied to fine-tuning some of the stronger OSS LLMs (e.g. Llama2)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Fine-tune large models using 🤗 `peft` adapters, `transformers` & `bitsandbytes`\n", | ||
"\n", | ||
"In this tutorial we will cover how we can fine-tune large language models using the very recent `peft` library and `bitsandbytes` for loading large models in 8-bit.\n", | ||
"The fine-tuning method will rely on a recent method called \"Low Rank Adapters\" (LoRA), instead of fine-tuning the entire model you just have to fine-tune these adapters and load them properly inside the model.\n", | ||
"After fine-tuning the model you can also share your adapters on the 🤗 Hub and load them very easily. Let's get started!" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Install requirements\n", | ||
"\n", | ||
"First, run the cells below to install the requirements:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"!pip install -q bitsandbytes datasets accelerate loralib\n", | ||
"!pip install -q git+https://github.com/huggingface/transformers.git@main git+https://github.com/huggingface/peft.git\n", | ||
"!pip install -q wandb\n", | ||
"!pip install -q ctranslate2" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Model Loading\n", | ||
"\n", | ||
"- Here we leverage 8-bit quantization to reduce the memory footprint of the model during training" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import os\n", | ||
"os.environ[\"CUDA_VISIBLE_DEVICES\"]=\"0\"\n", | ||
"import torch\n", | ||
"import torch.nn as nn\n", | ||
"import bitsandbytes as bnb\n", | ||
"from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM\n", | ||
"\n", | ||
"model = AutoModelForCausalLM.from_pretrained(\n", | ||
" \"facebook/opt-125m\",\n", | ||
" load_in_8bit=True,\n", | ||
" device_map='auto',\n", | ||
")\n", | ||
"\n", | ||
"tokenizer = AutoTokenizer.from_pretrained(\"facebook/opt-125m\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Post-processing on the model\n", | ||
"\n", | ||
"Finally, we need to apply some post-processing on the 8-bit model to enable training, let's freeze all our layers, and cast the layer-norm in `float32` for stability. We also cast the output of the last layer in `float32` for the same reasons." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"for param in model.parameters():\n", | ||
" param.requires_grad = False # freeze the model - train adapters later\n", | ||
" if param.ndim == 1:\n", | ||
" # cast the small parameters (e.g. layernorm) to fp32 for stability\n", | ||
" param.data = param.data.to(torch.float32)\n", | ||
"\n", | ||
"model.gradient_checkpointing_enable() # reduce number of stored activations\n", | ||
"model.enable_input_require_grads()\n", | ||
"\n", | ||
"class CastOutputToFloat(nn.Sequential):\n", | ||
" def forward(self, x): return super().forward(x).to(torch.float32)\n", | ||
"model.lm_head = CastOutputToFloat(model.lm_head)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Apply LoRA\n", | ||
"\n", | ||
"Here comes the magic with `peft`! Let's load a `PeftModel` and specify that we are going to use low-rank adapters (LoRA) using `get_peft_model` utility function from `peft`." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"def print_trainable_parameters(model):\n", | ||
" \"\"\"\n", | ||
" Prints the number of trainable parameters in the model.\n", | ||
" \"\"\"\n", | ||
" trainable_params = 0\n", | ||
" all_param = 0\n", | ||
" for _, param in model.named_parameters():\n", | ||
" all_param += param.numel()\n", | ||
" if param.requires_grad:\n", | ||
" trainable_params += param.numel()\n", | ||
" print(\n", | ||
" f\"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}\"\n", | ||
" )" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from peft import LoraConfig, get_peft_model\n", | ||
"\n", | ||
"config = LoraConfig(\n", | ||
" r=16,\n", | ||
" lora_alpha=32,\n", | ||
" target_modules=[\"q_proj\", \"v_proj\"],\n", | ||
" lora_dropout=0.05,\n", | ||
" bias=\"none\",\n", | ||
" task_type=\"CAUSAL_LM\"\n", | ||
")\n", | ||
"\n", | ||
"model = get_peft_model(model, config)\n", | ||
"print_trainable_parameters(model)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"model" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Training\n", | ||
"- [W&B HuggingFace integration](https://docs.wandb.ai/guides/integrations/huggingface) automatically tracks important metrics during the course of training\n", | ||
"- Also track the HF checkpoints as artifacts and register them in the model registry!\n", | ||
"- Change the number of steps to 200+ for real results!" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import transformers\n", | ||
"from datasets import load_dataset\n", | ||
"import wandb\n", | ||
"\n", | ||
"project_name = \"llm-finetuning\" #@param\n", | ||
"entity = \"wandb\" #@param\n", | ||
"os.environ[\"WANDB_LOG_MODEL\"] = \"checkpoint\"\n", | ||
"\n", | ||
"wandb.init(project=project_name,\n", | ||
" entity=entity,\n", | ||
" job_type=\"training\")\n", | ||
"\n", | ||
"data = load_dataset(\"Abirate/english_quotes\")\n", | ||
"data = data.map(lambda samples: tokenizer(samples['quote']), batched=True)\n", | ||
"\n", | ||
"trainer = transformers.Trainer(\n", | ||
" model=model,\n", | ||
" train_dataset=data['train'],\n", | ||
" args=transformers.TrainingArguments(\n", | ||
" per_device_train_batch_size=4,\n", | ||
" gradient_accumulation_steps=4,\n", | ||
" report_to=\"wandb\",\n", | ||
" warmup_steps=5,\n", | ||
" max_steps=25,\n", | ||
" learning_rate=2e-4,\n", | ||
" fp16=True,\n", | ||
" logging_steps=1,\n", | ||
" save_steps=5,\n", | ||
" output_dir='outputs'\n", | ||
" ),\n", | ||
" data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)\n", | ||
")\n", | ||
"model.config.use_cache = False # silence the warnings. Please re-enable for inference!\n", | ||
"trainer.train()\n", | ||
"wandb.finish()" | ||
] | ||
}, | ||
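{
"cell_type": "markdown",
"metadata": {},
"source": [
"### (Optional) Share the LoRA adapter on the 🤗 Hub\n",
"As mentioned in the introduction, the trained adapter can be shared on the Hub and reloaded later. The cell below is a minimal, commented-out sketch: the repo name `your-username/opt-125m-lora-english-quotes` is a placeholder, and it assumes you are authenticated via `huggingface-cli login` or an `HF_TOKEN` environment variable."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Push only the LoRA adapter weights (not the full base model) to the Hub:\n",
"# model.push_to_hub(\"your-username/opt-125m-lora-english-quotes\")\n",
"# tokenizer.push_to_hub(\"your-username/opt-125m-lora-english-quotes\")\n",
"\n",
"# Later, reload the adapter on top of the base model:\n",
"# from peft import PeftModel\n",
"# base_model = AutoModelForCausalLM.from_pretrained(\"facebook/opt-125m\")\n",
"# model = PeftModel.from_pretrained(base_model, \"your-username/opt-125m-lora-english-quotes\")"
]
},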
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Adding Model Weights to W&B Model Registry\n", | ||
"- Here we get our best checkpoint from the finetuning run and register it as our best model" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"last_run_id = \"zz0lxkc8\" #@param\n", | ||
"wandb.init(project=project_name, entity=entity, job_type=\"registering_best_model\")\n", | ||
"best_model = wandb.use_artifact(f'{entity}/{project_name}/checkpoint-{last_run_id}:latest')\n", | ||
"registered_model_name = \"OPT-125M-english\" #@param {type: \"string\"}\n", | ||
"wandb.run.link_artifact(best_model, f'{entity}/model-registry/{registered_model_name}', aliases=['staging'])\n", | ||
"wandb.finish()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Consuming Model From Registry and Quantizing using ctranslate2\n", | ||
"- LLMs are typically too large to run in full-precision on even decent hardware.\n", | ||
"- You can quantize the model to run it more efficiently with minimal loss in accuracy.\n", | ||
" - CTranslate2 is a great first pass at quantization but doesn't do \"smart\" quantization. It just converts all weights to half precision.\n", | ||
" - Checkout out GPTQ and AutoGPTQ for SOTA quantization at scale" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Pull model from the registry\n", | ||
"\n", | ||
"wandb.init(project=project_name, entity=entity, job_type=\"ctranslate2\")\n", | ||
"best_model = wandb.use_artifact(f'{entity}/model-registry/{registered_model_name}:latest')\n", | ||
"best_model.download(root=f'model-registry/{registered_model_name}:latest')\n", | ||
"wandb.finish()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from peft import PeftModel, PeftConfig\n", | ||
"\n", | ||
"def convert_qlora2ct2(adapter_path=f'model-registry/{registered_model_name}:latest',\n", | ||
" full_model_path=\"opt125m-finetuned\",\n", | ||
" offload_path=\"opt125m-offload\",\n", | ||
" ct2_path=\"opt125m-finetuned-ct2\",\n", | ||
" quantization=\"int8\"):\n", | ||
"\n", | ||
"\n", | ||
" peft_model_id = adapter_path\n", | ||
" peftconfig = PeftConfig.from_pretrained(peft_model_id)\n", | ||
"\n", | ||
" model = AutoModelForCausalLM.from_pretrained(\n", | ||
" \"facebook/opt-125m\",\n", | ||
" offload_folder = offload_path,\n", | ||
" device_map='auto',\n", | ||
" )\n", | ||
"\n", | ||
" tokenizer = AutoTokenizer.from_pretrained(\"facebook/opt-125m\")\n", | ||
"\n", | ||
" model = PeftModel.from_pretrained(model, peft_model_id)\n", | ||
"\n", | ||
" print(\"Peft model loaded\")\n", | ||
"\n", | ||
" merged_model = model.merge_and_unload()\n", | ||
"\n", | ||
" merged_model.save_pretrained(full_model_path)\n", | ||
" tokenizer.save_pretrained(full_model_path)\n", | ||
"\n", | ||
" if quantization == False:\n", | ||
" os.system(f\"ct2-transformers-converter --model {full_model_path} --output_dir {ct2_path} --force\")\n", | ||
" else:\n", | ||
" os.system(f\"ct2-transformers-converter --model {full_model_path} --output_dir {ct2_path} --quantization {quantization} --force\")\n", | ||
" print(\"Convert successfully\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"convert_qlora2ct2(adapter_path=f'model-registry/{registered_model_name}:latest')" | ||
] | ||
}, | ||
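{
"cell_type": "markdown",
"metadata": {},
"source": [
"### (Optional) GPTQ-style quantization with AutoGPTQ\n",
"As noted above, GPTQ/AutoGPTQ perform calibration-based quantization rather than the simple weight conversion CTranslate2 does. The cell below is an illustrative, commented-out sketch only: it assumes `pip install auto-gptq`, uses a single toy calibration example, and the output directory name is a placeholder."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig\n",
"#\n",
"# # A real run would use a representative calibration set, not a single sentence.\n",
"# examples = [tokenizer(\"Quantization calibration example sentence.\", return_tensors=\"pt\")]\n",
"#\n",
"# quantize_config = BaseQuantizeConfig(bits=4, group_size=128)\n",
"# gptq_model = AutoGPTQForCausalLM.from_pretrained(\"opt125m-finetuned\", quantize_config)\n",
"# gptq_model.quantize(examples)\n",
"# gptq_model.save_quantized(\"opt125m-finetuned-gptq\")"
]
},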
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Run Inference Using Quantized CTranslate2 Model\n", | ||
"- Record the results in a W&B Table!" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import ctranslate2\n", | ||
"\n", | ||
"\n", | ||
"run = wandb.init(project=project_name, entity=entity, job_type=\"inference\")\n", | ||
"generator = ctranslate2.Generator(\"opt125m-finetuned-ct2\")\n", | ||
"\n", | ||
"prompts = [\"Hey, are you conscious? Can you talk to me?\",\n", | ||
" \"What is machine learning?\",\n", | ||
" \"What is W&B?\"]\n", | ||
"\n", | ||
"\n", | ||
"wandb_table = wandb.Table(columns=['prompt', 'completion'])\n", | ||
"for prompt in prompts:\n", | ||
" start_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))\n", | ||
" results = generator.generate_batch([start_tokens], max_length=30)\n", | ||
" output = tokenizer.decode(results[0].sequences_ids[0])\n", | ||
" wandb_table.add_data(prompt, output)\n", | ||
"\n", | ||
"wandb.log({\"inference_table\": wandb_table})\n", | ||
"wandb.finish()" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"accelerator": "GPU", | ||
"colab": { | ||
"include_colab_link": true, | ||
"provenance": [], | ||
"toc_visible": true | ||
}, | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"name": "python3" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 0 | ||
} |