Commit

Ken add llm examples (#461)
* LLM QA Bot with Langchain and Chroma

* Add LLM finetuning

* Added commentary, eliminated tracer, made appropriate edits for wandb.login, environment variables, etc.

* Made suggested edits for wandb.login and commentary for HF integration

* fixed import and added WandBTracer from Langchain

* Delete W&B_LLM_QA_Bot.ipynb - wrong path

* Fixed import path and added correct WandbTracer

* add tracker

* rename and delete duplicate

* clean up

---------

Co-authored-by: Thomas Capelle <tcapelle@pm.me>
kenleejr and tcapelle authored Sep 18, 2023
1 parent 2d51cd6 commit 6e9aab7
Showing 2 changed files with 1,000 additions and 0 deletions.
379 changes: 379 additions & 0 deletions colabs/huggingface/LLM_Finetuning_Notebook.ipynb
@@ -0,0 +1,379 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a href=\"https://colab.research.google.com/github/wandb/examples/blob/master/colabs/huggingface/LLM_Finetuning_Notebook.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
"<!--- @wandbcode{llm-finetune-hf} -->"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# LLM Finetuning with HuggingFace and Weights and Biases\n",
"<!--- @wandbcode{llm-finetune-hf} -->\n",
"- Fine-tune a lightweight LLM (OPT-125M) with LoRA and 8-bit quantization using Launch\n",
"- Checkpoint the LoRA adapter weights as artifacts\n",
"- Link the best checkpoint in Model Registry\n",
"- Run inference on a quantized model\n",
"\n",
"The same workflow and principles from this notebook can be applied to fine-tuning some of the stronger OSS LLMs (e.g. Llama2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fine-tune large models using 🤗 `peft` adapters, `transformers` & `bitsandbytes`\n",
"\n",
"In this tutorial we will cover how we can fine-tune large language models using the very recent `peft` library and `bitsandbytes` for loading large models in 8-bit.\n",
"The fine-tuning method will rely on a recent method called \"Low Rank Adapters\" (LoRA), instead of fine-tuning the entire model you just have to fine-tune these adapters and load them properly inside the model.\n",
"After fine-tuning the model you can also share your adapters on the 🤗 Hub and load them very easily. Let's get started!"
]
},
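{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the LoRA idea concrete before we set anything up, here is a tiny standalone sketch in plain `torch` (hypothetical sizes, not part of the fine-tuning pipeline below): the pretrained weight `W` stays frozen and only a small low-rank update `B @ A` is trained."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Standalone sketch of the low-rank adapter idea (hypothetical sizes, not used later)\n",
"import torch\n",
"\n",
"d, r = 768, 16                 # hidden size and adapter rank (r << d)\n",
"W = torch.randn(d, d)          # frozen pretrained weight\n",
"A = torch.randn(r, d) * 0.01   # adapter matrices: only these would be trained\n",
"B = torch.zeros(d, r)          # B starts at zero, so the initial update is zero\n",
"\n",
"x = torch.randn(1, d)\n",
"y = x @ (W + B @ A).T          # adapted forward pass; W itself is never modified\n",
"\n",
"full, adapter = W.numel(), A.numel() + B.numel()\n",
"print(f\"full params: {full} | adapter params: {adapter} ({100 * adapter / full:.2f}%)\")"
]
},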
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Install requirements\n",
"\n",
"First, run the cells below to install the requirements:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install -q bitsandbytes datasets accelerate loralib\n",
"!pip install -q git+https://github.com/huggingface/transformers.git@main git+https://github.com/huggingface/peft.git\n",
"!pip install -q wandb\n",
"!pip install -q ctranslate2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Model Loading\n",
"\n",
"- Here we leverage 8-bit quantization to reduce the memory footprint of the model during training"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"os.environ[\"CUDA_VISIBLE_DEVICES\"]=\"0\"\n",
"import torch\n",
"import torch.nn as nn\n",
"import bitsandbytes as bnb\n",
"from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM\n",
"\n",
"model = AutoModelForCausalLM.from_pretrained(\n",
" \"facebook/opt-125m\",\n",
" load_in_8bit=True,\n",
" device_map='auto',\n",
")\n",
"\n",
"tokenizer = AutoTokenizer.from_pretrained(\"facebook/opt-125m\")"
]
},
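{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check, you can print how much memory the 8-bit model occupies (this assumes a `transformers` version that provides `get_memory_footprint`, which recent releases do)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: report the model's memory footprint (assumes transformers provides get_memory_footprint)\n",
"print(f\"Memory footprint: {model.get_memory_footprint() / 1e6:.1f} MB\")"
]
},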
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Post-processing on the model\n",
"\n",
"Finally, we need to apply some post-processing on the 8-bit model to enable training, let's freeze all our layers, and cast the layer-norm in `float32` for stability. We also cast the output of the last layer in `float32` for the same reasons."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for param in model.parameters():\n",
" param.requires_grad = False # freeze the model - train adapters later\n",
" if param.ndim == 1:\n",
" # cast the small parameters (e.g. layernorm) to fp32 for stability\n",
" param.data = param.data.to(torch.float32)\n",
"\n",
"model.gradient_checkpointing_enable() # reduce number of stored activations\n",
"model.enable_input_require_grads()\n",
"\n",
"class CastOutputToFloat(nn.Sequential):\n",
" def forward(self, x): return super().forward(x).to(torch.float32)\n",
"model.lm_head = CastOutputToFloat(model.lm_head)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Apply LoRA\n",
"\n",
"Here comes the magic with `peft`! Let's load a `PeftModel` and specify that we are going to use low-rank adapters (LoRA) using `get_peft_model` utility function from `peft`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def print_trainable_parameters(model):\n",
" \"\"\"\n",
" Prints the number of trainable parameters in the model.\n",
" \"\"\"\n",
" trainable_params = 0\n",
" all_param = 0\n",
" for _, param in model.named_parameters():\n",
" all_param += param.numel()\n",
" if param.requires_grad:\n",
" trainable_params += param.numel()\n",
" print(\n",
" f\"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}\"\n",
" )"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from peft import LoraConfig, get_peft_model\n",
"\n",
"config = LoraConfig(\n",
" r=16,\n",
" lora_alpha=32,\n",
" target_modules=[\"q_proj\", \"v_proj\"],\n",
" lora_dropout=0.05,\n",
" bias=\"none\",\n",
" task_type=\"CAUSAL_LM\"\n",
")\n",
"\n",
"model = get_peft_model(model, config)\n",
"print_trainable_parameters(model)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Training\n",
"- [W&B HuggingFace integration](https://docs.wandb.ai/guides/integrations/huggingface) automatically tracks important metrics during the course of training\n",
"- Also track the HF checkpoints as artifacts and register them in the model registry!\n",
"- Change the number of steps to 200+ for real results!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import transformers\n",
"from datasets import load_dataset\n",
"import wandb\n",
"\n",
"project_name = \"llm-finetuning\" #@param\n",
"entity = \"wandb\" #@param\n",
"os.environ[\"WANDB_LOG_MODEL\"] = \"checkpoint\"\n",
"\n",
"wandb.init(project=project_name,\n",
" entity=entity,\n",
" job_type=\"training\")\n",
"\n",
"data = load_dataset(\"Abirate/english_quotes\")\n",
"data = data.map(lambda samples: tokenizer(samples['quote']), batched=True)\n",
"\n",
"trainer = transformers.Trainer(\n",
" model=model,\n",
" train_dataset=data['train'],\n",
" args=transformers.TrainingArguments(\n",
" per_device_train_batch_size=4,\n",
" gradient_accumulation_steps=4,\n",
" report_to=\"wandb\",\n",
" warmup_steps=5,\n",
" max_steps=25,\n",
" learning_rate=2e-4,\n",
" fp16=True,\n",
" logging_steps=1,\n",
" save_steps=5,\n",
" output_dir='outputs'\n",
" ),\n",
" data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)\n",
")\n",
"model.config.use_cache = False # silence the warnings. Please re-enable for inference!\n",
"trainer.train()\n",
"wandb.finish()"
]
},
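{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you also want a standalone copy of just the (small) LoRA adapter weights outside of the `Trainer` checkpoints, the `peft` model can be saved directly; a minimal sketch (the directory name is arbitrary):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: save only the LoRA adapter weights and config (directory name is arbitrary)\n",
"model.save_pretrained(\"opt125m-lora-adapter\")"
]
},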
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Adding Model Weights to W&B Model Registry\n",
"- Here we get our best checkpoint from the finetuning run and register it as our best model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"last_run_id = \"zz0lxkc8\" #@param\n",
"wandb.init(project=project_name, entity=entity, job_type=\"registering_best_model\")\n",
"best_model = wandb.use_artifact(f'{entity}/{project_name}/checkpoint-{last_run_id}:latest')\n",
"registered_model_name = \"OPT-125M-english\" #@param {type: \"string\"}\n",
"wandb.run.link_artifact(best_model, f'{entity}/model-registry/{registered_model_name}', aliases=['staging'])\n",
"wandb.finish()"
]
},
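{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you don't have the run id handy, one way to list the checkpoint artifacts that a training run logged is the W&B public API; a minimal sketch using the run id from above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: list the artifacts logged by the training run via the public API\n",
"api = wandb.Api()\n",
"training_run = api.run(f\"{entity}/{project_name}/{last_run_id}\")\n",
"for art in training_run.logged_artifacts():\n",
"    print(art.name, art.type)"
]
},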
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Consuming Model From Registry and Quantizing using ctranslate2\n",
"- LLMs are typically too large to run in full-precision on even decent hardware.\n",
"- You can quantize the model to run it more efficiently with minimal loss in accuracy.\n",
" - CTranslate2 is a great first pass at quantization but doesn't do \"smart\" quantization. It just converts all weights to half precision.\n",
" - Checkout out GPTQ and AutoGPTQ for SOTA quantization at scale"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Pull model from the registry\n",
"\n",
"wandb.init(project=project_name, entity=entity, job_type=\"ctranslate2\")\n",
"best_model = wandb.use_artifact(f'{entity}/model-registry/{registered_model_name}:latest')\n",
"best_model.download(root=f'model-registry/{registered_model_name}:latest')\n",
"wandb.finish()"
]
},
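{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick check that the adapter files were downloaded where the conversion step below expects them (the exact filenames depend on the `peft` version):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: confirm the contents of the downloaded adapter directory\n",
"adapter_dir = f'model-registry/{registered_model_name}:latest'\n",
"print(os.listdir(adapter_dir))"
]
},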
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from peft import PeftModel, PeftConfig\n",
"\n",
"def convert_qlora2ct2(adapter_path=f'model-registry/{registered_model_name}:latest',\n",
" full_model_path=\"opt125m-finetuned\",\n",
" offload_path=\"opt125m-offload\",\n",
" ct2_path=\"opt125m-finetuned-ct2\",\n",
" quantization=\"int8\"):\n",
"\n",
"\n",
" peft_model_id = adapter_path\n",
" peftconfig = PeftConfig.from_pretrained(peft_model_id)\n",
"\n",
" model = AutoModelForCausalLM.from_pretrained(\n",
" \"facebook/opt-125m\",\n",
" offload_folder = offload_path,\n",
" device_map='auto',\n",
" )\n",
"\n",
" tokenizer = AutoTokenizer.from_pretrained(\"facebook/opt-125m\")\n",
"\n",
" model = PeftModel.from_pretrained(model, peft_model_id)\n",
"\n",
" print(\"Peft model loaded\")\n",
"\n",
" merged_model = model.merge_and_unload()\n",
"\n",
" merged_model.save_pretrained(full_model_path)\n",
" tokenizer.save_pretrained(full_model_path)\n",
"\n",
" if quantization == False:\n",
" os.system(f\"ct2-transformers-converter --model {full_model_path} --output_dir {ct2_path} --force\")\n",
" else:\n",
" os.system(f\"ct2-transformers-converter --model {full_model_path} --output_dir {ct2_path} --quantization {quantization} --force\")\n",
" print(\"Convert successfully\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"convert_qlora2ct2(adapter_path=f'model-registry/{registered_model_name}:latest')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Run Inference Using Quantized CTranslate2 Model\n",
"- Record the results in a W&B Table!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import ctranslate2\n",
"\n",
"\n",
"run = wandb.init(project=project_name, entity=entity, job_type=\"inference\")\n",
"generator = ctranslate2.Generator(\"opt125m-finetuned-ct2\")\n",
"\n",
"prompts = [\"Hey, are you conscious? Can you talk to me?\",\n",
" \"What is machine learning?\",\n",
" \"What is W&B?\"]\n",
"\n",
"\n",
"wandb_table = wandb.Table(columns=['prompt', 'completion'])\n",
"for prompt in prompts:\n",
" start_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))\n",
" results = generator.generate_batch([start_tokens], max_length=30)\n",
" output = tokenizer.decode(results[0].sequences_ids[0])\n",
" wandb_table.add_data(prompt, output)\n",
"\n",
"wandb.log({\"inference_table\": wandb_table})\n",
"wandb.finish()"
]
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"include_colab_link": true,
"provenance": [],
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}