Cherry pick Habana software 1.18.0 update (#2025)
Signed-off-by: xinhe3 <xinhe3@habana.ai>
Signed-off-by: Yi Liu <yiliu4@habana.ai>
Signed-off-by: Sun, Xuehao <xuehao.sun@intel.com>
Signed-off-by: chensuyue <suyue.chen@intel.com>
Co-authored-by: yan tomsinsky <ytomsinsky@habana.ai>
Co-authored-by: Uri Livne <ulivne@habana.ai>
Co-authored-by: Dudi Lester <dlester@habana.ai>
Co-authored-by: Danny <dsemiat@habana.ai>
Co-authored-by: Tomer Gafni <tgafni@habana.ai>
Co-authored-by: Eran Geva <egeva@habana.ai>
Co-authored-by: Daniel Ohayon <danielohayon444@gmail.com>
Co-authored-by: Roi Tiefenbrunn <rtiefenbrunn@habana.ai>
Co-authored-by: Kamil Felskowski <kfelskowskix@habana.ai>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Parent: d6149aa
Commit: 5fb2184
Showing 67 changed files with 99,756 additions and 945 deletions.
```diff
@@ -11,3 +11,4 @@ neural-compressor
 lm_eval==0.4.3
 peft
 optimum-intel
+intel_extension_for_pytorch
```
```diff
@@ -10,3 +10,4 @@ einops
 neural-compressor
 lm_eval==0.4.3
 peft
+intel_extension_for_pytorch
```
```diff
@@ -13,3 +13,5 @@ tiktoken #qwen
 einops #qwen
 auto_round
 lm-eval==0.4.3
+numba
+tbb
```
```diff
@@ -13,3 +13,5 @@ einops #qwen
 auto_round
 lm-eval==0.4.3
 huggingface_hub
+numba
+tbb
```
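All four requirements diffs above converge on the same few runtime dependencies. As a quick sanity check (a sketch only; the module names are taken from the requirements lines above, and whether `tbb` is importable as a Python module can vary by platform), you can confirm they resolve in the target environment:

```python
# Sketch: verify the dependencies added by these diffs are importable.
# Package/module names follow the requirements lines; failures are reported,
# not raised, since platform support differs.
import importlib

for pkg in ("intel_extension_for_pytorch", "numba", "tbb"):
    try:
        importlib.import_module(pkg)
        print(f"{pkg}: OK")
    except ImportError as err:
        print(f"{pkg}: missing ({err})")
```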
183 changes: 50 additions & 133 deletions in ...rch/nlp/huggingface_models/language-modeling/quantization/weight_only/README.md
````diff
@@ -1,179 +1,96 @@
-Step-by-Step
-============
-This document describes the step-by-step instructions to run large language models (LLMs) on 4th Gen Intel® Xeon® Scalable Processor (codenamed Sapphire Rapids) with PyTorch and Intel® Extension for PyTorch.
+Weight-only quantization
+===============
 
-The script `run_clm_no_trainer.py` supports `GPTJ`, `OPT`, `LLaMA2`, `BLOOM` and `Falcon` quantization and validates last word prediction accuracy with [lm_eval](https://github.com/EleutherAI/lm-evaluation-harness.git) now, and we are adding more models.
-
-# Prerequisite
-## 1. Create Environment
+## Prerequisite
 ```
 # Installation
 pip install -r requirements.txt
 ```
 
-# Run
+## Support status on HPU
 
-Here is how to run the scripts:
+Below is the current support status on Intel Gaudi AI Accelerator with PyTorch.
 
-**Causal Language Modeling (CLM)**
+| woq_algo | Status |
+|--------------|----------|
+| GPTQ | ✔|
 
-`run_clm_no_trainer.py` quantizes the large language models using the dataset [NeelNanda/pile-10k](https://huggingface.co/datasets/NeelNanda/pile-10k) calibration and validates `lambada_openai`, `piqa`, `winogrande`, `hellaswag` and other datasets accuracy provided by lm_eval, an example command is as follows.
-### GPT-J-6b
-#### Quantization
+> We validated the typical LLMs such as: `meta-llama/Llama-2-7b-hf`, `EleutherAI/gpt-j-6B`, `facebook/opt-125m`.
 
-```bash
-# "--woq_algo GPTQ" is used to enable GPTQ algorithms
-# "--double_quant_type BNB_NF4" is used to enable double quant algorithms
-python run_clm_no_trainer.py \
-  --model EleutherAI/gpt-j-6B \
-  --dataset NeelNanda/pile-10k \
-  --quantize \
-  --woq_algo GPTQ \
-  --woq_bits 4 \
-  --woq_scheme asym \
-  --woq_group_size 128 \
-  --gptq_max_seq_length 2048 \
-  --gptq_use_max_length \
-  --double_quant_type "BNB_NF4" \
-  --output_dir saved_results
+## Support status on CPU
 
-# "--woq_algo RTN" is used to enable RTN algorithms
-python run_clm_no_trainer.py \
-  --model EleutherAI/gpt-j-6B \
-  --dataset NeelNanda/pile-10k \
-  --quantize \
-  --woq_algo RTN \
-  --woq_bits 4 \
-  --woq_scheme asym \
-  --woq_group_size 128 \
-  --double_quant_type "BNB_NF4"
-  --output_dir saved_results
+Below is the current support status on Intel® Xeon® Scalable Processor with PyTorch.
 
-# "--woq_algo AWQ" is used to enable AWQ algorithms
-python run_clm_no_trainer.py \
-  --model EleutherAI/gpt-j-6B \
-  --dataset NeelNanda/pile-10k \
-  --quantize \
-  --woq_algo AWQ \
-  --woq_bits 4 \
-  --woq_scheme asym \
-  --woq_group_size 128 \
-  --calib_iters 128
+| woq_algo | status |
+|--------------|----------|
+| RTN | ✔ |
+| GPTQ | ✔ |
+| AutoRound| ✔ |
+| AWQ | ✔ |
+| TEQ | ✔ |
 
-# "--woq_algo AutoRound" is used to enable AutoRound algorithms
-python run_clm_no_trainer.py \
-  --model EleutherAI/gpt-j-6B \
-  --dataset NeelNanda/pile-10k \
-  --quantize \
-  --woq_algo AutoRound \
-  --woq_bits 4 \
-  --woq_scheme asym \
-  --woq_group_size 128
+> We validated the typical LLMs such as: `meta-llama/Llama-2-7b-hf`, `EleutherAI/gpt-j-6B`, `facebook/opt-125m`.
 
-# "--accuracy" for eval
-python run_clm_no_trainer.py \
-  --model EleutherAI/gpt-j-6B \
-  --dataset NeelNanda/pile-10k \
-  --int8 \
-  --accuracy \
-  --tasks "lambada_openai" \
-  --output_dir saved_results
-```
-**Notes**: Weight-only quantization based on fake quantization is previewly supported and supports RTN, GPTQ[1], AWQ[2], TEQ algorithms. For more details, please refer to [link](https://github.com/intel/neural-compressor/blob/master/docs/source/quantization_weight_only.md). Our GPTQ API support various CLMs including GPTJ, OPTs, Blooms, Llamas, Falcons, MPTs, ChatGLMs, etc. Simply replace the "--model" argument with other models to quantize different CLMs with GPTQ.
+## Run
 
-### OPT-125m
+`run_clm_no_trainer.py` quantizes the large language models using the dataset [NeelNanda/pile-10k](https://huggingface.co/datasets/NeelNanda/pile-10k) calibration and validates datasets accuracy provided by lm_eval, an example command is as follows.
 
-#### Quantization
+### Quantization
 
 ```bash
 # "--woq_algo GPTQ" is used to enable GPTQ algorithms
 # "--double_quant_type BNB_NF4" is used to enable double quant algorithms
 python run_clm_no_trainer.py \
-  --model facebook/opt-125m \
+  --model meta-llama/Llama-2-7b-hf \
   --dataset NeelNanda/pile-10k \
   --quantize \
+  --batch_size 8 \
   --woq_algo GPTQ \
   --woq_bits 4 \
   --woq_scheme asym \
   --woq_group_size 128 \
   --gptq_max_seq_length 2048 \
   --gptq_use_max_length \
   --double_quant_type "BNB_NF4"
-
-# "--woq_algo RTN" is used to enable RTN algorithms
-python run_clm_no_trainer.py \
-  --model facebook/opt-125m \
-  --dataset NeelNanda/pile-10k \
-  --quantize \
-  --woq_algo RTN \
-  --woq_bits 4 \
-  --woq_scheme asym \
-  --woq_group_size 128 \
-  --double_quant_type "BNB_NF4"
-
-# "--woq_algo AWQ" is used to enable AWQ algorithms
-python run_clm_no_trainer.py \
-  --model facebook/opt-125m \
-  --dataset NeelNanda/pile-10k \
-  --quantize \
-  --woq_algo AWQ \
-  --woq_bits 4 \
-  --woq_scheme asym \
-  --woq_group_size 128 \
-  --calib_iters 128
-
-# "--woq_algo AutoRound" is used to enable AutoRound algorithms
-python run_clm_no_trainer.py \
-  --model facebook/opt-125m \
-  --dataset NeelNanda/pile-10k \
-  --quantize \
-  --woq_algo AutoRound \
-  --woq_bits 4 \
-  --woq_scheme asym \
-  --woq_group_size 128
-
-# "--accuracy" for eval
-python run_clm_no_trainer.py \
-  --model facebook/opt-125m \
-  --dataset NeelNanda/pile-10k \
-  --int8 \
-  --accuracy \
-  --tasks "lambada_openai" \
   --output_dir saved_results
 ```
+### Evaluation
 
-### LLAMA2-7b/13b/70b
-#### Quantization
-
-```bash
-# "--double_quant_type BNB_NF4" is used to enable double quant algorithms
-# "--woq_algo GPTQ" is used to enable GPTQ algorithms
-python run_clm_no_trainer.py \
-  --model meta-llama/Llama-2-7b-hf \
-  --dataset NeelNanda/pile-10k \
-  --quantize \
-  --woq_algo GPTQ \
-  --woq_bits 4 \
-  --woq_scheme asym \
-  --woq_group_size 128 \
-  --gptq_max_seq_length 2048 \
-  --gptq_use_max_length \
-  --double_quant_type "BNB_NF4"
-
-# "--woq_algo RTN" is used to enable RTN algorithms
-python run_clm_no_trainer.py \
-  --model meta-llama/Llama-2-7b-hf \
-  --dataset NeelNanda/pile-10k \
-  --quantize \
-  --woq_algo RTN \
-  --woq_bits 4 \
-  --woq_scheme asym \
-  --woq_group_size 128 \
-  --double_quant_type "BNB_NF4"
-```
-
-
-[1]. Elias, Frantar, et al. "GPTQ: Accurate Post-training Compression for Generative Pretrained Transformers." arXiv preprint arXiv:2210.17323 (2023).
-[2]. Lin, Ji, et al. "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." arXiv preprint arXiv:2306.00978 (2023).
+```bash
+# original model
+python run_clm_no_trainer.py \
+  --model meta-llama/Llama-2-7b-hf \
+  --accuracy \
+  --batch_size 8 \
+  --tasks "lambada_openai,wikitext" \
+  --output_dir saved_results
+
+# quantized model
+python run_clm_no_trainer.py \
+  --model meta-llama/Llama-2-7b-hf \
+  --load \
+  --accuracy \
+  --batch_size 8 \
+  --tasks "lambada_openai,wikitext" \
+  --output_dir saved_results
+```
+
+### Benchmark
+
+```bash
+# original model
+python run_clm_no_trainer.py \
+  --model meta-llama/Llama-2-7b-hf \
+  --performance \
+  --batch_size 8 \
+  --output_dir saved_results
+
+# quantized model
+python run_clm_no_trainer.py \
+  --model meta-llama/Llama-2-7b-hf \
+  --load \
+  --performance \
+  --batch_size 8 \
+  --output_dir saved_results
+```
+
+For more information about parameter usage, please refer to [PT_WeightOnlyQuant.md](https://github.com/intel/neural-compressor/blob/master/docs/source/3x/PT_WeightOnlyQuant.md)
````
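For readers landing on this commit without the full docs: the `run_clm_no_trainer.py` flags in the new README map onto neural-compressor's 3.x PyTorch weight-only API (`prepare`/`convert` with an algorithm config, as described in the PT_WeightOnlyQuant.md linked in the diff). A minimal sketch of the RTN path is below; the model choice and parameter values mirror the README flags for illustration and are not part of this commit, so double-check names against the docs for your installed version:

```python
# Minimal sketch: 4-bit RTN weight-only quantization with the INC 3.x
# PyTorch API. bits/group_size follow the README's "--woq_bits 4
# --woq_group_size 128"; use_sym=False corresponds to "--woq_scheme asym".
from transformers import AutoModelForCausalLM
from neural_compressor.torch.quantization import RTNConfig, prepare, convert

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

quant_config = RTNConfig(bits=4, use_sym=False, group_size=128)
model = prepare(model, quant_config)  # wrap modules for quantization
model = convert(model)                # RTN needs no calibration pass
```

GPTQ follows the same `prepare`/`convert` shape but additionally needs calibration data run through the prepared model before `convert`, which is what the pile-10k dataset arguments in the README commands provide.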
```diff
@@ -11,4 +11,5 @@ neural-compressor
 lm_eval==0.4.3
 peft
 auto_round
-intel_extension_for_pytorch
+numba
+tbb
```