
Commit

update xpu quant usage
Signed-off-by: Kaihui-intel <kaihui.tang@intel.com>
Kaihui-intel committed Oct 11, 2024
1 parent 2bb257e commit 08d321b
Showing 3 changed files with 9 additions and 4 deletions.
1 change: 1 addition & 0 deletions docs/source/3x/transformers_like_api.md
@@ -208,6 +208,7 @@ python run_generation_gpu_woq.py --woq --benchmark --model save_dir
> Note:
> * Saving the quantized model should be done before the `optimize_transformers` function is called.
> * The `optimize_transformers` function is designed to optimize transformer-based models within frontend Python modules, with a particular focus on Large Language Models (LLMs). It provides both model-wise and content-generation-wise optimizations. For details of `optimize_transformers`, please refer to [the link](https://github.com/intel/intel-extension-for-pytorch/blob/xpu-main/docs/tutorials/llm/llm_optimize_transformers.md).
> * The quantization process is performed on the CPU by default. Users can override this by setting the environment variable `INC_TARGET_DEVICE`, e.g. in bash: `export INC_TARGET_DEVICE=xpu`.
## Examples

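The note added above explains that quantization targets the CPU unless `INC_TARGET_DEVICE` says otherwise. As a minimal sketch (not part of this commit), the override can also be applied from Python before quantization starts; the `AutoModelForCausalLM`/`RtnConfig` names are assumed from the `neural_compressor.transformers` API this documentation covers, and the model id, config arguments, and save path are placeholders.

```python
# Minimal sketch (assumptions noted above): route the weight-only quantization
# pass to the XPU instead of the default CPU by setting INC_TARGET_DEVICE
# before the quantization code reads it (equivalent to `export INC_TARGET_DEVICE=xpu`).
import os

os.environ["INC_TARGET_DEVICE"] = "xpu"  # override the default "cpu" target

from neural_compressor.transformers import AutoModelForCausalLM, RtnConfig  # assumed API names

woq_config = RtnConfig(bits=4, group_size=128)  # assumed RTN weight-only config
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-7B",                 # placeholder model id
    quantization_config=woq_config,
    device_map="xpu",                # run the quantized model on the Intel GPU
)
model.save_pretrained("save_dir")    # save before optimize_transformers, per the note above
```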
1 change: 1 addition & 0 deletions
@@ -103,6 +103,7 @@ python run_generate_cpu_woq.py \
> 1. The default search algorithm is beam search with num_beams = 1.
> 2. [ipex.optimize_transformers](https://github.com/intel/intel-extension-for-pytorch/blob/v2.1.10%2Bxpu/docs/tutorials/llm/llm_optimize_transformers.md) supports optimized inference for the "gptj," "mistral," "qwen," and "llama" model types to achieve high performance and accuracy; other model types still produce accurate inference results.
> 3. We provide the `WeightOnlyQuant` compression technology with the `Rtn/GPTQ/AutoRound` algorithms; `load_in_4bit` and `load_in_8bit` also work on the Intel GPU device.
> 4. The quantization process is performed on the CPU by default. Users can override this by setting the environment variable `INC_TARGET_DEVICE`, e.g. in bash: `export INC_TARGET_DEVICE=xpu`.
## Prerequisite​
### Dependencies
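As context for note 3 above, here is a small usage sketch (not part of this commit) of `load_in_4bit` on an Intel GPU. The import path, `device_map` argument, and model id are assumptions taken from the surrounding documentation, not verbatim from this file.

```python
# Minimal sketch: 4-bit weight-only loading targeting an Intel GPU.
# INC_TARGET_DEVICE is left unset here, so quantization itself runs on the CPU
# by default; export INC_TARGET_DEVICE=xpu beforehand to move it to the XPU.
from transformers import AutoTokenizer
from neural_compressor.transformers import AutoModelForCausalLM  # assumed import path

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True, device_map="xpu")

prompt = "Weight-only quantization on Intel GPUs"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("xpu")
output = model.generate(input_ids, max_new_tokens=32, num_beams=1)  # beam search with num_beams=1, per note 1
print(tokenizer.decode(output[0], skip_special_tokens=True))
```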
11 changes: 7 additions & 4 deletions neural_compressor/transformers/quantization/utils.py
@@ -347,14 +347,17 @@ def run_fn_for_autoround(model, dataloader):
 
 
 def convert_to_quantized_model(model, config, device="cpu"):
+
     if device == "xpu" or device == torch.device("xpu"):
         import intel_extension_for_pytorch
 
         assert hasattr(torch, "xpu") and torch.xpu.is_available(), "There is no xpu device in this system!"
-        os.environ["INC_TARGET_DEVICE"] = "cpu"
-        logger.info(
-            "Set the environment variable INC_TARGET_DEVICE='cpu' to ensure the quantization process occurs on the CPU."
-        )
+        if "INC_TARGET_DEVICE" not in os.environ:
+            os.environ["INC_TARGET_DEVICE"] = "cpu"
+            logger.info(
+                "Set the environment variable INC_TARGET_DEVICE='cpu'"
+                " to ensure the quantization process occurs on the CPU."
+            )
 
     orig_dtype = torch.float32
     for param in model.parameters():
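To make the behavior change concrete, here is a tiny standalone sketch of the guard introduced in this hunk: the "cpu" default is applied only when the caller has not already set `INC_TARGET_DEVICE`, so a user's `export INC_TARGET_DEVICE=xpu` now survives. The helper name is illustrative only; `convert_to_quantized_model` inlines this logic rather than calling a helper.

```python
import os

def _default_inc_target_device() -> str:
    # Illustrative helper mirroring the guarded default added above.
    if "INC_TARGET_DEVICE" not in os.environ:
        os.environ["INC_TARGET_DEVICE"] = "cpu"
    return os.environ["INC_TARGET_DEVICE"]

os.environ.pop("INC_TARGET_DEVICE", None)
print(_default_inc_target_device())  # -> "cpu": default applied when the variable is unset

os.environ["INC_TARGET_DEVICE"] = "xpu"
print(_default_inc_target_device())  # -> "xpu": a pre-existing user override is respected
```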

0 comments on commit 08d321b
