How to use Activation Quantization? #7

Open
gaokaiz2 opened this issue May 22, 2024 · 1 comment

@gaokaiz2

What I've done: I have used your main function to get weight-only quantized versions of LLMs (from local LLM safetensors to local LLM safetensors) and then used the quantized versions for evaluation with lm-evaluation-harness, which also takes a local path as the model. Everything works well, and thanks for your repo!

What I need help with: I am not sure how to correctly use your functions to get an activation-quantized version of an LLM.

What I've tried:

  1. Directly using your main function to store an activation-quantized version (this probably shouldn't work, because activation quantization has to happen at run time?)
  2. Manually changing the evaluation code so that, right after the model is first loaded, I replace it with your quantize_model(model, args) with kv_bit=16 and a_bit=4; this fails because the loaded models don't have the named_modules that your code expects.

May I get some help with this issue? Thanks in advance!

@wln20
Contributor

wln20 commented May 22, 2024

Hi!

Thanks for your question. Unlike weight-only quantization, which only needs to replace the weight data with its quantized counterpart (without modifying the model architecture), weight-and-activation (WA) quantization has to replace the whole linear module (i.e. nn.Linear) with our customized one (i.e. WALinear), because the quantization of activations has to be performed on the layer's input or output data at run time. Therefore, a WA-quantized model cannot be loaded directly with the standard AutoModelForCausalLM.from_pretrained() function: that function builds the model architecture from the standard modeling_XXX.py file (where XXX is the name of a certain model, e.g. llama), and the standard modeling_XXX.py has no WALinear module.
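To illustrate the point, a WA linear layer conceptually looks like the toy sketch below. This is not the actual WALinear implementation from this repo (the names fake_quantize and IllustrativeWALinear are made up for illustration); it just shows why the module itself, and not only its weights, has to change: the input activations are fake-quantized inside forward() on every call.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def fake_quantize(x: torch.Tensor, n_bits: int) -> torch.Tensor:
    """Symmetric per-tensor fake quantization: quantize, then dequantize."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    return (x / scale).round().clamp(-qmax - 1, qmax) * scale


class IllustrativeWALinear(nn.Module):
    """Toy stand-in for a WA linear layer: the weight is quantized once,
    but the input activations must be quantized at every forward pass."""

    def __init__(self, linear: nn.Linear, w_bit: int = 4, a_bit: int = 4):
        super().__init__()
        self.a_bit = a_bit
        self.bias = linear.bias
        # Weight quantization can be done once, offline.
        self.weight = nn.Parameter(
            fake_quantize(linear.weight.data, w_bit), requires_grad=False
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Activation quantization has to happen here, at run time, which is
        # why nn.Linear must be swapped for a custom module in the model.
        x = fake_quantize(x, self.a_bit)
        return F.linear(x, self.weight, self.bias)


# Quick check that the toy layer runs:
layer = IllustrativeWALinear(nn.Linear(16, 16), w_bit=4, a_bit=4)
y = layer(torch.randn(2, 16))
```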

So, if you want to load a WA-quantized model directly from a local checkpoint, you must use a customized modeling_XXX.py that explicitly defines WALinear in place of nn.Linear. Unfortunately, we don't have such a customized modeling_XXX.py yet; WALinear is added to the model dynamically during the execution of quantize_model(). We apologize for the inconvenience and are considering adding a customized modeling file in the future, so please stay tuned.

To correctly use WA quantization in your code, simply load the original full-precision model and call model = quantize_model(model, args) to get everything ready; you can then run inference with both weights and activations quantized.
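Here is a minimal sketch of that flow. The args fields below (w_bit, a_bit, kv_bit) are just an illustration based on the flags mentioned in this thread, and the import of quantize_model is left as a comment because the exact module path depends on the repo layout; please check the repo's main script / argument parser for the authoritative names.

```python
from argparse import Namespace

from transformers import AutoModelForCausalLM, AutoTokenizer

# quantize_model comes from this repo; import it the same way the repo's
# main script does (the exact module path is omitted here on purpose).

model_path = "/path/to/full-precision/llama"  # placeholder path
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Field names mirror the flags mentioned above (a_bit, kv_bit); w_bit is an
# assumed weight-quantization field. Adjust to the repo's actual arguments.
args = Namespace(w_bit=4, a_bit=4, kv_bit=16)

model = quantize_model(model, args)  # nn.Linear layers are replaced with WALinear here

# `model` can now be handed to lm-evaluation-harness (or used directly)
# with both weights and activations quantized.
```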
