How to run int4?

#19
by BootsofLagrangian - opened

I saw this paragraph in README.md:

We provide code for on-the-fly int4 quantization which minimizes performance degradation as well.

However, I couldn't find any code or script related to this.

Does just loading the model with a bnb config automatically enable int4? Or is there something else I need to do?

Transformers comes with BitsAndBytes support. When you load the model with .from_pretrained(..., quantization_config=BitsAndBytesConfig(...)), just pass a BitsAndBytesConfig object configured for int8 or int4 with the appropriate arguments. You can additionally offload some of the layers to free up VRAM. Alternatively, you can use a device_map to assign layers to devices manually. I'd use BitsAndBytes and set device_map="auto", letting transformers work out the device mapping for the layers.
