How to run int4?
I saw this paragraph in README.md:
We provide code for on-the-fly int4 quantization which minimizes performance degradation as well.
However, I couldn't find any code or script related to this. Does just loading the model with a bnb config automatically enable int4, or is there something else I need to do?
Transformers integrates with bitsandbytes. When you load the model with .from_pretrained(..., quantization_config=BitsAndBytesConfig(...)), just pass a BitsAndBytesConfig object with the int8 or int4 arguments (load_in_8bit=True or load_in_4bit=True). You can additionally offload some layers to CPU to free up room in VRAM, or pass a custom device_map to assign layers yourself. I'd use BitsAndBytesConfig with device_map="auto" and let transformers pick the device mapping for the layers.