|
This is the [inceptionai/jais-family-13b](https://huggingface.co/inceptionai/jais-family-13b) model converted to [OpenVINO](https://docs.openvino.ai/2025/index.html) with INT4 weight compression.

## Download the model |
|
|
|
- Install huggingface-hub (the quotes keep shells such as zsh from expanding the brackets)

```sh
pip install "huggingface-hub[cli]"
```

- Download the model

```sh
huggingface-cli download helenai/jais-family-13b-ov-int4-sym --local-dir helenai/jais-family-13b-ov-int4-sym
```

## Run inference

- Install/upgrade OpenVINO GenAI nightly (this is the only requirement for inference; there is no need to install transformers or PyTorch)

```sh
pip install --pre --upgrade openvino-genai openvino openvino-tokenizers --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/pre-release
```

- Download a sample inference script (`curl -O` works on Windows Command Prompt and most Linux terminals). Note that this is not a chat/instruct model; the inference script is only for testing model outputs. The model is not finetuned for answering questions, and the script does not keep conversation history.

```sh
curl -O https://raw.githubusercontent.com/helena-intel/snippets/refs/heads/main/llm_chat/python/llm_test.py
```

- Run the script with the path to the model and the device as parameters. Change `GPU` to `CPU` to run on CPU; NPU is not yet supported for this model.

```sh
python llm_test.py helenai/jais-family-13b-ov-int4-sym GPU
```

Check out [OpenVINO GenAI documentation](https://docs.openvino.ai/2025/openvino-workflow-generative/inference-with-genai.html) for more information. |
|
|
|
## Model compression parameters

```
openvino_version : 2025.2.0-18660-3ceeeb52d64

advanced_parameters : {'statistics_path': None, 'awq_params': {'subset_size': 32, 'percent_to_apply': 0.002, 'alpha_min': 0.0, 'alpha_max': 1.0, 'steps': 100}, 'scale_estimation_params': {'subset_size': 64, 'initial_steps': 5, 'scale_steps': 5, 'weight_penalty': -1.0}, 'gptq_params': {'damp_percent': 0.1, 'block_size': 128, 'subset_size': 128}, 'lora_correction_params': {'adapter_rank': 8, 'num_iterations': 3, 'apply_regularization': True, 'subset_size': 128, 'use_int8_adapters': True}}
all_layers : False
awq : False
backup_mode : int8_asym
gptq : False
group_size : -1
ignored_scope : []
lora_correction : False
mode : int4_sym
ratio : 1.0
scale_estimation : False
sensitivity_metric : weight_quantization_error

optimum_intel_version : 1.22.0
optimum_version : 1.24.0
pytorch_version : 2.5.1+cpu
transformers_version : 4.48.3
```
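
The parameters above (`int4_sym` mode, `group_size : -1` for per-channel quantization, `ratio : 1.0`) correspond to an `optimum-cli` export. A command along the following lines is a sketch reconstructed from those parameters, not the exact command used to produce this model:

```sh
# Export the source model to OpenVINO with symmetric per-channel INT4
# weight compression applied to all compressible layers (ratio 1.0)
optimum-cli export openvino --model inceptionai/jais-family-13b \
  --weight-format int4 --sym --group-size -1 --ratio 1.0 \
  jais-family-13b-ov-int4-sym
```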