Update README.md
README.md CHANGED

```diff
@@ -18,8 +18,8 @@ tags:
 - **Input:** Text
 - **Output:** Text
 - **Model Optimizations:**
-  - **Activation quantization:**
-  - **Weight quantization:**
+  - **Activation quantization:** FP8
+  - **Weight quantization:** FP8
 - **Intended Use Cases:** Intended for commercial and research use in multiple languages. Similarly to [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B), this model is intended for assistant-like chat.
 - **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws).
 - **Release Date:** 11/27/2024
@@ -32,13 +32,13 @@ It achieves an average score of 71.00 on the [OpenLLM](https://huggingface.co/sp
 
 ### Model Optimizations
 
-This model was obtained by quantizing the weights of [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B) to
+This model was obtained by quantizing the weights of [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B) to FP8 data type.
 This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x).
 Weight quantization also reduces disk size requirements by approximately 50%.
 
 Only weights and activations of the linear operators within transformer blocks are quantized.
-Weights are quantized with a symmetric static per-channel scheme, where a fixed linear scaling factor is applied between
-Activations are quantized with a symmetric dynamic per-token scheme, computing a linear scaling factor at runtime for each token between
+Weights are quantized with a symmetric static per-channel scheme, where a fixed linear scaling factor is applied between FP8 and floating point representations for each output channel dimension.
+Activations are quantized with a symmetric dynamic per-token scheme, computing a linear scaling factor at runtime for each token between FP8 and floating point representations.
 
 ## Deployment
```
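The two schemes the diff describes can be sketched as follows. This is a minimal NumPy illustration of the math only, not the model's actual quantization code (which a library would perform while casting to a hardware float8 type); the constant `FP8_MAX = 448.0` assumes the FP8 E4M3 format.

```python
import numpy as np

# Assumed maximum representable magnitude of FP8 E4M3.
FP8_MAX = 448.0

def quantize_weights_per_channel(w: np.ndarray):
    """Symmetric static per-channel scheme: one fixed scale per
    output channel (row), computed once from the weight tensor."""
    scale = np.abs(w).max(axis=1, keepdims=True) / FP8_MAX  # shape (out_channels, 1)
    q = np.clip(w / scale, -FP8_MAX, FP8_MAX)               # cast to float8 on real hardware
    return q, scale

def quantize_activations_per_token(x: np.ndarray):
    """Symmetric dynamic per-token scheme: a scale is computed at
    runtime for each token (row) of the activation tensor."""
    scale = np.abs(x).max(axis=1, keepdims=True) / FP8_MAX
    q = np.clip(x / scale, -FP8_MAX, FP8_MAX)
    return q, scale

# Dequantization is the inverse linear map: original values ≈ q * scale.
```

Because both schemes are symmetric, a single scaling factor per channel (or per token) suffices, with no zero-point offset.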