qaihm-bot committed
Commit 11603c2 · verified · 1 Parent(s): 44681b8

See https://github.com/quic/ai-hub-models/releases/v0.30.2 for changelog.

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -15,7 +15,7 @@ pipeline_tag: text-generation
  ## State-of-the-art large language model useful on a variety of language understanding and generation tasks
 
 
- Llama 3 is a family of LLMs. The "Chat" at the end indicates that the model is optimized for chatbot-like dialogue. The model is quantized to w4a16 (4-bit weights and 16-bit activations) and part of the model is quantized to w8a16 (8-bit weights and 16-bit activations) making it suitable for on-device deployment. For Prompt and output length specified below, the time to first token is Llama-PromptProcessor-Quantized's latency and average time per addition token is Llama-TokenGenerator-Quantized's latency.
+ Llama 3 is a family of LLMs. The model is quantized to w4a16 (4-bit weights and 16-bit activations), and part of the model is quantized to w8a16 (8-bit weights and 16-bit activations), making it suitable for on-device deployment. For the prompt and output lengths specified below, the time to first token is Llama-PromptProcessor-Quantized's latency and the average time per additional token is Llama-TokenGenerator-Quantized's latency.
 
  This model is an implementation of Llama-v3.2-3B-Instruct found [here](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct/).
 
@@ -50,7 +50,7 @@ This model is an implementation of Llama-v3.2-3B-Instruct found [here](https://h
  | Llama-v3.2-3B-Chat | w4a16 | Snapdragon X Elite CRD | Snapdragon® X Elite | QNN | 18.4176 | 0.125936 - 4.029952 | -- | -- |
  | Llama-v3.2-3B-Chat | w4a16 | SA8255P ADP | Qualcomm® SA8255P | QNN | 14.02377 | 0.187414 - 5.997257 | -- | -- |
 
- ## Deploying Llama 3.2 on-device
+ ## Deploying Llama 3.2 3B on-device
 
  Please follow the [LLM on-device deployment](https://github.com/quic/ai-hub-apps/tree/main/tutorials/llm_on_genie) tutorial.
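The description's framing — time to first token (TTFT) from the prompt processor, then a fixed average latency per additional token from the token generator — combines into an end-to-end estimate. A minimal sketch, assuming the table's unlabeled numeric columns are a generation rate in tokens/s and a TTFT range in seconds (the header row is not shown in this diff, so that reading is an assumption):

```python
def total_latency_s(ttft_s: float, tokens_per_s: float, output_tokens: int) -> float:
    """Estimate end-to-end response time: TTFT for the first token,
    then one per-token interval for each remaining output token."""
    per_token_s = 1.0 / tokens_per_s
    return ttft_s + (output_tokens - 1) * per_token_s

# Using the Snapdragon X Elite row (18.4176 tokens/s, TTFT low end 0.125936 s)
# for a hypothetical 128-token response:
estimate = total_latency_s(ttft_s=0.125936, tokens_per_s=18.4176, output_tokens=128)
print(f"{estimate:.2f} s")  # prints "7.02 s"
```

The long prompts at the high end of the TTFT range shift the total upward by the same per-row TTFT delta, which is why the table reports TTFT as a range rather than a single number.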