schmidt-sebastian committed (verified) · Commit c542ba2 · 1 parent: 0f175a2

Update DeepSeek model
.gitattributes CHANGED
@@ -34,3 +34,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
  deepseek_q8_ekv1280.task filter=lfs diff=lfs merge=lfs -text
+ DeepSeek-R1-Distill-Qwen-1.5B_multi-prefill-seq_f32_ekv1280.task filter=lfs diff=lfs merge=lfs -text
+ DeepSeek-R1-Distill-Qwen-1.5B_multi-prefill-seq_q8_ekv1280.task filter=lfs diff=lfs merge=lfs -text
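The two added patterns route the new multi-prefill-seq bundles through Git LFS, so a clone made without Git LFS only materializes small pointer stubs in place of the multi-GB artifacts. A minimal, illustrative Python sketch (not part of this repository) that checks whether a local copy is still a pointer stub:

    # check_lfs_stub.py -- illustrative helper, not part of this repository.
    # A Git LFS pointer stub is a tiny text file that starts with the spec line
    # below; the real .task/.tflite artifacts are hundreds of MB to several GB.
    from pathlib import Path

    LFS_SPEC = b"version https://git-lfs.github.com/spec/v1"

    def is_lfs_pointer_stub(path: Path) -> bool:
        """Return True if `path` holds a Git LFS pointer instead of real model data."""
        with path.open("rb") as f:
            head = f.read(len(LFS_SPEC))
        return head == LFS_SPEC

    if __name__ == "__main__":
        model = Path("DeepSeek-R1-Distill-Qwen-1.5B_multi-prefill-seq_q8_ekv1280.task")
        if is_lfs_pointer_stub(model):
            print("Pointer stub only -- run `git lfs pull` to fetch the real file.")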
DeepSeek-R1-Distill-Qwen-1.5B_multi-prefill-seq_f32_ekv1280.task ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:10dd74836637508364b953448f1222981deaafa8ad0af2f7dbc683794b4e84cf
+ size 7124558842
DeepSeek-R1-Distill-Qwen-1.5B_multi-prefill-seq_f32_ekv1280.tflite ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4eb5bae61d9717fd9d0ff9c1f00266e27ab049ec807b1e9dd1440041600bcdfc
+ size 7121909824
DeepSeek-R1-Distill-Qwen-1.5B_multi-prefill-seq_q8_ekv1280.task ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b1311ae8c3ad18d6c71abdf6859f63908aebe20bf42b18ba511c5fa033c179ee
+ size 1861094766
DeepSeek-R1-Distill-Qwen-1.5B_multi-prefill-seq_q8_ekv1280.tflite ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ad356efd69259876152347e51cf776443a4fce3abe61d7212bb685c702d70560
+ size 1858445888
DeepSeek-R1-Distill-Qwen-1.5B_seq128_f32_ekv1280.tflite ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ba4dd48eec47d612b9501f8365671e4a9714487a67c4276a00309637e99f7a02
+ size 7116878944
DeepSeek-R1-Distill-Qwen-1.5B_seq128_q8_ekv1280.tflite ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7b06368fa46eb6934daf035e5bcc7bdd37086f0fd92fbbe9ce744613a379e209
+ size 1806773448
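Each ADDED entry above is only the Git LFS pointer that lives in the repository; the actual bytes are stored in LFS. A hedged sketch that verifies a fully downloaded file against the `oid sha256:` and `size` fields of its pointer (the expected values below are copied from the q8 `.task` pointer in this commit; the script itself is illustrative):

    # verify_lfs_object.py -- illustrative check, assuming the file was already
    # fetched (e.g. via `git lfs pull` or a Hugging Face Hub download).
    import hashlib
    from pathlib import Path

    # Values copied from the q8 .task pointer added in this commit.
    EXPECTED_OID = "b1311ae8c3ad18d6c71abdf6859f63908aebe20bf42b18ba511c5fa033c179ee"
    EXPECTED_SIZE = 1861094766

    def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
        """Stream the file in 1 MiB chunks and return its hex SHA-256 digest."""
        digest = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    path = Path("DeepSeek-R1-Distill-Qwen-1.5B_multi-prefill-seq_q8_ekv1280.task")
    assert path.stat().st_size == EXPECTED_SIZE, "size mismatch"
    assert sha256_of(path) == EXPECTED_OID, "sha256 mismatch"
    print("LFS object matches its pointer.")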
README.md CHANGED
@@ -1,34 +1,45 @@
- ---
+ --------------------------------------------------------------------------------
+
  license: mit
  base_model:
  - deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
- ---

  # litert-community/DeepSeek-R1-Distill-Qwen-1.5B

- This model provides a few variants of [deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) that are ready for deployment on Android using the [LiteRT (fka TFLite) stack](https://ai.google.dev/edge/litert) and [MediaPipe LLM Inference API](https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference).
+ This model provides a few variants of
+ [deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) that are ready for
+ deployment on Android using the
+ [LiteRT (fka TFLite) stack](https://ai.google.dev/edge/litert) and
+ [MediaPipe LLM Inference API](https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference).

  ## Use the models

  ### Colab

- *Disclaimer: The target deployment surface for the LiteRT models is Android/iOS/Web and the stack has been optimized for performance on these targets. Trying out the system in Colab is an easier way to familiarize yourself with the LiteRT stack, with the caveat that the performance (memory and latency) on Colab could be much worse than on a local device.*
+ *Disclaimer: The target deployment surface for the LiteRT models is
+ Android/iOS/Web and the stack has been optimized for performance on these
+ targets. Trying out the system in Colab is an easier way to familiarize yourself
+ with the LiteRT stack, with the caveat that the performance (memory and latency)
+ on Colab could be much worse than on a local device.*

- [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/#fileId=https://huggingface.co/litert-community/DeepSeek-R1-Distill-Qwen-1.5B/blob/main/deepseek_tflite.ipynb)
+ [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/#fileId=https://huggingface.co/litert-community/DeepSeek-R1-Distill-Qwen-1.5B/blob/main/DeepSeek-R1-Distill-Qwen-1.5B_tflite.ipynb)

  ### Android

- * Download and install [the apk](https://github.com/google-ai-edge/mediapipe-samples/releases/download/v0.1.0/llm_inference_v0.1.0-debug.apk).
- * Follow the instructions in the app.
-
+ * Download and install
+ [the apk](https://github.com/google-ai-edge/mediapipe-samples/releases/latest/download/llm_inference-debug.apk).
+ * Follow the instructions in the app.

- To build the demo app from source, please follow the [instructions](https://github.com/google-ai-edge/mediapipe-samples/blob/main/examples/llm_inference/android/README.md) from the GitHub repository.
+ To build the demo app from source, please follow the
+ [instructions](https://github.com/google-ai-edge/mediapipe-samples/blob/main/examples/llm_inference/android/README.md)
+ from the GitHub repository.

  ## Performance

  ### Android

- Note that all benchmark stats are from a Samsung S24 Ultra with 1280 KV cache size, 512 tokens prefill, 128 tokens decode.
+ Note that all benchmark stats are from a Samsung S24 Ultra with
+ 1280 KV cache size with multiple prefill signatures enabled.

  <table border="1">
  <tr>
@@ -41,26 +52,30 @@ Note that all benchmark stats are from a Samsung S24 Ultra with 1280 KV cache si
  <th>Model size (MB)</th>
  </tr>
  <tr>
- <td>fp32 (baseline)</td>
- <td rowspan="2">CPU</td>
- <td><p style="text-align: right">45</p></td>
- <td><p style="text-align: right">6</p></td>
- <td><p style="text-align: right">8</p></td>
- <td><p style="text-align: right">6,213</p></td>
- <td><p style="text-align: right">7,124</p></td>
- </tr>
- <tr>
- <td>dynamic_int8</td>
- <td><p style="text-align: right">261</p></td>
- <td><p style="text-align: right">23</p></td>
- <td><p style="text-align: right">2 </p></td>
- <td><p style="text-align: right">1,936 </p></td>
- <td><p style="text-align: right">1,861</p></td>
- </tr>
+ <td>fp32 (baseline)</td>
+ <td>cpu</td>
+ <td><p style="text-align: right">41.84 tk/s</p></td>
+ <td><p style="text-align: right">6.14 tk/s</p></td>
+ <td><p style="text-align: right">14.30 s</p></td>
+ <td><p style="text-align: right">7,421 MB</p></td>
+ <td><p style="text-align: right">6,794 MB</p></td>
+ </tr>
+ <tr>
+ <td>dynamic_int8</td>
+ <td>cpu</td>
+ <td><p style="text-align: right">228.57 tk/s</p></td>
+ <td><p style="text-align: right">18.80 tk/s</p></td>
+ <td><p style="text-align: right">3.14 s</p></td>
+ <td><p style="text-align: right">3,600 MB</p></td>
+ <td><p style="text-align: right">1,774 MB</p></td>
+ </tr>
+
  </table>

- * Model Size: measured by the size of the .tflite flatbuffer (serialization format for LiteRT models)
+ * Model Size: measured by the size of the .tflite flatbuffer (serialization
+ format for LiteRT models)
  * Memory: indicator of peak RAM usage
- * The inference on CPU is accelerated via the LiteRT [XNNPACK](https://github.com/google/XNNPACK) delegate with 4 threads
+ * The inference on CPU is accelerated via the LiteRT
+ [XNNPACK](https://github.com/google/XNNPACK) delegate with 4 threads
  * Benchmark is done assuming XNNPACK cache is enabled
+ * dynamic_int8: quantized model with int8 weights and float activations.
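The updated README points to the multi-prefill-seq `.task` bundles that the MediaPipe LLM Inference API consumes on device. A minimal, hedged sketch of fetching one of them programmatically with `huggingface_hub` (repo id and filename are taken from this commit; everything else is illustrative):

    # fetch_task_bundle.py -- illustrative download sketch using huggingface_hub
    # (pip install huggingface_hub); repo_id/filename come from this commit.
    from huggingface_hub import hf_hub_download

    task_path = hf_hub_download(
        repo_id="litert-community/DeepSeek-R1-Distill-Qwen-1.5B",
        filename="DeepSeek-R1-Distill-Qwen-1.5B_multi-prefill-seq_q8_ekv1280.task",
    )
    print(task_path)  # local cache path of the ~1.9 GB bundle

    # The bundle is then loaded by the MediaPipe LLM Inference API / the demo
    # app linked in the README (push it to whatever device path the app expects).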
tokenizer.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5baf9d2f6c82b4e4eeeb7147babd3d82682381f5ee83a78faee472294dce457b
+ size 2648396