nm-research committed on
Commit c54de38 · verified · 1 Parent(s): 2bbceff

Create README.md
Files changed (1): README.md (+278, −0)

---
tags:
- w8a8
- INT8
- vllm
- audio
license: apache-2.0
license_link: https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/apache-2.0.md
language:
- en
base_model: openai/whisper-tiny
library_name: transformers
---

# whisper-tiny-quantized.w8a8

## Model Overview
- **Model Architecture:** whisper-tiny
  - **Input:** Audio-Text
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** INT8
  - **Activation quantization:** INT8
- **Release Date:** 04/16/2025
- **Version:** 1.0
- **Model Developers:** Neural Magic

Quantized version of [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny).

### Model Optimizations

This model was obtained by quantizing the weights and activations of [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) to the INT8 data type, ready for inference with vLLM >= 0.5.2.
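
To double-check the applied scheme, the checkpoint's `config.json` can be inspected directly. The sketch below is a minimal example that assumes the quantization metadata is stored under a `quantization_config` entry, as is the convention for compressed-tensors checkpoints.

```python
import json

from huggingface_hub import hf_hub_download

# Download only the config file of the quantized checkpoint.
config_path = hf_hub_download(
    repo_id="neuralmagic/whisper-tiny-quantized.w8a8",
    filename="config.json",
)

with open(config_path) as f:
    config = json.load(f)

# The weight/activation bit-widths and ignored modules are expected
# under "quantization_config" (compressed-tensors convention).
print(json.dumps(config.get("quantization_config", {}), indent=2))
```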

## Deployment

### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm.assets.audio import AudioAsset
from vllm import LLM, SamplingParams

# Prepare the model.
llm = LLM(
    model="neuralmagic/whisper-tiny-quantized.w8a8",
    max_model_len=448,
    max_num_seqs=400,
    limit_mm_per_prompt={"audio": 1},
)

# Prepare inputs using an explicit encoder/decoder prompt.
inputs = {
    "encoder_prompt": {
        "prompt": "",
        "multi_modal_data": {
            "audio": AudioAsset("winning_call").audio_and_sample_rate,
        },
    },
    "decoder_prompt": "<|startoftranscript|>",
}

# Generate the transcription.
print("========== SAMPLE GENERATION ==============")
outputs = llm.generate(inputs, SamplingParams(temperature=0.0, max_tokens=64))
print(f"PROMPT : {outputs[0].prompt}")
print(f"RESPONSE: {outputs[0].outputs[0].text}")
print("==========================================")
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
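
As an illustration, the snippet below queries a locally served copy of the model through the OpenAI Python client. It assumes the server was started with `vllm serve neuralmagic/whisper-tiny-quantized.w8a8`, that the running vLLM version exposes the `/v1/audio/transcriptions` endpoint for Whisper models, and that `sample.wav` is a placeholder for a local audio file.

```python
from openai import OpenAI

# Assumes the server was started with:
#   vllm serve neuralmagic/whisper-tiny-quantized.w8a8
# and that this vLLM version exposes the audio transcriptions endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# "sample.wav" is an illustrative local audio file.
with open("sample.wav", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="neuralmagic/whisper-tiny-quantized.w8a8",
        file=audio_file,
    )

print(transcription.text)
```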

## Creation

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

<details>
<summary>Model Creation Code</summary>

```bash
python quantize.py --model_path openai/whisper-tiny --quant_path "output_dir/whisper-tiny-quantized.w8a8" --calib_size 1024 --dampening_frac 0.01
```

```python
import argparse

import torch
from datasets import load_dataset
from transformers import WhisperProcessor

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers.tracing import TraceableWhisperForConditionalGeneration

parser = argparse.ArgumentParser()
parser.add_argument('--model_path', type=str)
parser.add_argument('--quant_path', type=str)
parser.add_argument('--calib_size', type=int, default=256)
parser.add_argument('--dampening_frac', type=float, default=0.1)
parser.add_argument('--observer', type=str, default="minmax")
args = parser.parse_args()

model_id = args.model_path

# Load the model in a traceable form so llm-compressor can calibrate it.
model = TraceableWhisperForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)
model.config.forced_decoder_ids = None
processor = WhisperProcessor.from_pretrained(model_id)

# Configure the processor for the dataset task.
processor.tokenizer.set_prefix_tokens(language="en", task="transcribe")

# Select the calibration dataset.
DATASET_ID = "MLCommons/peoples_speech"
DATASET_SUBSET = "test"
DATASET_SPLIT = "test"

# Select the number of samples for calibration. 512 samples is a good place
# to start; increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = args.calib_size
MAX_SEQUENCE_LENGTH = 2048
dampening_frac = args.dampening_frac

# Load the dataset and preprocess.
ds = load_dataset(
    DATASET_ID,
    DATASET_SUBSET,
    split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]",
    trust_remote_code=True,
)

def preprocess(example):
    return {
        "array": example["audio"]["array"],
        "sampling_rate": example["audio"]["sampling_rate"],
        "text": " " + example["text"].capitalize(),
    }

ds = ds.map(preprocess, remove_columns=ds.column_names)

# Process inputs into model features and decoder input ids.
def process(sample):
    inputs = processor(
        audio=sample["array"],
        sampling_rate=sample["sampling_rate"],
        text=sample["text"],
        add_special_tokens=True,
        return_tensors="pt",
    )

    inputs["input_features"] = inputs["input_features"].to(dtype=model.dtype)
    inputs["decoder_input_ids"] = inputs["labels"]
    del inputs["labels"]

    return inputs

ds = ds.map(process, remove_columns=ds.column_names)

# Define a oneshot data collator for multimodal inputs.
def data_collator(batch):
    assert len(batch) == 1
    return {key: torch.tensor(value) for key, value in batch[0].items()}

ignore = ["lm_head"]

# Recipe: W8A8 GPTQ on all Linear layers except the ignored modules.
recipe = [
    GPTQModifier(
        targets="Linear",
        scheme="W8A8",
        sequential_targets=["WhisperEncoderLayer", "WhisperDecoderLayer"],
        ignore=ignore,
        dampening_frac=dampening_frac,
    )
]

# Apply the quantization algorithm.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    data_collator=data_collator,
)

# Save the compressed model and processor to disk.
save_path = args.quant_path
print("Saving model:", save_path)
model.save_pretrained(save_path, save_compressed=True)
processor.save_pretrained(save_path)
```
</details>

## Evaluation

The model was evaluated on the [LibriSpeech](https://huggingface.co/datasets/lmms-lab/librispeech) and [Fleurs](https://huggingface.co/datasets/lmms-lab/fleurs) datasets using [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval), via the following commands:

<details>
<summary>Evaluation Commands</summary>

LibriSpeech:
```
lmms-eval \
    --model=whisper_vllm \
    --model_args="pretrained=neuralmagic-ent/whisper-tiny-quantized.w8a8" \
    --batch_size 64 \
    --output_path <output_file_path> \
    --tasks librispeech
```

Fleurs:
```
lmms-eval \
    --model=whisper_vllm \
    --model_args="pretrained=neuralmagic-ent/whisper-tiny-quantized.w8a8" \
    --batch_size 64 \
    --output_path <output_file_path> \
    --tasks fleurs
```
</details>

<table>
  <thead>
    <tr>
      <th>Benchmark</th>
      <th>Split</th>
      <th>BF16</th>
      <th>w8a8</th>
      <th>Recovery (%)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td rowspan="2"><b>LibriSpeech (WER)</b></td>
      <td>test-clean</td>
      <td></td>
      <td></td>
      <td></td>
    </tr>
    <tr>
      <td>test-other</td>
      <td></td>
      <td></td>
      <td></td>
    </tr>
    <tr>
      <td rowspan="3"><b>Fleurs (X→en, BLEU)</b></td>
      <td>cmn_hans_cn</td>
      <td></td>
      <td></td>
      <td></td>
    </tr>
    <tr>
      <td>en</td>
      <td></td>
      <td></td>
      <td></td>
    </tr>
    <tr>
      <td>yue_hant_hk</td>
      <td></td>
      <td></td>
      <td></td>
    </tr>
  </tbody>
</table>
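
For reference when reading the table above, the helper below sketches one common way to compute the recovery column. It assumes recovery is the ratio of baseline to quantized WER (lower is better) and of quantized to baseline BLEU (higher is better), which may not match the exact convention used for this card; the numbers shown are illustrative only, not measured results.

```python
def recovery(baseline: float, quantized: float, lower_is_better: bool) -> float:
    """Recovery of the quantized score relative to the BF16 baseline, in percent.

    Assumed convention: for error-style metrics such as WER (lower is better),
    recovery = baseline / quantized; for score-style metrics such as BLEU
    (higher is better), recovery = quantized / baseline.
    """
    ratio = baseline / quantized if lower_is_better else quantized / baseline
    return 100.0 * ratio


# Illustrative numbers only, not measured results.
print(f"WER recovery:  {recovery(7.6, 7.8, lower_is_better=True):.2f}%")
print(f"BLEU recovery: {recovery(12.1, 11.9, lower_is_better=False):.2f}%")
```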