matejhornik committed on
Commit
7daa40c
·
verified ·
1 Parent(s): c85954a

file upload

Files changed (39)
  1. .gitattributes +1 -0
  2. README.md +296 -3
  3. all_results.json +20 -0
  4. checkpoint-29000/config.json +303 -0
  5. checkpoint-29000/generation_config.json +13 -0
  6. checkpoint-29000/model.safetensors +3 -0
  7. checkpoint-29000/optimizer.pt +3 -0
  8. checkpoint-29000/preprocessor_config.json +9 -0
  9. checkpoint-29000/rng_state.pth +3 -0
  10. checkpoint-29000/scheduler.pt +3 -0
  11. checkpoint-29000/trainer_state.json +0 -0
  12. checkpoint-29000/training_args.bin +3 -0
  13. config.json +303 -0
  14. create_model.py +49 -0
  15. eval_dev_results.json +9 -0
  16. eval_test_results.json +9 -0
  17. generation_config.json +13 -0
  18. merges.txt +0 -0
  19. model.safetensors +3 -0
  20. preprocessor_config.json +9 -0
  21. special_tokens_map.json +51 -0
  22. tokenizer.json +0 -0
  23. tokenizer_config.json +57 -0
  24. train_results.json +9 -0
  25. trainer_state.json +0 -0
  26. training_args.bin +3 -0
  27. vocab.json +0 -0
  28. wandb/.DS_Store +0 -0
  29. wandb/run-20250515_192303-7xkscxrj/files/config.yaml +1039 -0
  30. wandb/run-20250515_192303-7xkscxrj/files/media/table/model_speed2size1_table_3555_34483c9cf24b143db620.table.json +1 -0
  31. wandb/run-20250515_192303-7xkscxrj/files/media/table/model_speed2size2_table_3556_ffc3f22eaf8a279337f3.table.json +1 -0
  32. wandb/run-20250515_192303-7xkscxrj/files/output.log +0 -0
  33. wandb/run-20250515_192303-7xkscxrj/files/requirements.txt +184 -0
  34. wandb/run-20250515_192303-7xkscxrj/files/wandb-metadata.json +96 -0
  35. wandb/run-20250515_192303-7xkscxrj/files/wandb-summary.json +1 -0
  36. wandb/run-20250515_192303-7xkscxrj/logs/debug-core.log +15 -0
  37. wandb/run-20250515_192303-7xkscxrj/logs/debug-internal.log +17 -0
  38. wandb/run-20250515_192303-7xkscxrj/logs/debug.log +35 -0
  39. wandb/run-20250515_192303-7xkscxrj/run-7xkscxrj.wandb +3 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ wandb/run-20250515_192303-7xkscxrj/run-7xkscxrj.wandb filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,296 @@
1
- ---
2
- license: mit
3
- ---
1
+ ---
2
+ language: en
3
+ model_name: Wav2Vec2-BART (Base) English ASR - VoxPopuli Best WER
4
+ license: mit
5
+ tags:
6
+ - automatic-speech-recognition
7
+ - speech-encoder-decoder
8
+ - wav2vec2
9
+ - bart
10
+ - english
11
+ - voxpopuli
12
+ - generated_from_trainer
13
+ - audio
14
+ - master-thesis
15
+ datasets:
16
+ - facebook/voxpopuli
17
+ base_model:
18
+ - facebook/wav2vec2-base-en-voxpopuli-v2
19
+ - facebook/bart-base
20
+ model-index:
21
+ - name: matejhornik/wav2vec2-base_bart-base_voxpopuli-en
22
+ results:
23
+ - task:
24
+ type: automatic-speech-recognition
25
+ name: Automatic Speech Recognition
26
+ dataset:
27
+ name: VoxPopuli (English, Test)
28
+ type: facebook/voxpopuli
29
+ config: en
30
+ split: test
31
+ metrics:
32
+ - name: WER
33
+ type: wer
34
+ value: 0.08848048503220916 # 8.85%
35
+ - task:
36
+ type: automatic-speech-recognition
37
+ name: Automatic Speech Recognition
38
+ dataset:
39
+ name: VoxPopuli (English, Validation)
40
+ type: facebook/voxpopuli
41
+ config: en
42
+ split: validation
43
+ metrics:
44
+ - name: WER
45
+ type: wer
46
+ value: 0.08554638942253362 # 8.55%
47
+ pipeline_tag: automatic-speech-recognition
48
+ library_name: transformers
49
+ ---
50
+
51
+ # Wav2Vec2-BART (Base) for English ASR on VoxPopuli - Best WER from Master's Thesis
52
+
53
+ This repository contains the checkpoint for a `SpeechEncoderDecoderModel` fine-tuned for Automatic Speech Recognition (ASR) on the English portion of the VoxPopuli dataset. This model achieved the **best Word Error Rate (WER) of 8.85% on the VoxPopuli English test set** within the experimental framework of the Master's thesis "Effective Training of Neural Networks for Automatic Speech Recognition" by Matej Horník.
54
+
55
+ The model leverages a pre-trained **Wav2Vec2 (Base)** encoder (`facebook/wav2vec2-base-en-voxpopuli-v2`) and a pre-trained **BART (Base)** decoder (`facebook/bart-base`), connected via convolutional adapter layers.
56
+
57
+ ## Thesis Context
58
+
59
+ This model is a direct result of work conducted for the Master's thesis:
60
+
61
+ * **Title:** Effective Training of Neural Networks for Automatic Speech Recognition
62
+ * **Author:** Matej Horník
63
+ * **Supervisor:** Ing. Alexander Polok
64
+ * **Institution:** Brno University of Technology, Faculty of Information Technology
65
+ * **Year:** 2025
66
+ * **Thesis Link:** [Link to thesis PDF](https://www.vut.cz/en/students/final-thesis/detail/164401)
67
+
68
+ > [!NOTE]
69
+ > Link will be available after the thesis defense.
70
+
71
+ ### Thesis Abstract (English)
72
+ This master's thesis focuses on improving the training efficiency and performance of encoder-decoder transformer models for Automatic Speech Recognition (ASR). It investigates the impact of initialization strategies using pre-trained components (Wav2Vec2, BART), the role of convolutional adapters, and Parameter-Efficient Fine-tuning (PEFT) methods like LoRA and DoRA. Experiments on LibriSpeech and VoxPopuli datasets confirmed that full pre-trained initialization is crucial for best Word Error Rate (WER) and convergence. An optimal number of adapters improved performance, while PEFT (especially LoRA) significantly reduced trainable parameters with comparable accuracy. Domain-specific encoder pre-training proved beneficial, and the encoder-decoder model outperformed a CTC baseline in accuracy, offering practical insights for efficient ASR training.
73
+
74
+ ## Model Details
75
+
76
+ * **Encoder:** `facebook/wav2vec2-base-en-voxpopuli-v2`. This is a Wav2Vec2 (Base) model pre-trained by Facebook on 24.1k hours of unlabeled English VoxPopuli data.
77
+ * **Decoder:** `facebook/bart-base`. This is a standard BART (Base) model.
78
+ * **Architecture:** `SpeechEncoderDecoderModel` from Hugging Face Transformers.
79
+ * **Adapters:** 3 convolutional adapter layers were added to the encoder's output to better align its temporal resolution with the BART decoder's input requirements.
80
+ * **Feature Extractor:** The Wav2Vec2 feature extractor (initial CNN layers) was kept frozen during fine-tuning, as experiments showed this maintained performance while reducing trainable parameters.
81
+
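The effect of the adapters on temporal resolution can be worked out from the `conv_kernel`, `conv_stride`, `adapter_kernel_size`, `adapter_stride` and `num_adapter_layers` values in `config.json`. The sketch below is illustrative only: the function names are hypothetical, and the adapter padding of 1 is an assumption about how the adapter convolutions are configured rather than something stated in this commit.

```python
# Values copied from the encoder section of config.json in this commit.
CONV_KERNEL = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDE = (5, 2, 2, 2, 2, 2, 2)
NUM_ADAPTER_LAYERS = 3
ADAPTER_KERNEL = 3
ADAPTER_STRIDE = 2
ADAPTER_PADDING = 1  # assumption, not taken from config.json


def conv_out_length(length, kernel, stride, padding=0):
    """Standard output-length formula for a 1-D convolution."""
    return (length + 2 * padding - kernel) // stride + 1


def encoder_frames(num_samples, num_adapter_layers=NUM_ADAPTER_LAYERS):
    """Number of encoder output frames for a waveform of `num_samples` samples."""
    length = num_samples
    # Seven CNN feature-extractor layers (total stride 320 samples).
    for kernel, stride in zip(CONV_KERNEL, CONV_STRIDE):
        length = conv_out_length(length, kernel, stride)
    # Each adapter layer roughly halves the sequence length again.
    for _ in range(num_adapter_layers):
        length = conv_out_length(length, ADAPTER_KERNEL, ADAPTER_STRIDE, ADAPTER_PADDING)
    return length


one_second = 16000  # samples at 16 kHz
print(encoder_frames(one_second, num_adapter_layers=0))  # frames without adapters
print(encoder_frames(one_second))                        # frames with 3 adapters
```

Under these assumptions, one second of audio yields 49 frames from the bare encoder and 7 frames after three adapters, which is much closer to the length of a subword-level transcript.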
82
+ ### Initial Model Construction
83
+ The base model (before fine-tuning for this specific result) was constructed by combining the pre-trained `facebook/wav2vec2-base-en-voxpopuli-v2` (encoder) and `facebook/bart-base` (decoder) using `SpeechEncoderDecoderModel.from_encoder_decoder_pretrained`. The code that creates it is provided in [create_model.py](create_model.py):
84
+
85
+ ```bash
86
+ python create_model.py
87
+ ```
88
+
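For orientation, a construction step of this kind could look roughly like the sketch below. It is based only on values visible in `config.json` in this commit (adapter settings and special-token IDs). The helper names `encoder_decoder_kwargs` and `build_model` are hypothetical, and the actual `create_model.py` is not reproduced here and may differ.

```python
ENCODER_ID = "facebook/wav2vec2-base-en-voxpopuli-v2"
DECODER_ID = "facebook/bart-base"


def encoder_decoder_kwargs():
    """Keyword arguments with the `encoder_` prefix are applied to the encoder config."""
    return {
        "encoder_add_adapter": True,       # config.json: "add_adapter": true
        "encoder_num_adapter_layers": 3,   # config.json: "num_adapter_layers": 3
        "encoder_adapter_kernel_size": 3,  # config.json: "adapter_kernel_size": 3
        "encoder_adapter_stride": 2,       # config.json: "adapter_stride": 2
    }


def build_model(output_dir=None):
    """Combine both pre-trained checkpoints (downloads weights on first use)."""
    # Imported lazily so the kwargs helper can be inspected without transformers.
    from transformers import SpeechEncoderDecoderModel

    model = SpeechEncoderDecoderModel.from_encoder_decoder_pretrained(
        ENCODER_ID, DECODER_ID, **encoder_decoder_kwargs()
    )
    # Special-token IDs as they appear at the top level of config.json.
    model.config.decoder_start_token_id = 0
    model.config.pad_token_id = 1
    model.config.eos_token_id = 2
    if output_dir is not None:
        model.save_pretrained(output_dir)
    return model
```

Calling `build_model("some_output_dir")` would write an untrained combination of the two checkpoints; the cross-attention and adapter weights are newly initialized at this point and only become useful after fine-tuning.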
89
+
90
+ ## Training Data
91
+
92
+ ### Data
93
+ The model was fine-tuned on the `train` split of the English portion of the [VoxPopuli dataset](https://huggingface.co/datasets/facebook/voxpopuli) (`facebook/voxpopuli`, config `en`).
94
+ Audio data was resampled to 16 kHz. Text transcriptions were lowercased and tokenized with the BART tokenizer.
95
+
96
+ ### Procedure
97
+ The model was fine-tuned using a modified [`run_speech_recognition_seq2seq.py`](https://github.com/hornikmatej/thesis_mit/blob/main/run_speech_recognition_seq2seq.py) script (provided in the thesis materials, based on Hugging Face's example scripts).
98
+
99
+ **Key Hyperparameters:**
100
+ * **Optimizer:** AdamW
101
+ * **Learning Rate:** `1e-4`
102
+ * **LR Scheduler:** `cosine_with_min_lr` (min\_lr: `5e-9`)
103
+ * **Warmup Steps:** 2000
104
+ * **Batch Size (per device):** 96
105
+ * **Gradient Accumulation Steps:** 1
106
+ * **Number of Epochs:** 20
107
+ * **Weight Decay:** 0.01
108
+ * **Label Smoothing Factor:** 0.05
109
+ * **Mixed Precision:** bf16
110
+ * **SpecAugment:** Applied during training
111
+ * `mask_time_prob`: 0.25, `mask_time_length`: 30, `mask_time_min_masks`: 2
112
+ * `mask_feature_prob`: 0.3, `mask_feature_length`: 30, `mask_feature_min_masks`: 1
113
+ * **Feature Extractor:** Frozen
114
+
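To make the learning-rate settings above concrete, here is a small sketch of a linear warmup followed by cosine decay to a minimum learning rate. It follows the usual textbook formula and is not copied from the Transformers scheduler. The total step count is an assumption derived from `train_samples` in `all_results.json` (167046 samples, batch size 96, 20 epochs), and `lr_at` is a hypothetical helper name.

```python
import math

PEAK_LR = 1e-4
MIN_LR = 5e-9
WARMUP_STEPS = 2000
# Assumption: ceil(167046 / 96) = 1741 steps per epoch, times 20 epochs.
TOTAL_STEPS = 1741 * 20


def lr_at(step, peak=PEAK_LR, min_lr=MIN_LR, warmup=WARMUP_STEPS, total=TOTAL_STEPS):
    """Learning rate at a given optimizer step."""
    if step < warmup:
        # Linear warmup from 0 to the peak learning rate.
        return peak * step / max(1, warmup)
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    # Cosine decay from the peak down to the minimum learning rate.
    return min_lr + (peak - min_lr) * cosine


for step in (0, 1000, 2000, 18000, TOTAL_STEPS):
    print(step, f"{lr_at(step):.3e}")
```

The schedule reaches the peak exactly at the end of warmup and ends at the configured minimum instead of zero.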
115
+ The full training command can be found in the [thesis materials](https://github.com/hornikmatej/thesis_mit/blob/main/run_scripts/voxpopuli_best.sh), including the specific arguments used.
116
+
117
+
118
+ ## Evaluation
119
+
120
+ The model achieves the following Word Error Rate (WER) on the VoxPopuli English dataset:
121
+
122
+ | Dataset Split | WER (%) | Loss |
123
+ |---------------|---------|-------|
124
+ | Validation | 8.55 | 1.056 |
125
+ | Test | 8.85 | 1.076 |
126
+
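For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words, summed over the corpus. The following is a minimal pure-Python sketch of that definition. It is not the evaluation code used for the numbers above, and `word_error_rate` is a hypothetical helper name.

```python
def word_error_rate(references, hypotheses):
    """Corpus-level WER: total word edits divided by total reference words."""
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words, hyp_words = ref.split(), hyp.split()
        # prev[j] = edit distance between the processed reference prefix
        # and the first j hypothesis words.
        prev = list(range(len(hyp_words) + 1))
        for i, ref_word in enumerate(ref_words, start=1):
            curr = [i]
            for j, hyp_word in enumerate(hyp_words, start=1):
                cost = 0 if ref_word == hyp_word else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # match or substitution
            prev = curr
        total_edits += prev[-1]
        total_words += len(ref_words)
    return total_edits / total_words


refs = ["the cat sat on the mat"]
hyps = ["the cat sit on mat"]
print(f"{word_error_rate(refs, hyps):.2%}")  # one substitution + one deletion over 6 words
print(f"{0.08848048503220916:.2%}")          # the reported test WER as a percentage
```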
127
+
128
+ For detailed training logs, metrics, and visualizations, please refer to the Weights & Biases report:
129
+
130
+ [![alt text](https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg)](https://api.wandb.ai/links/xhorni20-fitvut/2018dikj)
131
+
132
+ ## How to Use
133
+
134
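+ You can use this model for inference with the Hugging Face `transformers` library. The example below reads audio with `soundfile`, so make sure it is installed. The audio must already be sampled at 16 kHz; resample it first (for example with `librosa` or `torchaudio`) if it is not.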
135
+
136
+ ```python
137
+ from transformers import SpeechEncoderDecoderModel, AutoProcessor
138
+ import torch
139
+ import soundfile as sf
140
+
141
+ model_id = "matejhornik/wav2vec2-base_bart-base_voxpopuli-en"
142
+ device = "cuda" if torch.cuda.is_available() else "cpu"
143
+
144
+ # Load the processor (feature extractor and tokenizer)
145
+ processor = AutoProcessor.from_pretrained(model_id)
146
+
147
+ # Load the model
148
+ model = SpeechEncoderDecoderModel.from_pretrained(model_id).to(device)
149
+
150
+ def transcribe_audio(audio_path):
151
+     """Loads audio, processes it, and transcribes it."""
152
+     speech_array, sampling_rate = sf.read(audio_path)
153
+
154
+     # Ensure audio is 16kHz as expected by the model
155
+     if sampling_rate != processor.feature_extractor.sampling_rate:
156
+         raise ValueError(f"Audio sampling rate {sampling_rate} does not match model's required {processor.feature_extractor.sampling_rate}Hz. Please resample.")
157
+
158
+     # Preprocess the audio
159
+     inputs = processor(speech_array, sampling_rate=processor.feature_extractor.sampling_rate, return_tensors="pt", padding=True)
160
+     # The Wav2Vec2 feature extractor returns `input_values` (raw waveform), not `input_features`
161
+     input_values = inputs.input_values.to(device)
162
+
163
+     # Generate transcription (no attention_mask: preprocessor_config.json sets return_attention_mask to false)
164
+     with torch.no_grad():
165
+         predicted_ids = model.generate(input_values, max_length=128)
166
+
167
+     # Decode the transcription
168
+     transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
169
+     return transcription[0]
170
+
171
+ # Example usage:
172
+ audio_file_path = "path/to/your/audio.wav"
173
+ try:
174
+     transcription = transcribe_audio(audio_file_path)
175
+     print(f"Transcription: {transcription}")
176
+ except ValueError as e:
177
+     print(e)
178
+ except FileNotFoundError:
179
+     print(f"Audio file not found at: {audio_file_path}. Please provide a valid path.")
180
+ ```
181
+
182
+ ## Reproducing Evaluation on VoxPopuli
183
+ To reproduce the evaluation results on the VoxPopuli test set:
184
+
185
+ ```python
186
+ from datasets import load_dataset
187
+ from transformers import SpeechEncoderDecoderModel, AutoProcessor
188
+ import torch
189
+ from jiwer import wer
190
+ from tqdm import tqdm
191
+
192
+ model_id = "matejhornik/wav2vec2-base_bart-base_voxpopuli-en"
193
+ dataset_name = "facebook/voxpopuli"
194
+ dataset_config = "en"
195
+ split = "test" # or "validation"
196
+
197
+ device = "cuda" if torch.cuda.is_available() else "cpu"
198
+
199
+ # Load processor and model
200
+ processor = AutoProcessor.from_pretrained(model_id)
201
+ model = SpeechEncoderDecoderModel.from_pretrained(model_id).to(device)
202
+ model.eval() # Set model to evaluation mode
203
+
204
+ # Load dataset
205
+ # Note: You might need to authenticate with Hugging Face if the dataset requires it
206
+ # from huggingface_hub import login
207
+ voxpopuli_test = load_dataset(dataset_name, dataset_config, split=split, streaming=False) # Set streaming=True for large datasets if memory is an issue
208
+
209
+ # Preprocessing function
210
+ def map_to_pred(batch):
211
+     # Ensure audio is in the correct format (array, 16kHz)
212
+     audio_data = batch["audio"]["array"]
213
+     sampling_rate = batch["audio"]["sampling_rate"]
214
+
215
+     if sampling_rate != processor.feature_extractor.sampling_rate:
216
+         # Skip mismatched audio rather than decoding a placeholder tensor; the loop below catches this
217
+         raise ValueError(f"Expected {processor.feature_extractor.sampling_rate} Hz audio, got {sampling_rate} Hz. Please resample.")
218
+
219
+     # The Wav2Vec2 feature extractor returns `input_values` (raw waveform), not `input_features`
220
+     inputs = processor(audio_data, sampling_rate=sampling_rate, return_tensors="pt", padding=True)
221
+     input_values = inputs.input_values.to(device)
222
+
223
+     with torch.no_grad():
224
+         predicted_ids = model.generate(input_values, max_length=128)
225
+
226
+     transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
227
+     batch["prediction"] = transcription[0]
228
+     batch["reference"] = batch["normalized_text"]
229
+     return batch
230
+
231
+
232
+ predictions = []
233
+ references = []
234
+
235
+ for sample in tqdm(voxpopuli_test):
236
+     try:
237
+         processed_sample = map_to_pred(sample)
238
+         predictions.append(processed_sample["prediction"])
239
+         references.append(processed_sample["reference"])
240
+     except Exception as e:
241
+         print(f"Error processing sample: {e}")
242
+
243
+
244
+ # Calculate WER
245
+ if predictions and references:
246
+     current_wer = wer(references, predictions)
247
+     print(f"WER on {split} set: {current_wer:.4f}")
248
+ else:
249
+     print("No samples processed or an error occurred.")
250
+
251
+ # Expected WER on test set: 0.0885
252
+ # Expected WER on validation set: 0.0855
253
+ ```
254
+
255
+ ### Framework Versions
256
+
257
+ This model was trained using:
258
+ - Python: `^3.10`
259
+ - Transformers: `~4.46.3`
260
+ - PyTorch: `~2.5.1`
261
+ - Datasets: `^3.2.0`
262
+ - PEFT: `^0.14.0`
263
+ - Accelerate: `^1.4.0`
264
+ - Evaluate: `^0.4.3`
265
+ - WandB: `^0.19.7`
266
+
267
+ ## Citation
268
+
269
+ If you use this model or findings from the thesis, please cite:
270
+
271
+ [![CITE](https://excel.fit.vutbr.cz/wp-content/images/2023/FIT_color_CMYK_EN.svg)](https://www.vut.cz/en/students/final-thesis/detail/164401)
272
+
273
+ ```bibtex
274
+ @mastersthesis{Hornik2025EffectiveTraining,
275
+ author = {Horník, Matej},
276
+ title = {Effective Training of Neural Networks for Automatic Speech Recognition},
277
+ school = {Brno University of Technology, Faculty of Information Technology},
278
+ year = {2025},
279
+ supervisor = {Polok, Alexander},
280
+ type = {Master's Thesis},
281
+ note = {Online. Available at: \url{https://www.vut.cz/en/students/final-thesis/detail/164401}}
282
+ }
283
+ ```
284
+
285
+ ## Acknowledgements
286
+ - My supervisor, Ing. Alexander Polok, for his valuable guidance and support.
287
+ - The Hugging Face team for their comprehensive transformers, datasets, and evaluate libraries.
288
+ - The creators of Wav2Vec2, BART, and the VoxPopuli dataset.
289
+
290
+ ## Contact
291
+ For questions, feedback, or collaboration opportunities related to this thesis or other topics, feel free to reach out:
292
+
293
294
+ - **GitHub:** [hornikmatej](https://github.com/hornikmatej)
295
+
296
+
all_results.json ADDED
@@ -0,0 +1,20 @@
1
+ {
2
+ "epoch": 20.0,
3
+ "eval_dev_loss": 1.0564184188842773,
4
+ "eval_dev_runtime": 121.5437,
5
+ "eval_dev_samples_per_second": 13.09,
6
+ "eval_dev_steps_per_second": 0.14,
7
+ "eval_dev_wer": 0.08554638942253362,
8
+ "eval_samples": 1705,
9
+ "eval_test_loss": 1.0758554935455322,
10
+ "eval_test_runtime": 132.2526,
11
+ "eval_test_samples_per_second": 12.892,
12
+ "eval_test_steps_per_second": 0.136,
13
+ "eval_test_wer": 0.08848048503220916,
14
+ "total_flos": 0.0,
15
+ "train_loss": 1.6298207611684346,
16
+ "train_runtime": 35628.5116,
17
+ "train_samples": 167046,
18
+ "train_samples_per_second": 93.771,
19
+ "train_steps_per_second": 0.977
20
+ }
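A few derived quantities help when reading these raw numbers. The snippet below is only illustrative arithmetic on values copied from `all_results.json`. The interpretation of samples per step as the effective batch size is an assumption, and the step count is approximate because the reported rates are rounded.

```python
# Values copied from all_results.json above.
results = {
    "train_runtime": 35628.5116,          # seconds
    "train_samples": 167046,
    "train_samples_per_second": 93.771,
    "train_steps_per_second": 0.977,
    "eval_dev_wer": 0.08554638942253362,
    "eval_test_wer": 0.08848048503220916,
}

train_hours = results["train_runtime"] / 3600
# Rates are rounded in the file, so these two figures are approximate.
approx_steps = results["train_runtime"] * results["train_steps_per_second"]
samples_per_step = results["train_samples_per_second"] / results["train_steps_per_second"]

print(f"training time: {train_hours:.1f} h")
print(f"approx. optimizer steps: {approx_steps:.0f}")
print(f"samples per step: {samples_per_step:.0f}")
print(f"dev WER: {results['eval_dev_wer']:.2%}, test WER: {results['eval_test_wer']:.2%}")
```

The samples-per-step figure comes out at about 96, which matches the per-device batch size listed in the README with no gradient accumulation.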
checkpoint-29000/config.json ADDED
@@ -0,0 +1,303 @@
1
+ {
2
+ "_name_or_path": "./seq2seq_wav2vec2_bart-base_24k-en-voxpopuli",
3
+ "architectures": [
4
+ "SpeechEncoderDecoderModel"
5
+ ],
6
+ "decoder": {
7
+ "_attn_implementation_autoset": true,
8
+ "_name_or_path": "facebook/bart-base",
9
+ "activation_dropout": 0.1,
10
+ "activation_function": "gelu",
11
+ "add_bias_logits": false,
12
+ "add_cross_attention": true,
13
+ "add_final_layer_norm": false,
14
+ "architectures": [
15
+ "BartModel"
16
+ ],
17
+ "attention_dropout": 0.1,
18
+ "bad_words_ids": null,
19
+ "begin_suppress_tokens": null,
20
+ "bos_token_id": 0,
21
+ "chunk_size_feed_forward": 0,
22
+ "classif_dropout": 0.1,
23
+ "classifier_dropout": 0.0,
24
+ "cross_attention_hidden_size": null,
25
+ "d_model": 768,
26
+ "decoder_attention_heads": 12,
27
+ "decoder_ffn_dim": 3072,
28
+ "decoder_layerdrop": 0.0,
29
+ "decoder_layers": 6,
30
+ "decoder_start_token_id": 2,
31
+ "diversity_penalty": 0.0,
32
+ "do_sample": false,
33
+ "dropout": 0.1,
34
+ "early_stopping": true,
35
+ "encoder_attention_heads": 12,
36
+ "encoder_ffn_dim": 3072,
37
+ "encoder_layerdrop": 0.0,
38
+ "encoder_layers": 6,
39
+ "encoder_no_repeat_ngram_size": 0,
40
+ "eos_token_id": 2,
41
+ "exponential_decay_length_penalty": null,
42
+ "finetuning_task": null,
43
+ "forced_bos_token_id": 0,
44
+ "forced_eos_token_id": 2,
45
+ "gradient_checkpointing": false,
46
+ "id2label": {
47
+ "0": "LABEL_0",
48
+ "1": "LABEL_1",
49
+ "2": "LABEL_2"
50
+ },
51
+ "init_std": 0.02,
52
+ "is_decoder": true,
53
+ "is_encoder_decoder": false,
54
+ "label2id": {
55
+ "LABEL_0": 0,
56
+ "LABEL_1": 1,
57
+ "LABEL_2": 2
58
+ },
59
+ "length_penalty": 1.0,
60
+ "max_length": 20,
61
+ "max_position_embeddings": 1024,
62
+ "min_length": 0,
63
+ "model_type": "bart",
64
+ "no_repeat_ngram_size": 3,
65
+ "normalize_before": false,
66
+ "normalize_embedding": true,
67
+ "num_beam_groups": 1,
68
+ "num_beams": 4,
69
+ "num_hidden_layers": 6,
70
+ "num_return_sequences": 1,
71
+ "output_attentions": false,
72
+ "output_hidden_states": false,
73
+ "output_scores": false,
74
+ "pad_token_id": 1,
75
+ "prefix": null,
76
+ "problem_type": null,
77
+ "pruned_heads": {},
78
+ "remove_invalid_values": false,
79
+ "repetition_penalty": 1.0,
80
+ "return_dict": true,
81
+ "return_dict_in_generate": false,
82
+ "scale_embedding": false,
83
+ "sep_token_id": null,
84
+ "suppress_tokens": null,
85
+ "task_specific_params": {
86
+ "summarization": {
87
+ "length_penalty": 1.0,
88
+ "max_length": 128,
89
+ "min_length": 12,
90
+ "num_beams": 4
91
+ },
92
+ "summarization_cnn": {
93
+ "length_penalty": 2.0,
94
+ "max_length": 142,
95
+ "min_length": 56,
96
+ "num_beams": 4
97
+ },
98
+ "summarization_xsum": {
99
+ "length_penalty": 1.0,
100
+ "max_length": 62,
101
+ "min_length": 11,
102
+ "num_beams": 6
103
+ }
104
+ },
105
+ "temperature": 1.0,
106
+ "tf_legacy_loss": false,
107
+ "tie_encoder_decoder": false,
108
+ "tie_word_embeddings": true,
109
+ "tokenizer_class": null,
110
+ "top_k": 50,
111
+ "top_p": 1.0,
112
+ "torch_dtype": "float32",
113
+ "torchscript": false,
114
+ "typical_p": 1.0,
115
+ "use_bfloat16": false,
116
+ "use_cache": true,
117
+ "vocab_size": 50265
118
+ },
119
+ "decoder_start_token_id": 0,
120
+ "encoder": {
121
+ "_attn_implementation_autoset": true,
122
+ "_name_or_path": "facebook/wav2vec2-base-en-voxpopuli-v2",
123
+ "activation_dropout": 0.0,
124
+ "adapter_attn_dim": null,
125
+ "adapter_kernel_size": 3,
126
+ "adapter_stride": 2,
127
+ "add_adapter": true,
128
+ "add_cross_attention": false,
129
+ "apply_spec_augment": true,
130
+ "architectures": [
131
+ "Wav2Vec2ForPreTraining"
132
+ ],
133
+ "attention_dropout": 0.1,
134
+ "bad_words_ids": null,
135
+ "begin_suppress_tokens": null,
136
+ "bos_token_id": 1,
137
+ "chunk_size_feed_forward": 0,
138
+ "classifier_proj_size": 256,
139
+ "codevector_dim": 256,
140
+ "contrastive_logits_temperature": 0.1,
141
+ "conv_bias": false,
142
+ "conv_dim": [
143
+ 512,
144
+ 512,
145
+ 512,
146
+ 512,
147
+ 512,
148
+ 512,
149
+ 512
150
+ ],
151
+ "conv_kernel": [
152
+ 10,
153
+ 3,
154
+ 3,
155
+ 3,
156
+ 3,
157
+ 2,
158
+ 2
159
+ ],
160
+ "conv_stride": [
161
+ 5,
162
+ 2,
163
+ 2,
164
+ 2,
165
+ 2,
166
+ 2,
167
+ 2
168
+ ],
169
+ "cross_attention_hidden_size": null,
170
+ "ctc_loss_reduction": "sum",
171
+ "ctc_zero_infinity": false,
172
+ "decoder_start_token_id": null,
173
+ "diversity_loss_weight": 0.1,
174
+ "diversity_penalty": 0.0,
175
+ "do_sample": false,
176
+ "do_stable_layer_norm": false,
177
+ "early_stopping": false,
178
+ "encoder_no_repeat_ngram_size": 0,
179
+ "eos_token_id": 2,
180
+ "exponential_decay_length_penalty": null,
181
+ "feat_extract_activation": "gelu",
182
+ "feat_extract_norm": "group",
183
+ "feat_proj_dropout": 0.0,
184
+ "feat_quantizer_dropout": 0.0,
185
+ "final_dropout": 0.0,
186
+ "finetuning_task": null,
187
+ "forced_bos_token_id": null,
188
+ "forced_eos_token_id": null,
189
+ "freeze_feat_extract_train": true,
190
+ "hidden_act": "gelu",
191
+ "hidden_dropout": 0.1,
192
+ "hidden_size": 768,
193
+ "id2label": {
194
+ "0": "LABEL_0",
195
+ "1": "LABEL_1"
196
+ },
197
+ "initializer_range": 0.02,
198
+ "intermediate_size": 3072,
199
+ "is_decoder": false,
200
+ "is_encoder_decoder": false,
201
+ "label2id": {
202
+ "LABEL_0": 0,
203
+ "LABEL_1": 1
204
+ },
205
+ "layer_norm_eps": 1e-05,
206
+ "layerdrop": 0.0,
207
+ "length_penalty": 1.0,
208
+ "mask_channel_length": 10,
209
+ "mask_channel_min_space": 1,
210
+ "mask_channel_other": 0.0,
211
+ "mask_channel_prob": 0.0,
212
+ "mask_channel_selection": "static",
213
+ "mask_feature_length": 30,
214
+ "mask_feature_min_masks": 1,
215
+ "mask_feature_prob": 0.3,
216
+ "mask_time_length": 30,
217
+ "mask_time_min_masks": 2,
218
+ "mask_time_min_space": 1,
219
+ "mask_time_other": 0.0,
220
+ "mask_time_prob": 0.25,
221
+ "mask_time_selection": "static",
222
+ "max_length": 20,
223
+ "min_length": 0,
224
+ "model_type": "wav2vec2",
225
+ "no_mask_channel_overlap": false,
226
+ "no_mask_time_overlap": false,
227
+ "no_repeat_ngram_size": 0,
228
+ "num_adapter_layers": 3,
229
+ "num_attention_heads": 12,
230
+ "num_beam_groups": 1,
231
+ "num_beams": 1,
232
+ "num_codevector_groups": 2,
233
+ "num_codevectors_per_group": 320,
234
+ "num_conv_pos_embedding_groups": 16,
235
+ "num_conv_pos_embeddings": 128,
236
+ "num_feat_extract_layers": 7,
237
+ "num_hidden_layers": 12,
238
+ "num_negatives": 100,
239
+ "num_return_sequences": 1,
240
+ "output_attentions": false,
241
+ "output_hidden_size": 768,
242
+ "output_hidden_states": false,
243
+ "output_scores": false,
244
+ "pad_token_id": 0,
245
+ "prefix": null,
246
+ "problem_type": null,
247
+ "proj_codevector_dim": 256,
248
+ "pruned_heads": {},
249
+ "remove_invalid_values": false,
250
+ "repetition_penalty": 1.0,
251
+ "return_dict": true,
252
+ "return_dict_in_generate": false,
253
+ "sep_token_id": null,
254
+ "suppress_tokens": null,
255
+ "task_specific_params": null,
256
+ "tdnn_dilation": [
257
+ 1,
258
+ 2,
259
+ 3,
260
+ 1,
261
+ 1
262
+ ],
263
+ "tdnn_dim": [
264
+ 512,
265
+ 512,
266
+ 512,
267
+ 512,
268
+ 1500
269
+ ],
270
+ "tdnn_kernel": [
271
+ 5,
272
+ 3,
273
+ 3,
274
+ 1,
275
+ 1
276
+ ],
277
+ "temperature": 1.0,
278
+ "tf_legacy_loss": false,
279
+ "tie_encoder_decoder": false,
280
+ "tie_word_embeddings": true,
281
+ "tokenizer_class": null,
282
+ "top_k": 50,
283
+ "top_p": 1.0,
284
+ "torch_dtype": "float32",
285
+ "torchscript": false,
286
+ "typical_p": 1.0,
287
+ "use_bfloat16": false,
288
+ "use_weighted_layer_sum": false,
289
+ "vocab_size": 32,
290
+ "xvector_output_dim": 512
291
+ },
292
+ "eos_token_id": 2,
293
+ "forced_decoder_ids": null,
294
+ "is_encoder_decoder": true,
295
+ "max_length": null,
296
+ "model_type": "speech-encoder-decoder",
297
+ "pad_token_id": 1,
298
+ "processor_class": "Wav2Vec2Processor",
299
+ "tie_word_embeddings": false,
300
+ "torch_dtype": "float32",
301
+ "transformers_version": "4.46.3",
302
+ "use_cache": false
303
+ }
checkpoint-29000/generation_config.json ADDED
@@ -0,0 +1,13 @@
1
+ {
2
+ "bos_token_id": 0,
3
+ "decoder_start_token_id": 2,
4
+ "early_stopping": true,
5
+ "eos_token_id": 2,
6
+ "forced_bos_token_id": 0,
7
+ "forced_eos_token_id": 2,
8
+ "max_length": 128,
9
+ "no_repeat_ngram_size": 3,
10
+ "num_beams": 4,
11
+ "pad_token_id": 1,
12
+ "transformers_version": "4.46.3"
13
+ }
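The `no_repeat_ngram_size` setting of 3 means that beam search never emits a token that would complete a 3-gram already present in the generated sequence. The sketch below illustrates that rule on plain token-ID lists. It is a simplified stand-in written for this explanation, not the library's implementation, and `banned_next_tokens` is a hypothetical name.

```python
def banned_next_tokens(generated, ngram_size=3):
    """Tokens that would repeat an existing n-gram if appended to `generated`."""
    if ngram_size < 1 or len(generated) + 1 < ngram_size:
        return set()
    # The last (n - 1) tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for start in range(len(generated) - ngram_size + 1):
        # If an earlier n-gram starts with the same prefix, ban its final token.
        if tuple(generated[start:start + ngram_size - 1]) == prefix:
            banned.add(generated[start + ngram_size - 1])
    return banned


# The sequence already contains the 3-gram (5, 6, 7) and now ends in (5, 6).
print(banned_next_tokens([5, 6, 7, 5, 6], ngram_size=3))
```

In this example token 7 is banned, since appending it would repeat the 3-gram (5, 6, 7).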
checkpoint-29000/model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:70aba1c61e1c77e1aca95ba945b9189c8a5d45eded939dd13ad43cd79379c1ae
3
+ size 804433536
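The three lines above are a Git LFS pointer: the repository stores only the spec version, a SHA-256 object ID and the byte size, while the weights live in LFS storage. A minimal parsing sketch follows. `parse_lfs_pointer` is a hypothetical helper, and the parameter estimate assumes all tensors are stored as float32 (as `torch_dtype` in `config.json` suggests) and ignores the small safetensors header.

```python
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:70aba1c61e1c77e1aca95ba945b9189c8a5d45eded939dd13ad43cd79379c1ae
size 804433536
"""


def parse_lfs_pointer(text):
    """Split each 'key value' line of a Git LFS pointer file into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algorithm, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = parse_lfs_pointer(POINTER)
size_mb = pointer["size_bytes"] / 1e6
# Rough upper bound on the parameter count: 4 bytes per float32 value.
approx_params = pointer["size_bytes"] // 4
print(f"{size_mb:.0f} MB, at most about {approx_params / 1e6:.0f}M float32 values")
```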
checkpoint-29000/optimizer.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:2eef776121c682b2eb63f445c736f439a4fb4b00104d44a1b42327ed9239ae89
3
+ size 1575479010
checkpoint-29000/preprocessor_config.json ADDED
@@ -0,0 +1,9 @@
1
+ {
2
+ "do_normalize": true,
3
+ "feature_extractor_type": "Wav2Vec2FeatureExtractor",
4
+ "feature_size": 1,
5
+ "padding_side": "right",
6
+ "padding_value": 0,
7
+ "return_attention_mask": false,
8
+ "sampling_rate": 16000
9
+ }
checkpoint-29000/rng_state.pth ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:bdfd60e90490dbbb552aa355c28f0d83eeb757980adc4347084c8c47548db074
3
+ size 14244
checkpoint-29000/scheduler.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:85207000dd7d5fd86930aaf7fe5ae36f2540075686490904e0ea67612f44c838
3
+ size 1064
checkpoint-29000/trainer_state.json ADDED
The diff for this file is too large to render. See raw diff
 
checkpoint-29000/training_args.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1f161da6617cf4269393afbc8cb8565cefc83bb8de7610adde49de693563d3f3
3
+ size 5624
config.json ADDED
@@ -0,0 +1,303 @@
1
+ {
2
+ "_name_or_path": "./seq2seq_wav2vec2_bart-base_24k-en-voxpopuli",
3
+ "architectures": [
4
+ "SpeechEncoderDecoderModel"
5
+ ],
6
+ "decoder": {
7
+ "_attn_implementation_autoset": true,
8
+ "_name_or_path": "facebook/bart-base",
9
+ "activation_dropout": 0.1,
10
+ "activation_function": "gelu",
11
+ "add_bias_logits": false,
12
+ "add_cross_attention": true,
13
+ "add_final_layer_norm": false,
14
+ "architectures": [
15
+ "BartModel"
16
+ ],
17
+ "attention_dropout": 0.1,
18
+ "bad_words_ids": null,
19
+ "begin_suppress_tokens": null,
20
+ "bos_token_id": 0,
21
+ "chunk_size_feed_forward": 0,
22
+ "classif_dropout": 0.1,
23
+ "classifier_dropout": 0.0,
24
+ "cross_attention_hidden_size": null,
25
+ "d_model": 768,
26
+ "decoder_attention_heads": 12,
27
+ "decoder_ffn_dim": 3072,
28
+ "decoder_layerdrop": 0.0,
29
+ "decoder_layers": 6,
30
+ "decoder_start_token_id": 2,
31
+ "diversity_penalty": 0.0,
32
+ "do_sample": false,
33
+ "dropout": 0.1,
34
+ "early_stopping": true,
35
+ "encoder_attention_heads": 12,
36
+ "encoder_ffn_dim": 3072,
37
+ "encoder_layerdrop": 0.0,
38
+ "encoder_layers": 6,
39
+ "encoder_no_repeat_ngram_size": 0,
40
+ "eos_token_id": 2,
41
+ "exponential_decay_length_penalty": null,
42
+ "finetuning_task": null,
43
+ "forced_bos_token_id": 0,
44
+ "forced_eos_token_id": 2,
45
+ "gradient_checkpointing": false,
46
+ "id2label": {
47
+ "0": "LABEL_0",
48
+ "1": "LABEL_1",
49
+ "2": "LABEL_2"
50
+ },
51
+ "init_std": 0.02,
52
+ "is_decoder": true,
53
+ "is_encoder_decoder": false,
54
+ "label2id": {
55
+ "LABEL_0": 0,
56
+ "LABEL_1": 1,
57
+ "LABEL_2": 2
58
+ },
59
+ "length_penalty": 1.0,
60
+ "max_length": 20,
61
+ "max_position_embeddings": 1024,
62
+ "min_length": 0,
63
+ "model_type": "bart",
64
+ "no_repeat_ngram_size": 3,
65
+     "normalize_before": false,
+     "normalize_embedding": true,
+     "num_beam_groups": 1,
+     "num_beams": 4,
+     "num_hidden_layers": 6,
+     "num_return_sequences": 1,
+     "output_attentions": false,
+     "output_hidden_states": false,
+     "output_scores": false,
+     "pad_token_id": 1,
+     "prefix": null,
+     "problem_type": null,
+     "pruned_heads": {},
+     "remove_invalid_values": false,
+     "repetition_penalty": 1.0,
+     "return_dict": true,
+     "return_dict_in_generate": false,
+     "scale_embedding": false,
+     "sep_token_id": null,
+     "suppress_tokens": null,
+     "task_specific_params": {
+       "summarization": {
+         "length_penalty": 1.0,
+         "max_length": 128,
+         "min_length": 12,
+         "num_beams": 4
+       },
+       "summarization_cnn": {
+         "length_penalty": 2.0,
+         "max_length": 142,
+         "min_length": 56,
+         "num_beams": 4
+       },
+       "summarization_xsum": {
+         "length_penalty": 1.0,
+         "max_length": 62,
+         "min_length": 11,
+         "num_beams": 6
+       }
+     },
+     "temperature": 1.0,
+     "tf_legacy_loss": false,
+     "tie_encoder_decoder": false,
+     "tie_word_embeddings": true,
+     "tokenizer_class": null,
+     "top_k": 50,
+     "top_p": 1.0,
+     "torch_dtype": "float32",
+     "torchscript": false,
+     "typical_p": 1.0,
+     "use_bfloat16": false,
+     "use_cache": true,
+     "vocab_size": 50265
+   },
+   "decoder_start_token_id": 0,
+   "encoder": {
+     "_attn_implementation_autoset": true,
+     "_name_or_path": "facebook/wav2vec2-base-en-voxpopuli-v2",
+     "activation_dropout": 0.0,
+     "adapter_attn_dim": null,
+     "adapter_kernel_size": 3,
+     "adapter_stride": 2,
+     "add_adapter": true,
+     "add_cross_attention": false,
+     "apply_spec_augment": true,
+     "architectures": [
+       "Wav2Vec2ForPreTraining"
+     ],
+     "attention_dropout": 0.1,
+     "bad_words_ids": null,
+     "begin_suppress_tokens": null,
+     "bos_token_id": 1,
+     "chunk_size_feed_forward": 0,
+     "classifier_proj_size": 256,
+     "codevector_dim": 256,
+     "contrastive_logits_temperature": 0.1,
+     "conv_bias": false,
+     "conv_dim": [
+       512,
+       512,
+       512,
+       512,
+       512,
+       512,
+       512
+     ],
+     "conv_kernel": [
+       10,
+       3,
+       3,
+       3,
+       3,
+       2,
+       2
+     ],
+     "conv_stride": [
+       5,
+       2,
+       2,
+       2,
+       2,
+       2,
+       2
+     ],
+     "cross_attention_hidden_size": null,
+     "ctc_loss_reduction": "sum",
+     "ctc_zero_infinity": false,
+     "decoder_start_token_id": null,
+     "diversity_loss_weight": 0.1,
+     "diversity_penalty": 0.0,
+     "do_sample": false,
+     "do_stable_layer_norm": false,
+     "early_stopping": false,
+     "encoder_no_repeat_ngram_size": 0,
+     "eos_token_id": 2,
+     "exponential_decay_length_penalty": null,
+     "feat_extract_activation": "gelu",
+     "feat_extract_norm": "group",
+     "feat_proj_dropout": 0.0,
+     "feat_quantizer_dropout": 0.0,
+     "final_dropout": 0.0,
+     "finetuning_task": null,
+     "forced_bos_token_id": null,
+     "forced_eos_token_id": null,
+     "freeze_feat_extract_train": true,
+     "hidden_act": "gelu",
+     "hidden_dropout": 0.1,
+     "hidden_size": 768,
+     "id2label": {
+       "0": "LABEL_0",
+       "1": "LABEL_1"
+     },
+     "initializer_range": 0.02,
+     "intermediate_size": 3072,
+     "is_decoder": false,
+     "is_encoder_decoder": false,
+     "label2id": {
+       "LABEL_0": 0,
+       "LABEL_1": 1
+     },
+     "layer_norm_eps": 1e-05,
+     "layerdrop": 0.0,
+     "length_penalty": 1.0,
+     "mask_channel_length": 10,
+     "mask_channel_min_space": 1,
+     "mask_channel_other": 0.0,
+     "mask_channel_prob": 0.0,
+     "mask_channel_selection": "static",
+     "mask_feature_length": 30,
+     "mask_feature_min_masks": 1,
+     "mask_feature_prob": 0.3,
+     "mask_time_length": 30,
+     "mask_time_min_masks": 2,
+     "mask_time_min_space": 1,
+     "mask_time_other": 0.0,
+     "mask_time_prob": 0.25,
+     "mask_time_selection": "static",
+     "max_length": 20,
+     "min_length": 0,
+     "model_type": "wav2vec2",
+     "no_mask_channel_overlap": false,
+     "no_mask_time_overlap": false,
+     "no_repeat_ngram_size": 0,
+     "num_adapter_layers": 3,
+     "num_attention_heads": 12,
+     "num_beam_groups": 1,
+     "num_beams": 1,
+     "num_codevector_groups": 2,
+     "num_codevectors_per_group": 320,
+     "num_conv_pos_embedding_groups": 16,
+     "num_conv_pos_embeddings": 128,
+     "num_feat_extract_layers": 7,
+     "num_hidden_layers": 12,
+     "num_negatives": 100,
+     "num_return_sequences": 1,
+     "output_attentions": false,
+     "output_hidden_size": 768,
+     "output_hidden_states": false,
+     "output_scores": false,
+     "pad_token_id": 0,
+     "prefix": null,
+     "problem_type": null,
+     "proj_codevector_dim": 256,
+     "pruned_heads": {},
+     "remove_invalid_values": false,
+     "repetition_penalty": 1.0,
+     "return_dict": true,
+     "return_dict_in_generate": false,
+     "sep_token_id": null,
+     "suppress_tokens": null,
+     "task_specific_params": null,
+     "tdnn_dilation": [
+       1,
+       2,
+       3,
+       1,
+       1
+     ],
+     "tdnn_dim": [
+       512,
+       512,
+       512,
+       512,
+       1500
+     ],
+     "tdnn_kernel": [
+       5,
+       3,
+       3,
+       1,
+       1
+     ],
+     "temperature": 1.0,
+     "tf_legacy_loss": false,
+     "tie_encoder_decoder": false,
+     "tie_word_embeddings": true,
+     "tokenizer_class": null,
+     "top_k": 50,
+     "top_p": 1.0,
+     "torch_dtype": "float32",
+     "torchscript": false,
+     "typical_p": 1.0,
+     "use_bfloat16": false,
+     "use_weighted_layer_sum": false,
+     "vocab_size": 32,
+     "xvector_output_dim": 512
+   },
+   "eos_token_id": 2,
+   "forced_decoder_ids": null,
+   "is_encoder_decoder": true,
+   "max_length": null,
+   "model_type": "speech-encoder-decoder",
+   "pad_token_id": 1,
+   "processor_class": "Wav2Vec2Processor",
+   "tie_word_embeddings": false,
+   "torch_dtype": "float32",
+   "transformers_version": "4.46.3",
+   "use_cache": false
+ }
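The encoder settings above (`conv_kernel`, `conv_stride`, `add_adapter`, `num_adapter_layers`, `adapter_stride`) determine how many encoder frames the BART decoder attends to for a given clip. The following is a minimal sketch of that length arithmetic, not code from this repository. The function names are hypothetical, and the adapter step assumes a kernel of 3 with padding 1, which for length purposes behaves like a kernel of 1.

```python
# Values copied from the "encoder" section of config.json.
CONV_KERNEL = [10, 3, 3, 3, 3, 2, 2]
CONV_STRIDE = [5, 2, 2, 2, 2, 2, 2]
NUM_ADAPTER_LAYERS = 3
ADAPTER_STRIDE = 2


def conv_out_length(length: int, kernel: int, stride: int) -> int:
    """Output length of an unpadded 1-D convolution."""
    return (length - kernel) // stride + 1


def encoder_output_length(num_samples: int, add_adapter: bool = True) -> int:
    """Number of encoder frames produced for `num_samples` raw audio samples."""
    length = num_samples
    # Seven-layer convolutional feature extractor.
    for kernel, stride in zip(CONV_KERNEL, CONV_STRIDE):
        length = conv_out_length(length, kernel, stride)
    if add_adapter:
        # Assumption: each adapter layer is a stride-2 convolution with
        # kernel 3 and padding 1, equivalent to kernel 1 for the length.
        for _ in range(NUM_ADAPTER_LAYERS):
            length = conv_out_length(length, 1, ADAPTER_STRIDE)
    return length


if __name__ == "__main__":
    one_second = 16000  # sampling_rate in preprocessor_config.json
    print(encoder_output_length(one_second, add_adapter=False))
    print(encoder_output_length(one_second, add_adapter=True))
```

Under these assumptions, the three adapter layers shorten the encoder output by roughly a factor of eight, which reduces the cost of the decoder's cross-attention.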
create_model.py ADDED
@@ -0,0 +1,49 @@
+ from transformers import SpeechEncoderDecoderModel, AutoFeatureExtractor, AutoTokenizer
+
+ # Encoder for speech feature extraction
+ encoder_checkpoint = "facebook/wav2vec2-base-en-voxpopuli-v2"
+ # Decoder for text generation + its tokenizer
+ decoder_checkpoint = "facebook/bart-base"
+
+ # Path where this initial combined model is saved
+ # This path is then used as --model_name_or_path in the fine-tuning script
+ # e.g., "./seq2seq_wav2vec2_bart-base_24k-en-voxpopuli"
+ INITIAL_MODEL_SAVE_PATH = "path_to_save_initial_model"
+
+ model = SpeechEncoderDecoderModel.from_encoder_decoder_pretrained(
+     encoder_checkpoint,
+     decoder_checkpoint,
+     encoder_add_adapter=True,  # Enables adapter mechanism
+     encoder_num_adapter_layers=3,  # Specifies 3 adapter layers
+ )
+
+ # Configure encoder properties (example from thesis experiments)
+ model.config.encoder.feat_proj_dropout = 0.0
+ # model.config.encoder.mask_time_prob = 0.0  # No SpecAugment at initialization
+
+ # Configure decoder start token, pad token, eos token from the decoder's config
+ model.config.decoder_start_token_id = model.decoder.config.bos_token_id
+ model.config.pad_token_id = (
+     model.decoder.config.pad_token_id
+ )  # Or tokenizer.pad_token_id
+ model.config.eos_token_id = (
+     model.decoder.config.eos_token_id
+ )  # Or tokenizer.eos_token_id
+
+ # Configure generation parameters
+ model.config.max_length = 128
+ model.config.encoder.layerdrop = 0.0
+ model.config.use_cache = False  # Important for training
+
+ # Save the initialized model, feature extractor, and tokenizer
+ model.save_pretrained(INITIAL_MODEL_SAVE_PATH)
+
+ feature_extractor = AutoFeatureExtractor.from_pretrained(encoder_checkpoint)
+ feature_extractor.save_pretrained(INITIAL_MODEL_SAVE_PATH)
+
+ tokenizer = AutoTokenizer.from_pretrained(decoder_checkpoint)
+ tokenizer.save_pretrained(INITIAL_MODEL_SAVE_PATH)
+
+ print(
+     f"Initialized model, feature extractor, and tokenizer saved to {INITIAL_MODEL_SAVE_PATH}"
+ )
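The script sets `decoder_start_token_id` and `pad_token_id` because an encoder-decoder model trained with teacher forcing derives the decoder inputs from the labels by shifting them one position to the right. The sketch below illustrates that shift on plain Python lists. It is an illustration under stated assumptions, not the library's implementation: the function name `shift_tokens_right_lists` is hypothetical, and `-100` is assumed to be the ignore index used to mask padded label positions.

```python
from typing import List

IGNORE_INDEX = -100  # assumption: label positions masked out of the loss


def shift_tokens_right_lists(
    labels: List[List[int]], pad_token_id: int, decoder_start_token_id: int
) -> List[List[int]]:
    """Build decoder inputs: prepend the start token and drop the last label."""
    shifted = []
    for row in labels:
        new_row = [decoder_start_token_id] + row[:-1]
        # Masked label positions become ordinary padding in the decoder input.
        new_row = [pad_token_id if tok == IGNORE_INDEX else tok for tok in new_row]
        shifted.append(new_row)
    return shifted


if __name__ == "__main__":
    # Token ids follow the BART tokenizer in this commit: <s>=0, <pad>=1, </s>=2.
    labels = [[0, 100, 200, 2], [0, 5, 2, -100, -100]]
    print(shift_tokens_right_lists(labels, pad_token_id=1, decoder_start_token_id=0))
```

With `decoder_start_token_id = 0`, as written by `create_model.py`, every decoder input begins with `<s>`. Note that `generation_config.json` in the same commit lists `decoder_start_token_id: 2`.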
eval_dev_results.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "epoch": 20.0,
+   "eval_dev_loss": 1.0564184188842773,
+   "eval_dev_runtime": 121.5437,
+   "eval_dev_samples_per_second": 13.09,
+   "eval_dev_steps_per_second": 0.14,
+   "eval_dev_wer": 0.08554638942253362,
+   "eval_samples": 1591
+ }
eval_test_results.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "epoch": 20.0,
+   "eval_samples": 1705,
+   "eval_test_loss": 1.0758554935455322,
+   "eval_test_runtime": 132.2526,
+   "eval_test_samples_per_second": 12.892,
+   "eval_test_steps_per_second": 0.136,
+   "eval_test_wer": 0.08848048503220916
+ }
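The `eval_dev_wer` and `eval_test_wer` values are word error rates: substitutions, deletions, and insertions divided by the number of reference words, so 0.088 corresponds to roughly 8.8 errors per 100 reference words. The sketch below computes that quantity with a word-level edit distance. It is a generic illustration, not the evaluation code used for this run. The function name is hypothetical, and any text normalization applied before scoring in the actual run is not shown.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[len(hyp)] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over six words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```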
generation_config.json ADDED
@@ -0,0 +1,13 @@
+ {
+   "bos_token_id": 0,
+   "decoder_start_token_id": 2,
+   "early_stopping": true,
+   "eos_token_id": 2,
+   "forced_bos_token_id": 0,
+   "forced_eos_token_id": 2,
+   "max_length": 128,
+   "no_repeat_ngram_size": 3,
+   "num_beams": 4,
+   "pad_token_id": 1,
+   "transformers_version": "4.46.3"
+ }
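Among the generation settings above, `no_repeat_ngram_size: 3` means that no 3-gram of token ids may occur twice in a generated sequence. The sketch below shows the idea for a single hypothesis: given the tokens generated so far, it returns the next tokens that would complete an already-seen n-gram and therefore have to be blocked. The function name is hypothetical, and the code is an illustration of the constraint rather than the library's logits processor.

```python
from typing import List, Set


def banned_next_tokens(generated: List[int], ngram_size: int) -> Set[int]:
    """Tokens that would repeat an existing n-gram if emitted next."""
    if ngram_size <= 0 or len(generated) < ngram_size - 1:
        return set()
    # The last (n-1) tokens form the prefix of the n-gram about to be completed.
    prefix_len = ngram_size - 1
    prefix = tuple(generated[len(generated) - prefix_len:])
    banned = set()
    for start in range(len(generated) - ngram_size + 1):
        if tuple(generated[start:start + prefix_len]) == prefix:
            banned.add(generated[start + prefix_len])
    return banned


if __name__ == "__main__":
    # The sequence ends in (1, 2); the 3-gram (1, 2, 3) already occurred,
    # so emitting 3 next would repeat it.
    print(banned_next_tokens([1, 2, 3, 1, 2], ngram_size=3))
```

During beam search with `num_beams: 4`, a check of this kind would apply to each beam separately.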
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:70aba1c61e1c77e1aca95ba945b9189c8a5d45eded939dd13ad43cd79379c1ae
+ size 804433536
preprocessor_config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "do_normalize": true,
+   "feature_extractor_type": "Wav2Vec2FeatureExtractor",
+   "feature_size": 1,
+   "padding_side": "right",
+   "padding_value": 0,
+   "return_attention_mask": false,
+   "sampling_rate": 16000
+ }
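With `do_normalize: true`, each waveform is rescaled to zero mean and unit variance before it reaches the encoder, and `sampling_rate: 16000` states the rate the audio is expected to have. The sketch below shows that normalization on a plain list of samples. It is a minimal illustration, not the feature extractor's own code: the function name is hypothetical, and the small epsilon of `1e-7` is an assumption added to avoid division by zero on silent input.

```python
import math
from typing import List


def zero_mean_unit_var(samples: List[float], eps: float = 1e-7) -> List[float]:
    """Rescale one waveform to zero mean and (approximately) unit variance."""
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((x - mean) ** 2 for x in samples) / n
    scale = math.sqrt(variance + eps)  # eps is an assumed stabilizer
    return [(x - mean) / scale for x in samples]


if __name__ == "__main__":
    waveform = [0.1, 0.2, 0.3, 0.4]
    normalized = zero_mean_unit_var(waveform)
    print(normalized)
```

Because `return_attention_mask` is `false`, padded batches carry no mask here, so normalizing each clip before padding keeps the statistics from being skewed by the padding value.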
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
+ {
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "cls_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<pad>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,57 @@
+ {
+   "add_prefix_space": false,
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "50264": {
+       "content": "<mask>",
+       "lstrip": true,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "errors": "replace",
+   "mask_token": "<mask>",
+   "model_max_length": 1000000000000000019884624838656,
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "tokenizer_class": "BartTokenizer",
+   "trim_offsets": true,
+   "unk_token": "<unk>"
+ }
train_results.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "epoch": 20.0,
+   "total_flos": 0.0,
+   "train_loss": 1.6298207611684346,
+   "train_runtime": 35628.5116,
+   "train_samples": 167046,
+   "train_samples_per_second": 93.771,
+   "train_steps_per_second": 0.977
+ }
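The throughput figures in `train_results.json` can be cross-checked from the sample count, epoch count, and runtime. The sketch below redoes that arithmetic. It assumes an effective batch size of 96, taken from `per_device_train_batch_size` in the wandb `config.yaml` of this commit together with `gradient_accumulation_steps: 1`, and it assumes a single training device, which the files shown here do not state explicitly.

```python
import math

# Values from train_results.json.
TRAIN_SAMPLES = 167046
EPOCHS = 20
TRAIN_RUNTIME_S = 35628.5116

# Assumption: one device, batch size 96, no gradient accumulation,
# and the last partial batch of each epoch is kept.
BATCH_SIZE = 96

steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
samples_per_second = TRAIN_SAMPLES * EPOCHS / TRAIN_RUNTIME_S
steps_per_second = total_steps / TRAIN_RUNTIME_S

if __name__ == "__main__":
    print(f"steps per epoch:    {steps_per_epoch}")
    print(f"total steps:        {total_steps}")
    print(f"samples per second: {samples_per_second:.3f}")
    print(f"steps per second:   {steps_per_second:.3f}")
    print(f"runtime in hours:   {TRAIN_RUNTIME_S / 3600:.2f}")
```

Under these assumptions the run comes to 34,820 optimizer steps in just under ten hours, which is consistent with the reported rates and with the `checkpoint-29000` directory falling inside the run.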
trainer_state.json ADDED
The diff for this file is too large to render. See raw diff
 
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1f161da6617cf4269393afbc8cb8565cefc83bb8de7610adde49de693563d3f3
+ size 5624
vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
wandb/.DS_Store ADDED
Binary file (6.15 kB). View file
 
wandb/run-20250515_192303-7xkscxrj/files/config.yaml ADDED
@@ -0,0 +1,1039 @@
1
+ _attn_implementation_autoset:
2
+ value: true
3
+ _name_or_path:
4
+ value: ./seq2seq_wav2vec2_bart-base_24k-en-voxpopuli
5
+ _wandb:
6
+ value:
7
+ cli_version: 0.19.7
8
+ m:
9
+ - "1": train/global_step
10
+ "6":
11
+ - 3
12
+ "7": []
13
+ - "1": eval/wer
14
+ "5": 1
15
+ "6":
16
+ - 1
17
+ - 3
18
+ "7": []
19
+ - "1": eval/dev_runtime
20
+ "5": 1
21
+ "6":
22
+ - 1
23
+ - 3
24
+ "7": []
25
+ - "1": model_speed2size1_table.path
26
+ "5": 1
27
+ "6":
28
+ - 1
29
+ - 3
30
+ "7": []
31
+ - "1": eval/test_steps_per_second
32
+ "5": 1
33
+ "6":
34
+ - 1
35
+ - 3
36
+ "7": []
37
+ - "1": train/learning_rate
38
+ "5": 1
39
+ "6":
40
+ - 1
41
+ - 3
42
+ "7": []
43
+ - "1": substitutions
44
+ "5": 1
45
+ "6":
46
+ - 1
47
+ - 3
48
+ "7": []
49
+ - "1": eval/samples_per_second
50
+ "5": 1
51
+ "6":
52
+ - 1
53
+ - 3
54
+ "7": []
55
+ - "1": eval/test_runtime
56
+ "5": 1
57
+ "6":
58
+ - 1
59
+ - 3
60
+ "7": []
61
+ - "1": eval/dev_loss
62
+ "5": 1
63
+ "6":
64
+ - 1
65
+ - 3
66
+ "7": []
67
+ - "1": eval/test_samples_per_second
68
+ "5": 1
69
+ "6":
70
+ - 1
71
+ - 3
72
+ "7": []
73
+ - "1": model_speed2size2_table.sha256
74
+ "5": 1
75
+ "6":
76
+ - 1
77
+ - 3
78
+ "7": []
79
+ - "1": model_speed2size2_table._latest_artifact_path
80
+ "5": 1
81
+ "6":
82
+ - 1
83
+ - 3
84
+ "7": []
85
+ - "1": model_speed2size1_table._type
86
+ "5": 1
87
+ "6":
88
+ - 1
89
+ - 3
90
+ "7": []
91
+ - "1": model_speed2size1_table.artifact_path
92
+ "5": 1
93
+ "6":
94
+ - 1
95
+ - 3
96
+ "7": []
97
+ - "1": word_accuracy
98
+ "5": 1
99
+ "6":
100
+ - 1
101
+ - 3
102
+ "7": []
103
+ - "1": eval/loss
104
+ "5": 1
105
+ "6":
106
+ - 1
107
+ - 3
108
+ "7": []
109
+ - "1": eval/steps_per_second
110
+ "5": 1
111
+ "6":
112
+ - 1
113
+ - 3
114
+ "7": []
115
+ - "1": eval/test_loss
116
+ "5": 1
117
+ "6":
118
+ - 1
119
+ - 3
120
+ "7": []
121
+ - "1": model_speed2size1_table.nrows
122
+ "5": 1
123
+ "6":
124
+ - 1
125
+ - 3
126
+ "7": []
127
+ - "1": model_speed2size1_table.sha256
128
+ "5": 1
129
+ "6":
130
+ - 1
131
+ - 3
132
+ "7": []
133
+ - "1": model_speed2size1_table.size
134
+ "5": 1
135
+ "6":
136
+ - 1
137
+ - 3
138
+ "7": []
139
+ - "1": model_speed2size2_table.path
140
+ "5": 1
141
+ "6":
142
+ - 1
143
+ - 3
144
+ "7": []
145
+ - "1": train/epoch
146
+ "5": 1
147
+ "6":
148
+ - 1
149
+ - 3
150
+ "7": []
151
+ - "1": insertions
152
+ "5": 1
153
+ "6":
154
+ - 1
155
+ - 3
156
+ "7": []
157
+ - "1": word_errors
158
+ "5": 1
159
+ "6":
160
+ - 1
161
+ - 3
162
+ "7": []
163
+ - "1": eval/runtime
164
+ "5": 1
165
+ "6":
166
+ - 1
167
+ - 3
168
+ "7": []
169
+ - "1": model_speed2size2_table.ncols
170
+ "5": 1
171
+ "6":
172
+ - 1
173
+ - 3
174
+ "7": []
175
+ - "1": model_speed2size2_table._type
176
+ "5": 1
177
+ "6":
178
+ - 1
179
+ - 3
180
+ "7": []
181
+ - "1": model_speed2size2_table.size
182
+ "5": 1
183
+ "6":
184
+ - 1
185
+ - 3
186
+ "7": []
187
+ - "1": model_speed2size2_table.artifact_path
188
+ "5": 1
189
+ "6":
190
+ - 1
191
+ - 3
192
+ "7": []
193
+ - "1": test_sample_index
194
+ "5": 1
195
+ "6":
196
+ - 1
197
+ - 3
198
+ "7": []
199
+ - "1": train/loss
200
+ "5": 1
201
+ "6":
202
+ - 1
203
+ - 3
204
+ "7": []
205
+ - "1": sentence_errors
206
+ "5": 1
207
+ "6":
208
+ - 1
209
+ - 3
210
+ "7": []
211
+ - "1": eval/dev_samples_per_second
212
+ "5": 1
213
+ "6":
214
+ - 1
215
+ - 3
216
+ "7": []
217
+ - "1": model_speed2size1_table.ncols
218
+ "5": 1
219
+ "6":
220
+ - 1
221
+ - 3
222
+ "7": []
223
+ - "1": train/grad_norm
224
+ "5": 1
225
+ "6":
226
+ - 1
227
+ - 3
228
+ "7": []
229
+ - "1": deletions
230
+ "5": 1
231
+ "6":
232
+ - 1
233
+ - 3
234
+ "7": []
235
+ - "1": eval/dev_wer
236
+ "5": 1
237
+ "6":
238
+ - 1
239
+ - 3
240
+ "7": []
241
+ - "1": model_speed2size2_table.nrows
242
+ "5": 1
243
+ "6":
244
+ - 1
245
+ - 3
246
+ "7": []
247
+ - "1": eval/dev_steps_per_second
248
+ "5": 1
249
+ "6":
250
+ - 1
251
+ - 3
252
+ "7": []
253
+ - "1": eval/test_wer
254
+ "5": 1
255
+ "6":
256
+ - 1
257
+ - 3
258
+ "7": []
259
+ - "1": model_speed2size1_table._latest_artifact_path
260
+ "5": 1
261
+ "6":
262
+ - 1
263
+ - 3
264
+ "7": []
265
+ python_version: 3.11.11
266
+ t:
267
+ "1":
268
+ - 1
269
+ - 5
270
+ - 11
271
+ - 41
272
+ - 49
273
+ - 51
274
+ - 53
275
+ - 55
276
+ - 71
277
+ - 98
278
+ - 100
279
+ "2":
280
+ - 1
281
+ - 5
282
+ - 11
283
+ - 41
284
+ - 49
285
+ - 51
286
+ - 53
287
+ - 55
288
+ - 71
289
+ - 98
290
+ - 100
291
+ "3":
292
+ - 2
293
+ - 7
294
+ - 13
295
+ - 19
296
+ - 23
297
+ - 55
298
+ - 62
299
+ - 66
300
+ "4": 3.11.11
301
+ "5": 0.19.7
302
+ "6": 4.46.3
303
+ "8":
304
+ - 5
305
+ "9":
306
+ "1": transformers_trainer
307
+ "12": 0.19.7
308
+ "13": linux-x86_64
309
+ visualize:
310
+ model_speed2size1:
311
+ panel_config:
312
+ fieldSettings:
313
+ x: Time per step
314
+ "y": Trainable parameters
315
+ panelDefId: wandb/scatter/v0
316
+ stringSettings:
317
+ title: ""
318
+ transform:
319
+ name: tableWithLeafColNames
320
+ userQuery:
321
+ queryFields:
322
+ - args:
323
+ - name: runSets
324
+ value: ${runSets}
325
+ fields:
326
+ - fields: []
327
+ name: id
328
+ - fields: []
329
+ name: name
330
+ - fields: []
331
+ name: _defaultColorIndex
332
+ - args:
333
+ - name: tableKey
334
+ value: model_speed2size1_table
335
+ fields: []
336
+ name: summaryTable
337
+ name: runSets
338
+ panel_type: Vega2
339
+ model_speed2size2:
340
+ panel_config:
341
+ fieldSettings:
342
+ x: Time per step
343
+ "y": Total parameters
344
+ panelDefId: wandb/scatter/v0
345
+ stringSettings:
346
+ title: ""
347
+ transform:
348
+ name: tableWithLeafColNames
349
+ userQuery:
350
+ queryFields:
351
+ - args:
352
+ - name: runSets
353
+ value: ${runSets}
354
+ fields:
355
+ - fields: []
356
+ name: id
357
+ - fields: []
358
+ name: name
359
+ - fields: []
360
+ name: _defaultColorIndex
361
+ - args:
362
+ - name: tableKey
363
+ value: model_speed2size2_table
364
+ fields: []
365
+ name: summaryTable
366
+ name: runSets
367
+ panel_type: Vega2
368
+ accelerator_config:
369
+ value:
370
+ dispatch_batches: null
371
+ even_batches: true
372
+ gradient_accumulation_kwargs: null
373
+ non_blocking: false
374
+ split_batches: false
375
+ use_seedable_sampler: true
376
+ adafactor:
377
+ value: false
378
+ adam_beta1:
379
+ value: 0.9
380
+ adam_beta2:
381
+ value: 0.999
382
+ adam_epsilon:
383
+ value: 1e-08
384
+ add_cross_attention:
385
+ value: false
386
+ architectures:
387
+ value:
388
+ - SpeechEncoderDecoderModel
389
+ auto_find_batch_size:
390
+ value: false
391
+ average_tokens_across_devices:
392
+ value: false
393
+ bad_words_ids:
394
+ value: null
395
+ batch_eval_metrics:
396
+ value: false
397
+ begin_suppress_tokens:
398
+ value: null
399
+ bf16:
400
+ value: true
401
+ bf16_full_eval:
402
+ value: false
403
+ bos_token_id:
404
+ value: null
405
+ chunk_size_feed_forward:
406
+ value: 0
407
+ cross_attention_hidden_size:
408
+ value: null
409
+ data_seed:
410
+ value: null
411
+ dataloader_drop_last:
412
+ value: false
413
+ dataloader_num_workers:
414
+ value: 16
415
+ dataloader_persistent_workers:
416
+ value: false
417
+ dataloader_pin_memory:
418
+ value: true
419
+ dataloader_prefetch_factor:
420
+ value: 2
421
+ ddp_backend:
422
+ value: null
423
+ ddp_broadcast_buffers:
424
+ value: null
425
+ ddp_bucket_cap_mb:
426
+ value: null
427
+ ddp_find_unused_parameters:
428
+ value: null
429
+ ddp_timeout:
430
+ value: 1800
431
+ debug:
432
+ value: []
433
+ decoder:
434
+ value:
435
+ _attn_implementation_autoset: true
436
+ _name_or_path: facebook/bart-base
437
+ activation_dropout: 0.1
438
+ activation_function: gelu
439
+ add_bias_logits: false
440
+ add_cross_attention: true
441
+ add_final_layer_norm: false
442
+ architectures:
443
+ - BartModel
444
+ attention_dropout: 0.1
445
+ bad_words_ids: null
446
+ begin_suppress_tokens: null
447
+ bos_token_id: 0
448
+ chunk_size_feed_forward: 0
449
+ classif_dropout: 0.1
450
+ classifier_dropout: 0
451
+ cross_attention_hidden_size: null
452
+ d_model: 768
453
+ decoder_attention_heads: 12
454
+ decoder_ffn_dim: 3072
455
+ decoder_layerdrop: 0
456
+ decoder_layers: 6
457
+ decoder_start_token_id: 2
458
+ diversity_penalty: 0
459
+ do_sample: false
460
+ dropout: 0.1
461
+ early_stopping: true
462
+ encoder_attention_heads: 12
463
+ encoder_ffn_dim: 3072
464
+ encoder_layerdrop: 0
465
+ encoder_layers: 6
466
+ encoder_no_repeat_ngram_size: 0
467
+ eos_token_id: 2
468
+ exponential_decay_length_penalty: null
469
+ finetuning_task: null
470
+ forced_bos_token_id: 0
471
+ forced_eos_token_id: 2
472
+ gradient_checkpointing: false
473
+ id2label:
474
+ "0": LABEL_0
475
+ "1": LABEL_1
476
+ "2": LABEL_2
477
+ init_std: 0.02
478
+ is_decoder: true
479
+ is_encoder_decoder: false
480
+ label2id:
481
+ LABEL_0: 0
482
+ LABEL_1: 1
483
+ LABEL_2: 2
484
+ length_penalty: 1
485
+ max_length: 20
486
+ max_position_embeddings: 1024
487
+ min_length: 0
488
+ model_type: bart
489
+ no_repeat_ngram_size: 3
490
+ normalize_before: false
491
+ normalize_embedding: true
492
+ num_beam_groups: 1
493
+ num_beams: 4
494
+ num_hidden_layers: 6
495
+ num_return_sequences: 1
496
+ output_attentions: false
497
+ output_hidden_states: false
498
+ output_scores: false
499
+ pad_token_id: 1
500
+ prefix: null
501
+ problem_type: null
502
+ remove_invalid_values: false
503
+ repetition_penalty: 1
504
+ return_dict: true
505
+ return_dict_in_generate: false
506
+ scale_embedding: false
507
+ sep_token_id: null
508
+ suppress_tokens: null
509
+ task_specific_params:
510
+ summarization:
511
+ length_penalty: 1
512
+ max_length: 128
513
+ min_length: 12
514
+ num_beams: 4
515
+ summarization_cnn:
516
+ length_penalty: 2
517
+ max_length: 142
518
+ min_length: 56
519
+ num_beams: 4
520
+ summarization_xsum:
521
+ length_penalty: 1
522
+ max_length: 62
523
+ min_length: 11
524
+ num_beams: 6
525
+ temperature: 1
526
+ tf_legacy_loss: false
527
+ tie_encoder_decoder: false
528
+ tie_word_embeddings: true
529
+ tokenizer_class: null
530
+ top_k: 50
531
+ top_p: 1
532
+ torch_dtype: float32
533
+ torchscript: false
534
+ typical_p: 1
535
+ use_bfloat16: false
536
+ use_cache: true
537
+ vocab_size: 50265
538
+ decoder_start_token_id:
539
+ value: 0
540
+ deepspeed:
541
+ value: null
542
+ disable_tqdm:
543
+ value: false
544
+ dispatch_batches:
545
+ value: null
546
+ diversity_penalty:
547
+ value: 0
548
+ do_eval:
549
+ value: true
550
+ do_predict:
551
+ value: true
552
+ do_sample:
553
+ value: false
554
+ do_train:
555
+ value: true
556
+ early_stopping:
557
+ value: false
558
+ encoder:
559
+ value:
560
+ _attn_implementation_autoset: true
561
+ _name_or_path: facebook/wav2vec2-base-en-voxpopuli-v2
562
+ activation_dropout: 0
563
+ adapter_attn_dim: null
564
+ adapter_kernel_size: 3
565
+ adapter_stride: 2
566
+ add_adapter: true
567
+ add_cross_attention: false
568
+ apply_spec_augment: true
569
+ architectures:
570
+ - Wav2Vec2ForPreTraining
571
+ attention_dropout: 0.1
572
+ bad_words_ids: null
573
+ begin_suppress_tokens: null
574
+ bos_token_id: 1
575
+ chunk_size_feed_forward: 0
576
+ classifier_proj_size: 256
577
+ codevector_dim: 256
578
+ contrastive_logits_temperature: 0.1
579
+ conv_bias: false
580
+ conv_dim:
581
+ - 512
582
+ - 512
583
+ - 512
584
+ - 512
585
+ - 512
586
+ - 512
587
+ - 512
588
+ conv_kernel:
589
+ - 10
590
+ - 3
591
+ - 3
592
+ - 3
593
+ - 3
594
+ - 2
595
+ - 2
596
+ conv_stride:
597
+ - 5
598
+ - 2
599
+ - 2
600
+ - 2
601
+ - 2
602
+ - 2
603
+ - 2
604
+ cross_attention_hidden_size: null
605
+ ctc_loss_reduction: sum
606
+ ctc_zero_infinity: false
607
+ decoder_start_token_id: null
608
+ diversity_loss_weight: 0.1
609
+ diversity_penalty: 0
610
+ do_sample: false
611
+ do_stable_layer_norm: false
612
+ early_stopping: false
613
+ encoder_no_repeat_ngram_size: 0
614
+ eos_token_id: 2
615
+ exponential_decay_length_penalty: null
616
+ feat_extract_activation: gelu
617
+ feat_extract_norm: group
618
+ feat_proj_dropout: 0
619
+ feat_quantizer_dropout: 0
620
+ final_dropout: 0
621
+ finetuning_task: null
622
+ forced_bos_token_id: null
623
+ forced_eos_token_id: null
624
+ freeze_feat_extract_train: true
625
+ hidden_act: gelu
626
+ hidden_dropout: 0.1
627
+ hidden_size: 768
628
+ id2label:
629
+ "0": LABEL_0
630
+ "1": LABEL_1
631
+ initializer_range: 0.02
632
+ intermediate_size: 3072
633
+ is_decoder: false
634
+ is_encoder_decoder: false
635
+ label2id:
636
+ LABEL_0: 0
637
+ LABEL_1: 1
638
+ layer_norm_eps: 1e-05
639
+ layerdrop: 0
640
+ length_penalty: 1
641
+ mask_channel_length: 10
642
+ mask_channel_min_space: 1
643
+ mask_channel_other: 0
644
+ mask_channel_prob: 0
645
+ mask_channel_selection: static
646
+ mask_feature_length: 30
647
+ mask_feature_min_masks: 1
648
+ mask_feature_prob: 0.3
649
+ mask_time_length: 30
650
+ mask_time_min_masks: 2
651
+ mask_time_min_space: 1
652
+ mask_time_other: 0
653
+ mask_time_prob: 0.25
654
+ mask_time_selection: static
655
+ max_length: 20
656
+ min_length: 0
657
+ model_type: wav2vec2
658
+ no_mask_channel_overlap: false
659
+ no_mask_time_overlap: false
660
+ no_repeat_ngram_size: 0
661
+ num_adapter_layers: 3
662
+ num_attention_heads: 12
663
+ num_beam_groups: 1
664
+ num_beams: 1
665
+ num_codevector_groups: 2
666
+ num_codevectors_per_group: 320
667
+ num_conv_pos_embedding_groups: 16
668
+ num_conv_pos_embeddings: 128
669
+ num_feat_extract_layers: 7
670
+ num_hidden_layers: 12
671
+ num_negatives: 100
672
+ num_return_sequences: 1
673
+ output_attentions: false
674
+ output_hidden_size: 768
675
+ output_hidden_states: false
676
+ output_scores: false
677
+ pad_token_id: 0
678
+ prefix: null
679
+ problem_type: null
680
+ proj_codevector_dim: 256
681
+ remove_invalid_values: false
682
+ repetition_penalty: 1
683
+ return_dict: true
684
+ return_dict_in_generate: false
685
+ sep_token_id: null
686
+ suppress_tokens: null
687
+ task_specific_params: null
688
+ tdnn_dilation:
689
+ - 1
690
+ - 2
691
+ - 3
692
+ - 1
693
+ - 1
694
+ tdnn_dim:
695
+ - 512
696
+ - 512
697
+ - 512
698
+ - 512
699
+ - 1500
700
+ tdnn_kernel:
701
+ - 5
702
+ - 3
703
+ - 3
704
+ - 1
705
+ - 1
706
+ temperature: 1
707
+ tf_legacy_loss: false
708
+ tie_encoder_decoder: false
709
+ tie_word_embeddings: true
710
+ tokenizer_class: null
711
+ top_k: 50
712
+ top_p: 1
713
+ torch_dtype: float32
714
+ torchscript: false
715
+ typical_p: 1
716
+ use_bfloat16: false
717
+ use_weighted_layer_sum: false
718
+ vocab_size: 32
719
+ xvector_output_dim: 512
720
+ encoder_no_repeat_ngram_size:
721
+ value: 0
722
+ eos_token_id:
723
+ value: 2
724
+ eval_accumulation_steps:
725
+ value: null
726
+ eval_delay:
727
+ value: 0
728
+ eval_do_concat_batches:
729
+ value: true
730
+ eval_on_start:
731
+ value: false
732
+ eval_steps:
733
+ value: 1000
734
+ eval_strategy:
735
+ value: steps
736
+ eval_use_gather_object:
737
+ value: false
738
+ evaluation_strategy:
739
+ value: null
740
+ exponential_decay_length_penalty:
741
+ value: null
742
+ finetuning_task:
743
+ value: null
744
+ forced_bos_token_id:
745
+ value: null
746
+ forced_decoder_ids:
747
+ value: null
748
+ forced_eos_token_id:
749
+ value: null
750
+ fp16:
751
+ value: false
752
+ fp16_backend:
753
+ value: auto
754
+ fp16_full_eval:
755
+ value: false
756
+ fp16_opt_level:
757
+ value: O1
758
+ fsdp:
759
+ value: []
760
+ fsdp_config:
761
+ value:
762
+ min_num_params: 0
763
+ xla: false
764
+ xla_fsdp_grad_ckpt: false
765
+ xla_fsdp_v2: false
766
+ fsdp_min_num_params:
767
+ value: 0
768
+ fsdp_transformer_layer_cls_to_wrap:
769
+ value: null
770
+ full_determinism:
771
+ value: false
772
+ generation_config:
773
+ value: null
774
+ generation_max_length:
775
+ value: null
776
+ generation_num_beams:
777
+ value: null
778
+ gradient_accumulation_steps:
779
+ value: 1
780
+ gradient_checkpointing:
781
+ value: false
782
+ gradient_checkpointing_kwargs:
783
+ value: null
784
+ greater_is_better:
785
+ value: false
786
+ group_by_length:
787
+ value: false
788
+ half_precision_backend:
789
+ value: auto
790
+ hub_always_push:
791
+ value: false
792
+ hub_model_id:
793
+ value: null
794
+ hub_private_repo:
795
+ value: false
796
+ hub_strategy:
797
+ value: every_save
798
+ hub_token:
799
+ value: <HUB_TOKEN>
800
+ id2label:
801
+ value:
802
+ "0": LABEL_0
803
+ "1": LABEL_1
804
+ ignore_data_skip:
805
+ value: false
806
+ include_for_metrics:
807
+ value: []
808
+ include_inputs_for_metrics:
809
+ value: false
810
+ include_num_input_tokens_seen:
811
+ value: false
812
+ include_tokens_per_second:
813
+ value: false
814
+ is_decoder:
815
+ value: false
816
+ is_encoder_decoder:
817
+ value: true
818
+ jit_mode_eval:
819
+ value: false
820
+ label_names:
821
+ value: null
822
+ label_smoothing_factor:
823
+ value: 0.05
824
+ label2id:
825
+ value:
826
+ LABEL_0: 0
827
+ LABEL_1: 1
828
+ learning_rate:
829
+ value: 0.0001
830
+ length_column_name:
831
+ value: input_length
832
+ length_penalty:
833
+ value: 1
834
+ load_best_model_at_end:
835
+ value: true
836
+ local_rank:
837
+ value: 0
838
+ log_level:
839
+ value: passive
840
+ log_level_replica:
841
+ value: warning
842
+ log_on_each_node:
843
+ value: true
844
+ logging_dir:
845
+ value: ./seq2seq_wav2vec2_bart-base_24k-en-voxpopuli/t1_new1_spec/runs/May15_19-23-02_achjo
846
+ logging_first_step:
847
+ value: false
848
+ logging_nan_inf_filter:
849
+ value: true
850
+ logging_steps:
851
+ value: 10
852
+ logging_strategy:
853
+ value: steps
854
+ lr_scheduler_kwargs:
855
+ value:
856
+ min_lr: 5e-09
857
+ lr_scheduler_type:
858
+ value: cosine_with_min_lr
859
+ max_grad_norm:
860
+ value: 1
861
+ max_length:
862
+ value: null
863
+ max_steps:
864
+ value: -1
865
+ metric_for_best_model:
866
+ value: wer
867
+ min_length:
868
+ value: 0
869
+ model/num_parameters:
870
+ value: 201096832
871
+ model_type:
872
+ value: speech-encoder-decoder
873
+ mp_parameters:
874
+ value: ""
875
+ neftune_noise_alpha:
876
+ value: null
877
+ no_cuda:
878
+ value: false
879
+ no_repeat_ngram_size:
880
+ value: 0
881
+ num_beam_groups:
882
+ value: 1
883
+ num_beams:
884
+ value: 1
885
+ num_return_sequences:
886
+ value: 1
887
+ num_train_epochs:
888
+ value: 20
889
+ optim:
890
+ value: adamw_torch
891
+ optim_args:
892
+ value: null
893
+ optim_target_modules:
894
+ value: null
895
+ output_attentions:
896
+ value: false
897
+ output_dir:
898
+ value: ./seq2seq_wav2vec2_bart-base_24k-en-voxpopuli/t1_new1_spec
899
+ output_hidden_states:
900
+ value: false
901
+ output_scores:
902
+ value: false
903
+ overwrite_output_dir:
904
+ value: true
905
+ pad_token_id:
906
+ value: 1
907
+ past_index:
908
+ value: -1
909
+ per_device_eval_batch_size:
910
+ value: 96
911
+ per_device_train_batch_size:
912
+ value: 96
913
+ per_gpu_eval_batch_size:
914
+ value: null
915
+ per_gpu_train_batch_size:
916
+ value: null
917
+ predict_with_generate:
918
+ value: true
919
+ prediction_loss_only:
920
+ value: false
921
+ prefix:
922
+ value: null
923
+ problem_type:
924
+ value: null
925
+ processor_class:
926
+ value: Wav2Vec2Processor
927
+ push_to_hub:
928
+ value: false
929
+ push_to_hub_model_id:
930
+ value: null
931
+ push_to_hub_organization:
932
+ value: null
933
+ push_to_hub_token:
934
+ value: <PUSH_TO_HUB_TOKEN>
935
+ ray_scope:
936
+ value: last
937
+ remove_invalid_values:
938
+ value: false
939
+ remove_unused_columns:
940
+ value: true
941
+ repetition_penalty:
942
+ value: 1
943
+ report_to:
944
+ value:
945
+ - wandb
946
+ restore_callback_states_from_checkpoint:
947
+ value: false
948
+ resume_from_checkpoint:
949
+ value: null
950
+ return_dict:
951
+ value: true
952
+ return_dict_in_generate:
953
+ value: false
954
+ run_name:
955
+ value: facebook/voxpopuli_en_split-train_wav2vec2-bart_bs96_lr0.0001_ep20.0
956
+ save_on_each_node:
957
+ value: false
958
+ save_only_model:
959
+ value: false
960
+ save_safetensors:
961
+ value: true
962
+ save_steps:
963
+ value: 1000
964
+ save_strategy:
965
+ value: steps
966
+ save_total_limit:
967
+ value: 1
968
+ seed:
969
+ value: 42
970
+ sep_token_id:
971
+ value: null
972
+ skip_memory_metrics:
973
+ value: true
974
+ sortish_sampler:
975
+ value: false
976
+ split_batches:
977
+ value: null
978
+ suppress_tokens:
979
+ value: null
980
+ task_specific_params:
981
+ value: null
982
+ temperature:
983
+ value: 1
984
+ tf_legacy_loss:
985
+ value: false
986
+ tf32:
987
+ value: null
988
+ tie_encoder_decoder:
989
+ value: false
990
+ tie_word_embeddings:
991
+ value: false
992
+ tokenizer_class:
993
+ value: null
994
+ top_k:
995
+ value: 50
996
+ top_p:
997
+ value: 1
998
+ torch_compile:
999
+ value: false
1000
+ torch_compile_backend:
1001
+ value: null
1002
+ torch_compile_mode:
1003
+ value: null
1004
+ torch_dtype:
1005
+ value: float32
1006
+ torch_empty_cache_steps:
1007
+ value: null
1008
+ torchdynamo:
1009
+ value: null
1010
+ torchscript:
1011
+ value: false
1012
+ tpu_metrics_debug:
1013
+ value: false
1014
+ tpu_num_cores:
1015
+ value: null
1016
+ transformers_version:
1017
+ value: 4.46.3
1018
+ typical_p:
1019
+ value: 1
1020
+ use_bfloat16:
1021
+ value: false
1022
+ use_cache:
1023
+ value: false
1024
+ use_cpu:
1025
+ value: false
1026
+ use_ipex:
1027
+ value: false
1028
+ use_legacy_prediction_loop:
1029
+ value: false
1030
+ use_liger_kernel:
1031
+ value: false
1032
+ use_mps_device:
1033
+ value: false
1034
+ warmup_ratio:
1035
+ value: 0
1036
+ warmup_steps:
1037
+ value: 2000
1038
+ weight_decay:
1039
+ value: 0.01
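Reviewer note on the scheduler settings above (`learning_rate: 0.0001`, `lr_scheduler_type: cosine_with_min_lr`, `min_lr: 5e-09`, `warmup_steps: 2000`): the sketch below reproduces the general shape of a linear-warmup cosine schedule with a floor. It is not the library's implementation. The function name `cosine_with_min_lr` is hypothetical, a single half-cosine cycle is assumed, and `total_steps=34820` is taken from `train/global_step` in the run's `wandb-summary.json`.

```python
import math


def cosine_with_min_lr(step, base_lr=1e-4, min_lr=5e-9,
                       warmup_steps=2000, total_steps=34820):
    """Hypothetical helper: linear warmup, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr over the warmup steps.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Half-cosine factor: 1.0 right after warmup, 0.0 at the final step (assumed single cycle).
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    # Rescale so the factor bottoms out at min_lr / base_lr instead of 0.
    min_ratio = min_lr / base_lr
    return base_lr * (cosine * (1.0 - min_ratio) + min_ratio)


if __name__ == "__main__":
    for step in (0, 1000, 2000, 18410, 34820):
        print(step, cosine_with_min_lr(step))
```

Under these assumptions the rate at the last step equals the floor, which agrees with the `train/learning_rate` of `5e-09` recorded in the run summary.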
wandb/run-20250515_192303-7xkscxrj/files/media/table/model_speed2size1_table_3555_34483c9cf24b143db620.table.json ADDED
@@ -0,0 +1 @@
1
+ {"columns": ["Time per step", "Trainable parameters"], "data": [[0.054152575027465746, 196896384]]}
wandb/run-20250515_192303-7xkscxrj/files/media/table/model_speed2size2_table_3556_ffc3f22eaf8a279337f3.table.json ADDED
@@ -0,0 +1 @@
1
+ {"columns": ["Time per step", "Total parameters"], "data": [[0.054152575027465746, 201096832]]}
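Reviewer note on the two tables above: the gap between total and trainable parameters (201096832 vs. 196896384) can be cross-checked against the run's `--freeze_feature_encoder` flag. The sketch below counts the weights of the wav2vec2 convolutional feature encoder from the logged encoder config (`conv_dim`, `conv_kernel`, `conv_bias: False`, `feat_extract_norm: 'group'`). It assumes a mono input channel and a single affine group norm (weight plus bias) on the first conv layer only; `feature_encoder_params` is a hypothetical helper, not part of any library.

```python
# Values copied from the encoder config logged for this run.
CONV_DIM = [512, 512, 512, 512, 512, 512, 512]
CONV_KERNEL = [10, 3, 3, 3, 3, 2, 2]

TOTAL_PARAMS = 201096832       # "Total parameters" table
TRAINABLE_PARAMS = 196896384   # "Trainable parameters" table


def feature_encoder_params(conv_dim, conv_kernel, in_channels=1):
    """Hypothetical helper: count weights in a bias-free 1-D conv stack."""
    total = 0
    for layer, (out_channels, kernel) in enumerate(zip(conv_dim, conv_kernel)):
        # Conv1d weight without bias: in_channels * out_channels * kernel.
        total += in_channels * out_channels * kernel
        if layer == 0:
            # Assumption: affine group norm (weight + bias) on the first layer only.
            total += 2 * out_channels
        in_channels = out_channels
    return total


frozen = TOTAL_PARAMS - TRAINABLE_PARAMS
encoder = feature_encoder_params(CONV_DIM, CONV_KERNEL)
print(f"frozen parameters:          {frozen}")
print(f"feature encoder parameters: {encoder}")
```

If the two printed numbers agree, the frozen parameters are accounted for by the convolutional feature encoder alone, which is what `--freeze_feature_encoder` is expected to freeze.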
wandb/run-20250515_192303-7xkscxrj/files/output.log ADDED
The diff for this file is too large to render. See raw diff
 
wandb/run-20250515_192303-7xkscxrj/files/requirements.txt ADDED
@@ -0,0 +1,184 @@
1
+ seaborn==0.13.2
2
+ certifi==2024.12.14
3
+ aiohappyeyeballs==2.4.4
4
+ filelock==3.16.1
5
+ executing==2.1.0
6
+ nvidia-nvjitlink-cu12==12.4.127
7
+ SecretStorage==3.3.3
8
+ mkl_fft==1.3.11
9
+ installer==0.7.0
10
+ cycler==0.12.1
11
+ keyring==25.6.0
12
+ idna==3.10
13
+ mpmath==1.3.0
14
+ prompt_toolkit==3.0.48
15
+ urllib3==2.3.0
16
+ aiohttp==3.11.11
17
+ jaraco.classes==3.4.0
18
+ RapidFuzz==3.11.0
19
+ triton==3.1.0
20
+ click==8.1.8
21
+ regex==2024.11.6
22
+ joblib==1.4.2
23
+ pyparsing==3.2.0
24
+ attrs==24.3.0
25
+ typing_extensions==4.12.2
26
+ jedi==0.19.2
27
+ pkginfo==1.12.1.2
28
+ stack-data==0.6.3
29
+ huggingface-hub==0.27.0
30
+ multidict==6.1.0
31
+ fastjsonschema==2.21.1
32
+ cleo==2.1.0
33
+ pydantic_core==2.27.2
34
+ zipp==3.21.0
35
+ python-dateutil==2.9.0.post0
36
+ trove-classifiers==2025.5.1.12
37
+ contourpy==1.3.1
38
+ torchaudio==2.5.1
39
+ annotated-types==0.7.0
40
+ scikit-learn==1.6.0
41
+ lazy_loader==0.4
42
+ smmap==5.0.1
43
+ jiwer==3.0.5
44
+ requests==2.32.3
45
+ gitdb==4.0.11
46
+ numpy==2.0.2
47
+ sentry-sdk==2.19.2
48
+ gmpy2==2.2.1
49
+ sniffio==1.3.1
50
+ build==1.2.2.post1
51
+ nvidia-nvtx-cu12==12.4.127
52
+ nvidia-nccl-cu12==2.21.5
53
+ traitlets==5.14.3
54
+ nvidia-cuda-runtime-cu12==12.4.127
55
+ pillow==11.0.0
56
+ packaging==24.2
57
+ jeepney==0.9.0
58
+ pexpect==4.9.0
59
+ accelerate==1.4.0
60
+ httpx==0.28.1
61
+ jaraco.context==6.0.1
62
+ multiprocess==0.70.16
63
+ torchvision==0.20.1
64
+ virtualenv==20.31.0
65
+ nvidia-cufft-cu12==11.2.1.3
66
+ xxhash==3.5.0
67
+ sympy==1.13.1
68
+ tqdm==4.67.1
69
+ pbs-installer==2025.4.9
70
+ wheel==0.45.1
71
+ pyzmq==26.2.0
72
+ pyarrow==18.1.0
73
+ importlib_metadata==8.7.0
74
+ pure_eval==0.2.3
75
+ tomlkit==0.13.2
76
+ pandas==2.2.3
77
+ safetensors==0.4.5
78
+ crashtest==0.4.1
79
+ propcache==0.2.1
80
+ comm==0.2.2
81
+ ipython==8.31.0
82
+ protobuf==5.29.2
83
+ mkl-service==2.4.0
84
+ cffi==1.17.1
85
+ PySocks==1.7.1
86
+ networkx==3.4.2
87
+ poetry==2.1.3
88
+ debugpy==1.8.11
89
+ GitPython==3.1.43
90
+ pyproject_hooks==1.2.0
91
+ ptyprocess==0.7.0
92
+ requests-toolbelt==1.0.0
93
+ setproctitle==1.3.4
94
+ ipykernel==6.29.5
95
+ pydantic-settings==2.7.0
96
+ nvidia-cuda-cupti-cu12==12.4.127
97
+ threadpoolctl==3.5.0
98
+ jaraco.functools==4.1.0
99
+ tokenizers==0.20.3
100
+ python-dotenv==1.0.1
101
+ numba==0.60.0
102
+ dill==0.3.8
103
+ msgpack==1.1.0
104
+ tzdata==2024.2
105
+ audioread==3.0.1
106
+ pip==25.1
107
+ nvidia-cusolver-cu12==11.6.1.9
108
+ yarl==1.18.3
109
+ pydantic==2.10.4
110
+ shellingham==1.5.4
111
+ librosa==0.10.2.post1
112
+ Pygments==2.18.0
113
+ docker-pycreds==0.4.0
114
+ fsspec==2024.9.0
115
+ anyio==4.9.0
116
+ fonttools==4.55.3
117
+ more-itertools==10.7.0
118
+ tornado==6.4.2
119
+ backports.tarfile==1.2.0
120
+ transformers==4.46.3
121
+ dulwich==0.22.8
122
+ psutil==6.1.1
123
+ nvidia-cuda-nvrtc-cu12==12.4.127
124
+ six==1.17.0
125
+ wcwidth==0.2.13
126
+ asttokens==3.0.0
127
+ platformdirs==4.3.6
128
+ jupyter_client==8.6.3
129
+ pytz==2024.2
130
+ decorator==5.1.1
131
+ nvidia-cublas-cu12==12.4.5.8
132
+ matplotlib==3.10.0
133
+ pooch==1.8.2
134
+ aiosignal==1.3.2
135
+ httpcore==1.0.9
136
+ Brotli==1.0.9
137
+ parso==0.8.4
138
+ nvidia-cusparse-cu12==12.3.1.170
139
+ Jinja2==3.1.5
140
+ datasets==3.5.1
141
+ poetry-core==2.1.3
142
+ PyYAML==6.0.2
143
+ MarkupSafe==3.0.2
144
+ mkl_random==1.2.8
145
+ evaluate==0.4.3
146
+ matplotlib-inline==0.1.7
147
+ frozenlist==1.5.0
148
+ kiwisolver==1.4.7
149
+ zstandard==0.23.0
150
+ nvidia-curand-cu12==10.3.5.147
151
+ soxr==0.5.0.post1
152
+ CacheControl==0.14.3
153
+ soundfile==0.12.1
154
+ h11==0.16.0
155
+ jupyter_core==5.7.2
156
+ pycparser==2.22
157
+ nvidia-cudnn-cu12==9.1.0.70
158
+ peft==0.14.0
159
+ scipy==1.14.1
160
+ wandb==0.19.7
161
+ charset-normalizer==3.4.0
162
+ cryptography==44.0.3
163
+ distlib==0.3.9
164
+ findpython==0.6.3
165
+ setuptools==75.6.0
166
+ torch==2.5.1
167
+ llvmlite==0.43.0
168
+ nest-asyncio==1.6.0
169
+ more-itertools==10.3.0
170
+ inflect==7.3.1
171
+ typing_extensions==4.12.2
172
+ jaraco.context==5.3.0
173
+ tomli==2.0.1
174
+ platformdirs==4.2.2
175
+ zipp==3.19.2
176
+ jaraco.functools==4.0.1
177
+ packaging==24.2
178
+ typeguard==4.3.0
179
+ wheel==0.43.0
180
+ autocommand==2.2.2
181
+ backports.tarfile==1.2.0
182
+ jaraco.collections==5.1.0
183
+ jaraco.text==3.12.1
184
+ importlib_metadata==8.0.0
wandb/run-20250515_192303-7xkscxrj/files/wandb-metadata.json ADDED
@@ -0,0 +1,96 @@
1
+ {
2
+ "os": "Linux-5.15.0-1079-azure-x86_64-with-glibc2.31",
3
+ "python": "CPython 3.11.11",
4
+ "startedAt": "2025-05-15T19:23:03.492613Z",
5
+ "args": [
6
+ "--dataset_name=facebook/voxpopuli",
7
+ "--model_name_or_path=./seq2seq_wav2vec2_bart-base_24k-en-voxpopuli",
8
+ "--dataset_config_name=en",
9
+ "--train_split_name=train",
10
+ "--eval_split_name=validation",
11
+ "--test_split_name=test",
12
+ "--output_dir=./seq2seq_wav2vec2_bart-base_24k-en-voxpopuli/t1_new1_spec",
13
+ "--preprocessing_num_workers=1",
14
+ "--dataloader_num_workers=16",
15
+ "--dataloader_prefetch_factor=2",
16
+ "--length_column_name=input_length",
17
+ "--overwrite_output_dir",
18
+ "--num_train_epochs=20",
19
+ "--per_device_train_batch_size=96",
20
+ "--per_device_eval_batch_size=96",
21
+ "--gradient_accumulation_steps=1",
22
+ "--learning_rate=1e-4",
23
+ "--label_smoothing_factor=0.05",
24
+ "--apply_spec_augment",
25
+ "--mask_time_prob=0.25",
26
+ "--mask_time_length=30",
27
+ "--mask_time_min_masks=2",
28
+ "--mask_feature_prob=0.3",
29
+ "--mask_feature_length=30",
30
+ "--mask_feature_min_masks=1",
31
+ "--weight_decay=0.01",
32
+ "--lr_scheduler_type=cosine_with_min_lr",
33
+ "--lr_scheduler_kwargs={\"min_lr\": 5e-9}",
34
+ "--warmup_steps=2000",
35
+ "--eval_strategy=steps",
36
+ "--text_column_name=normalized_text",
37
+ "--save_strategy=steps",
38
+ "--eval_steps=1000",
39
+ "--save_steps=1000",
40
+ "--load_best_model_at_end",
41
+ "--metric_for_best_model=wer",
42
+ "--greater_is_better=False",
43
+ "--logging_steps=10",
44
+ "--save_total_limit=1",
45
+ "--freeze_feature_encoder",
46
+ "--bf16",
47
+ "--task=transcribe",
48
+ "--predict_with_generate",
49
+ "--do_train",
50
+ "--do_eval",
51
+ "--do_predict",
52
+ "--do_lower_case",
53
+ "--trust_remote_code",
54
+ "--report_to=wandb",
55
+ "--sclite_path=/home/azureuser/media-disk/mh_dp/SCTK/bin/sclite",
56
+ "--wandb_project=seq2seq_encoder-decoder_vox",
57
+ "--cache_dir=/home/azureuser/media-disk/mh_dp/preprocessed_dataset_voxpopuli"
58
+ ],
59
+ "program": "/media/disk/mh_dp/run_speech_recognition_seq2seq.py",
60
+ "codePath": "run_speech_recognition_seq2seq.py",
61
+ "git": {
62
+ "remote": "https://github.com/hornikmatej/thesis_mit.git",
63
+ "commit": "f785b399a218c31f74efa57fa6057a8f5848df90"
64
+ },
65
+ "email": "[email protected]",
66
+ "root": "/media/disk/mh_dp",
67
+ "host": "achjo",
68
+ "executable": "/media/disk/conda-envs/mh_dp/bin/python",
69
+ "codePathLocal": "run_speech_recognition_seq2seq.py",
70
+ "cpu_count": 24,
71
+ "cpu_count_logical": 24,
72
+ "gpu": "NVIDIA A100 80GB PCIe",
73
+ "gpu_count": 1,
74
+ "disk": {
75
+ "/": {
76
+ "total": "126759518208",
77
+ "used": "121216040960"
78
+ }
79
+ },
80
+ "memory": {
81
+ "total": "232206929920"
82
+ },
83
+ "cpu": {
84
+ "count": 24,
85
+ "countLogical": 24
86
+ },
87
+ "gpu_nvidia": [
88
+ {
89
+ "name": "NVIDIA A100 80GB PCIe",
90
+ "memoryTotal": "85899345920",
91
+ "cudaCores": 6912,
92
+ "architecture": "Ampere"
93
+ }
94
+ ],
95
+ "cudaVersion": "12.4"
96
+ }
wandb/run-20250515_192303-7xkscxrj/files/wandb-summary.json ADDED
@@ -0,0 +1 @@
1
+ {"eval/loss":1.0561457872390747,"train_samples_per_second":93.771,"train_runtime":35628.5116,"deletions":2.1,"train/grad_norm":3.1310455799102783,"sentence_errors":69.79,"eval/dev_wer":0.08554638942253362,"train/learning_rate":5e-09,"eval/dev_runtime":121.5437,"eval/test_runtime":132.2526,"train/epoch":20,"eval/test_samples_per_second":12.892,"eval/wer":0.08608317323991412,"eval/runtime":122.4086,"_runtime":35917.482322344,"model_speed2size1_table":{"artifact_path":"wandb-client-artifact://22ryvcq5sfoxrkdlg5snrezdqeux1vlyv16ngbo07h3jbzffqjwebkvyz3zmsi2e0yxfoypfkb3m9qh2gmzp1wad6w8ijltxguisyd9kp4eji86yqh2tjqpzuu78dv8a/model_speed2size1_table.table.json","_latest_artifact_path":"wandb-client-artifact://bmhjxqeluon6nds3foy3om07pd3ur8vjndrhq5f4l4sfxeikus6kk0lsyxy74p6omxe5ipufpt1xab1spvmr9j535tlw1jd59813g4tl3ud02gtm9j0wv7ipfp9b61wq:latest/model_speed2size1_table.table.json","path":"media/table/model_speed2size1_table_3555_34483c9cf24b143db620.table.json","ncols":2,"nrows":1,"_type":"table-file","sha256":"34483c9cf24b143db6206cc8a337dd90c28037eadcc75c16498b0e74c09b0a94","size":99},"eval/dev_steps_per_second":0.14,"train/global_step":34820,"eval/dev_loss":1.0564184188842773,"_step":3557,"_timestamp":1.7473729009748125e+09,"train_loss":1.6298207611684346,"eval/dev_samples_per_second":13.09,"_wandb":{"runtime":35917},"train/loss":1.1786,"substitutions":4.88,"eval/test_wer":0.08848048503220916,"word_accuracy":93.02,"train_steps_per_second":0.977,"eval/steps_per_second":0.139,"eval/samples_per_second":12.997,"model_speed2size2_table":{"_latest_artifact_path":"wandb-client-artifact://li63ohwsnagsgukuzxxoedjq23zmywgayy9s567hfvm8iuj19u9laeb1yfyt7s5gx88u27qlfs25bna6xx34hq9yv5rrhz6pa1s7mu8ilv3xwtp7gu06fdhb5anlvjg9:latest/model_speed2size2_table.table.json","path":"media/table/model_speed2size2_table_3556_ffc3f22eaf8a279337f3.table.json","ncols":2,"nrows":1,"_type":"table-file","sha256":"ffc3f22eaf8a279337f31351ecc40b7b10ad3fc2530ff63bb94dfd99cf1707b4","size":95,"artifact_path":"wandb-client-artifact://zs3kneyjnk8zwtj296er3yaztikcv5ro8nl4jnqbua1dzk6zoaan1kxcdfuhwv9us50mlrguj0mgrvntft5p6iuju1qayquwa3b3l93g58r7o5ifatkwzdkuj0bq8y16/model_speed2size2_table.table.json"},"total_flos":0,"eval/test_loss":1.0758554935455322,"test_sample_index":111,"word_errors":8.84,"insertions":1.86,"eval/test_steps_per_second":0.136}
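Reviewer note on the summary above: the error-breakdown fields can be checked for internal consistency. The sketch below assumes that `substitutions`, `deletions`, `insertions`, `word_errors` and `word_accuracy` are percentages reported by the sclite scorer passed via `--sclite_path`, that they refer to the test split, and that word accuracy excludes insertions. None of this is stated in the log itself, so treat the check as illustrative.

```python
import math

# Values copied from wandb-summary.json for this run.
summary = {
    "substitutions": 4.88,
    "deletions": 2.1,
    "insertions": 1.86,
    "word_errors": 8.84,
    "word_accuracy": 93.02,
    "eval/test_wer": 0.08848048503220916,
}

# Assumed convention: word error rate (%) = substitutions + deletions + insertions.
error_rate = summary["substitutions"] + summary["deletions"] + summary["insertions"]

# Assumed convention: word accuracy (%) = 100 - substitutions - deletions.
accuracy = 100.0 - summary["substitutions"] - summary["deletions"]

print(f"S + D + I   = {error_rate:.2f} (logged word_errors: {summary['word_errors']})")
print(f"100 - S - D = {accuracy:.2f} (logged word_accuracy: {summary['word_accuracy']})")
print(f"eval/test_wer as a percentage: {100 * summary['eval/test_wer']:.3f}")

# The breakdown should match the logged totals up to rounding.
assert math.isclose(error_rate, summary["word_errors"], abs_tol=0.005)
assert math.isclose(accuracy, summary["word_accuracy"], abs_tol=0.005)
```

The `eval/test_wer` value computed during evaluation lands close to the breakdown total but not exactly on it, so the two numbers may come from slightly different scoring or rounding paths.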
wandb/run-20250515_192303-7xkscxrj/logs/debug-core.log ADDED
@@ -0,0 +1,15 @@
1
+ {"time":"2025-05-15T19:23:02.990833788Z","level":"INFO","msg":"main: starting server","port-filename":"/tmp/tmpi1e880he/port-4153506.txt","pid":4153506,"log-level":0,"disable-analytics":false,"shutdown-on-parent-exit":false}
2
+ {"time":"2025-05-15T19:23:02.992445129Z","level":"INFO","msg":"Will exit if parent process dies.","ppid":4153506}
3
+ {"time":"2025-05-15T19:23:02.992359253Z","level":"INFO","msg":"server is running","addr":{"IP":"127.0.0.1","Port":42841,"Zone":""}}
4
+ {"time":"2025-05-15T19:23:03.146123389Z","level":"INFO","msg":"connection: ManageConnectionData: new connection created","id":"127.0.0.1:33852"}
5
+ {"time":"2025-05-15T19:23:03.494056011Z","level":"INFO","msg":"handleInformInit: received","streamId":"7xkscxrj","id":"127.0.0.1:33852"}
6
+ {"time":"2025-05-15T19:23:03.597616412Z","level":"INFO","msg":"handleInformInit: stream started","streamId":"7xkscxrj","id":"127.0.0.1:33852"}
7
+ {"time":"2025-05-16T05:21:43.113058108Z","level":"INFO","msg":"handleInformFinish: finish message received","streamId":"7xkscxrj","id":"127.0.0.1:33852"}
8
+ {"time":"2025-05-16T05:21:43.113219442Z","level":"INFO","msg":"handleInformFinish: stream closed","streamId":"7xkscxrj","id":"127.0.0.1:33852"}
9
+ {"time":"2025-05-16T05:21:43.15553858Z","level":"INFO","msg":"handleInformTeardown: server teardown initiated","id":"127.0.0.1:33852"}
10
+ {"time":"2025-05-16T05:21:43.155584566Z","level":"INFO","msg":"handleInformTeardown: server shutdown complete","id":"127.0.0.1:33852"}
11
+ {"time":"2025-05-16T05:21:43.155594696Z","level":"INFO","msg":"server is shutting down"}
12
+ {"time":"2025-05-16T05:21:43.155632972Z","level":"INFO","msg":"connection: closing","id":"127.0.0.1:33852"}
13
+ {"time":"2025-05-16T05:21:43.155728782Z","level":"INFO","msg":"connection: closed successfully","id":"127.0.0.1:33852"}
14
+ {"time":"2025-05-16T05:21:43.155739402Z","level":"INFO","msg":"connection: ManageConnectionData: connection closed","id":"127.0.0.1:33852"}
15
+ {"time":"2025-05-16T05:21:43.155754471Z","level":"INFO","msg":"server is closed"}
wandb/run-20250515_192303-7xkscxrj/logs/debug-internal.log ADDED
@@ -0,0 +1,17 @@
1
+ {"time":"2025-05-15T19:23:03.494323592Z","level":"INFO","msg":"stream: starting","core version":"0.19.7","symlink path":"/media/disk/mh_dp/wandb/run-20250515_192303-7xkscxrj/logs/debug-core.log"}
2
+ {"time":"2025-05-15T19:23:03.597586687Z","level":"INFO","msg":"created new stream","id":"7xkscxrj"}
3
+ {"time":"2025-05-15T19:23:03.597610471Z","level":"INFO","msg":"stream: started","id":"7xkscxrj"}
4
+ {"time":"2025-05-15T19:23:03.597663725Z","level":"INFO","msg":"writer: Do: started","stream_id":"7xkscxrj"}
5
+ {"time":"2025-05-15T19:23:03.597740644Z","level":"INFO","msg":"sender: started","stream_id":"7xkscxrj"}
6
+ {"time":"2025-05-15T19:23:03.597802771Z","level":"INFO","msg":"handler: started","stream_id":"7xkscxrj"}
7
+ {"time":"2025-05-15T19:23:03.929230983Z","level":"INFO","msg":"Starting system monitor"}
8
+ {"time":"2025-05-15T21:30:19.314744757Z","level":"INFO","msg":"api: retrying HTTP error","status":502,"url":"https://api.wandb.ai/files/xhorni20-fitvut/seq2seq_encoder-decoder_vox/7xkscxrj/file_stream","body":"\n<html><head>\n<meta http-equiv=\"content-type\" content=\"text/html;charset=utf-8\">\n<title>502 Server Error</title>\n</head>\n<body text=#000000 bgcolor=#ffffff>\n<h1>Error: Server Error</h1>\n<h2>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds.</h2>\n<h2></h2>\n</body></html>\n"}
9
+ {"time":"2025-05-16T05:21:40.975811644Z","level":"INFO","msg":"Stopping system monitor"}
10
+ {"time":"2025-05-16T05:21:40.97633554Z","level":"INFO","msg":"Stopped system monitor"}
11
+ {"time":"2025-05-16T05:21:41.958538892Z","level":"INFO","msg":"fileTransfer: Close: file transfer manager closed"}
12
+ {"time":"2025-05-16T05:21:41.977317704Z","level":"INFO","msg":"handler: operation stats","stats":{"operations":[{"desc":"uploading history steps 3555-3557, summary, console lines 6010-6016","runtime_seconds":0.018694493}],"total_operations":1}}
13
+ {"time":"2025-05-16T05:21:43.113094437Z","level":"INFO","msg":"stream: closing","id":"7xkscxrj"}
14
+ {"time":"2025-05-16T05:21:43.113123361Z","level":"INFO","msg":"handler: closed","stream_id":"7xkscxrj"}
15
+ {"time":"2025-05-16T05:21:43.113134703Z","level":"INFO","msg":"writer: Close: closed","stream_id":"7xkscxrj"}
16
+ {"time":"2025-05-16T05:21:43.113199044Z","level":"INFO","msg":"sender: closed","stream_id":"7xkscxrj"}
17
+ {"time":"2025-05-16T05:21:43.113213511Z","level":"INFO","msg":"stream: closed","id":"7xkscxrj"}
wandb/run-20250515_192303-7xkscxrj/logs/debug.log ADDED
@@ -0,0 +1,35 @@
1
+ 2025-05-15 19:23:03,488 INFO MainThread:4153506 [wandb_setup.py:_flush():67] Current SDK version is 0.19.7
2
+ 2025-05-15 19:23:03,489 INFO MainThread:4153506 [wandb_setup.py:_flush():67] Configure stats pid to 4153506
3
+ 2025-05-15 19:23:03,489 INFO MainThread:4153506 [wandb_setup.py:_flush():67] Loading settings from /home/azureuser/.config/wandb/settings
4
+ 2025-05-15 19:23:03,489 INFO MainThread:4153506 [wandb_setup.py:_flush():67] Loading settings from /media/disk/mh_dp/wandb/settings
5
+ 2025-05-15 19:23:03,489 INFO MainThread:4153506 [wandb_setup.py:_flush():67] Loading settings from environment variables
6
+ 2025-05-15 19:23:03,489 INFO MainThread:4153506 [wandb_init.py:setup_run_log_directory():647] Logging user logs to /media/disk/mh_dp/wandb/run-20250515_192303-7xkscxrj/logs/debug.log
7
+ 2025-05-15 19:23:03,489 INFO MainThread:4153506 [wandb_init.py:setup_run_log_directory():648] Logging internal logs to /media/disk/mh_dp/wandb/run-20250515_192303-7xkscxrj/logs/debug-internal.log
8
+ 2025-05-15 19:23:03,489 INFO MainThread:4153506 [wandb_init.py:init():761] calling init triggers
9
+ 2025-05-15 19:23:03,489 INFO MainThread:4153506 [wandb_init.py:init():766] wandb.init called with sweep_config: {}
10
+ config: {'_wandb': {}}
11
+ 2025-05-15 19:23:03,489 INFO MainThread:4153506 [wandb_init.py:init():784] starting backend
12
+ 2025-05-15 19:23:03,489 INFO MainThread:4153506 [wandb_init.py:init():788] sending inform_init request
13
+ 2025-05-15 19:23:03,492 INFO MainThread:4153506 [backend.py:_multiprocessing_setup():97] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
14
+ 2025-05-15 19:23:03,492 INFO MainThread:4153506 [wandb_init.py:init():803] backend started and connected
15
+ 2025-05-15 19:23:03,493 INFO MainThread:4153506 [wandb_init.py:init():896] updated telemetry
16
+ 2025-05-15 19:23:03,498 INFO MainThread:4153506 [wandb_init.py:init():920] communicating run to backend with 90.0 second timeout
17
+ 2025-05-15 19:23:03,927 INFO MainThread:4153506 [wandb_init.py:init():995] starting run threads in backend
18
+ 2025-05-15 19:23:04,024 INFO MainThread:4153506 [wandb_run.py:_console_start():2377] atexit reg
19
+ 2025-05-15 19:23:04,024 INFO MainThread:4153506 [wandb_run.py:_redirect():2227] redirect: wrap_raw
20
+ 2025-05-15 19:23:04,024 INFO MainThread:4153506 [wandb_run.py:_redirect():2292] Wrapping output streams.
21
+ 2025-05-15 19:23:04,024 INFO MainThread:4153506 [wandb_run.py:_redirect():2317] Redirects installed.
22
+ 2025-05-15 19:23:04,026 INFO MainThread:4153506 [wandb_init.py:init():1037] run started, returning control to user process
23
+ 2025-05-15 19:23:07,838 INFO MainThread:4153506 [wandb_run.py:_config_callback():1261] config_cb None None {'return_dict': True, 'output_hidden_states': False, 'output_attentions': False, 'torchscript': False, 'torch_dtype': 'float32', 'use_bfloat16': False, 'tf_legacy_loss': False, 'pruned_heads': {}, 'tie_word_embeddings': False, 'chunk_size_feed_forward': 0, 'is_encoder_decoder': True, 'is_decoder': False, 'cross_attention_hidden_size': None, 'add_cross_attention': False, 'tie_encoder_decoder': False, 'max_length': None, 'min_length': 0, 'do_sample': False, 'early_stopping': False, 'num_beams': 1, 'num_beam_groups': 1, 'diversity_penalty': 0.0, 'temperature': 1.0, 'top_k': 50, 'top_p': 1.0, 'typical_p': 1.0, 'repetition_penalty': 1.0, 'length_penalty': 1.0, 'no_repeat_ngram_size': 0, 'encoder_no_repeat_ngram_size': 0, 'bad_words_ids': None, 'num_return_sequences': 1, 'output_scores': False, 'return_dict_in_generate': False, 'forced_bos_token_id': None, 'forced_eos_token_id': None, 'remove_invalid_values': False, 'exponential_decay_length_penalty': None, 'suppress_tokens': None, 'begin_suppress_tokens': None, 'architectures': ['SpeechEncoderDecoderModel'], 'finetuning_task': None, 'id2label': {0: 'LABEL_0', 1: 'LABEL_1'}, 'label2id': {'LABEL_0': 0, 'LABEL_1': 1}, 'tokenizer_class': None, 'prefix': None, 'bos_token_id': None, 'pad_token_id': 1, 'eos_token_id': 2, 'sep_token_id': None, 'decoder_start_token_id': 0, 'task_specific_params': None, 'problem_type': None, '_name_or_path': './seq2seq_wav2vec2_bart-base_24k-en-voxpopuli', '_attn_implementation_autoset': True, 'transformers_version': '4.46.3', 'decoder': {'vocab_size': 50265, 'max_position_embeddings': 1024, 'd_model': 768, 'encoder_ffn_dim': 3072, 'encoder_layers': 6, 'encoder_attention_heads': 12, 'decoder_ffn_dim': 3072, 'decoder_layers': 6, 'decoder_attention_heads': 12, 'dropout': 0.1, 'attention_dropout': 0.1, 'activation_dropout': 0.1, 'activation_function': 'gelu', 'init_std': 0.02, 
'encoder_layerdrop': 0.0, 'decoder_layerdrop': 0.0, 'classifier_dropout': 0.0, 'use_cache': True, 'num_hidden_layers': 6, 'scale_embedding': False, 'return_dict': True, 'output_hidden_states': False, 'output_attentions': False, 'torchscript': False, 'torch_dtype': 'float32', 'use_bfloat16': False, 'tf_legacy_loss': False, 'pruned_heads': {}, 'tie_word_embeddings': True, 'chunk_size_feed_forward': 0, 'is_encoder_decoder': False, 'is_decoder': True, 'cross_attention_hidden_size': None, 'add_cross_attention': True, 'tie_encoder_decoder': False, 'max_length': 20, 'min_length': 0, 'do_sample': False, 'early_stopping': True, 'num_beams': 4, 'num_beam_groups': 1, 'diversity_penalty': 0.0, 'temperature': 1.0, 'top_k': 50, 'top_p': 1.0, 'typical_p': 1.0, 'repetition_penalty': 1.0, 'length_penalty': 1.0, 'no_repeat_ngram_size': 3, 'encoder_no_repeat_ngram_size': 0, 'bad_words_ids': None, 'num_return_sequences': 1, 'output_scores': False, 'return_dict_in_generate': False, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2, 'remove_invalid_values': False, 'exponential_decay_length_penalty': None, 'suppress_tokens': None, 'begin_suppress_tokens': None, 'architectures': ['BartModel'], 'finetuning_task': None, 'id2label': {0: 'LABEL_0', 1: 'LABEL_1', 2: 'LABEL_2'}, 'label2id': {'LABEL_0': 0, 'LABEL_1': 1, 'LABEL_2': 2}, 'tokenizer_class': None, 'prefix': None, 'bos_token_id': 0, 'pad_token_id': 1, 'eos_token_id': 2, 'sep_token_id': None, 'decoder_start_token_id': 2, 'task_specific_params': {'summarization': {'length_penalty': 1.0, 'max_length': 128, 'min_length': 12, 'num_beams': 4}, 'summarization_cnn': {'length_penalty': 2.0, 'max_length': 142, 'min_length': 56, 'num_beams': 4}, 'summarization_xsum': {'length_penalty': 1.0, 'max_length': 62, 'min_length': 11, 'num_beams': 6}}, 'problem_type': None, '_name_or_path': 'facebook/bart-base', '_attn_implementation_autoset': True, 'add_bias_logits': False, 'add_final_layer_norm': False, 'classif_dropout': 0.1, 
'gradient_checkpointing': False, 'normalize_before': False, 'normalize_embedding': True, 'model_type': 'bart'}, 'encoder': {'return_dict': True, 'output_hidden_states': False, 'output_attentions': False, 'torchscript': False, 'torch_dtype': 'float32', 'use_bfloat16': False, 'tf_legacy_loss': False, 'pruned_heads': {}, 'tie_word_embeddings': True, 'chunk_size_feed_forward': 0, 'is_encoder_decoder': False, 'is_decoder': False, 'cross_attention_hidden_size': None, 'add_cross_attention': False, 'tie_encoder_decoder': False, 'max_length': 20, 'min_length': 0, 'do_sample': False, 'early_stopping': False, 'num_beams': 1, 'num_beam_groups': 1, 'diversity_penalty': 0.0, 'temperature': 1.0, 'top_k': 50, 'top_p': 1.0, 'typical_p': 1.0, 'repetition_penalty': 1.0, 'length_penalty': 1.0, 'no_repeat_ngram_size': 0, 'encoder_no_repeat_ngram_size': 0, 'bad_words_ids': None, 'num_return_sequences': 1, 'output_scores': False, 'return_dict_in_generate': False, 'forced_bos_token_id': None, 'forced_eos_token_id': None, 'remove_invalid_values': False, 'exponential_decay_length_penalty': None, 'suppress_tokens': None, 'begin_suppress_tokens': None, 'architectures': ['Wav2Vec2ForPreTraining'], 'finetuning_task': None, 'id2label': {0: 'LABEL_0', 1: 'LABEL_1'}, 'label2id': {'LABEL_0': 0, 'LABEL_1': 1}, 'tokenizer_class': None, 'prefix': None, 'bos_token_id': 1, 'pad_token_id': 0, 'eos_token_id': 2, 'sep_token_id': None, 'decoder_start_token_id': None, 'task_specific_params': None, 'problem_type': None, '_name_or_path': 'facebook/wav2vec2-base-en-voxpopuli-v2', '_attn_implementation_autoset': True, 'freeze_feat_extract_train': True, 'mask_channel_length': 10, 'mask_channel_min_space': 1, 'mask_channel_other': 0.0, 'mask_channel_prob': 0.0, 'mask_channel_selection': 'static', 'mask_time_min_space': 1, 'mask_time_other': 0.0, 'mask_time_selection': 'static', 'no_mask_channel_overlap': False, 'no_mask_time_overlap': False, 'num_feat_extract_layers': 7, 'hidden_size': 768, 'feat_extract_norm': 
'group', 'feat_extract_activation': 'gelu', 'conv_dim': [512, 512, 512, 512, 512, 512, 512], 'conv_stride': [5, 2, 2, 2, 2, 2, 2], 'conv_kernel': [10, 3, 3, 3, 3, 2, 2], 'conv_bias': False, 'num_conv_pos_embeddings': 128, 'num_conv_pos_embedding_groups': 16, 'num_hidden_layers': 12, 'intermediate_size': 3072, 'hidden_act': 'gelu', 'num_attention_heads': 12, 'hidden_dropout': 0.1, 'attention_dropout': 0.1, 'activation_dropout': 0.0, 'feat_proj_dropout': 0.0, 'final_dropout': 0.0, 'layerdrop': 0.0, 'layer_norm_eps': 1e-05, 'initializer_range': 0.02, 'vocab_size': 32, 'do_stable_layer_norm': False, 'use_weighted_layer_sum': False, 'apply_spec_augment': True, 'mask_time_prob': 0.25, 'mask_time_length': 30, 'mask_time_min_masks': 2, 'mask_feature_prob': 0.3, 'mask_feature_length': 30, 'mask_feature_min_masks': 1, 'num_codevectors_per_group': 320, 'num_codevector_groups': 2, 'contrastive_logits_temperature': 0.1, 'feat_quantizer_dropout': 0.0, 'num_negatives': 100, 'codevector_dim': 256, 'proj_codevector_dim': 256, 'diversity_loss_weight': 0.1, 'ctc_loss_reduction': 'sum', 'ctc_zero_infinity': False, 'add_adapter': True, 'adapter_kernel_size': 3, 'adapter_stride': 2, 'num_adapter_layers': 3, 'output_hidden_size': 768, 'adapter_attn_dim': None, 'classifier_proj_size': 256, 'tdnn_dim': [512, 512, 512, 512, 1500], 'tdnn_kernel': [5, 3, 3, 1, 1], 'tdnn_dilation': [1, 2, 3, 1, 1], 'xvector_output_dim': 512, 'model_type': 'wav2vec2'}, 'model_type': 'speech-encoder-decoder', 'processor_class': 'Wav2Vec2Processor', 'use_cache': False, 'forced_decoder_ids': None, 'output_dir': './seq2seq_wav2vec2_bart-base_24k-en-voxpopuli/t1_new1_spec', 'overwrite_output_dir': True, 'do_train': True, 'do_eval': True, 'do_predict': True, 'eval_strategy': 'steps', 'prediction_loss_only': False, 'per_device_train_batch_size': 96, 'per_device_eval_batch_size': 96, 'per_gpu_train_batch_size': None, 'per_gpu_eval_batch_size': None, 'gradient_accumulation_steps': 1, 'eval_accumulation_steps': None, 
'eval_delay': 0, 'torch_empty_cache_steps': None, 'learning_rate': 0.0001, 'weight_decay': 0.01, 'adam_beta1': 0.9, 'adam_beta2': 0.999, 'adam_epsilon': 1e-08, 'max_grad_norm': 1.0, 'num_train_epochs': 20.0, 'max_steps': -1, 'lr_scheduler_type': 'cosine_with_min_lr', 'lr_scheduler_kwargs': {'min_lr': 5e-09}, 'warmup_ratio': 0.0, 'warmup_steps': 2000, 'log_level': 'passive', 'log_level_replica': 'warning', 'log_on_each_node': True, 'logging_dir': './seq2seq_wav2vec2_bart-base_24k-en-voxpopuli/t1_new1_spec/runs/May15_19-23-02_achjo', 'logging_strategy': 'steps', 'logging_first_step': False, 'logging_steps': 10, 'logging_nan_inf_filter': True, 'save_strategy': 'steps', 'save_steps': 1000, 'save_total_limit': 1, 'save_safetensors': True, 'save_on_each_node': False, 'save_only_model': False, 'restore_callback_states_from_checkpoint': False, 'no_cuda': False, 'use_cpu': False, 'use_mps_device': False, 'seed': 42, 'data_seed': None, 'jit_mode_eval': False, 'use_ipex': False, 'bf16': True, 'fp16': False, 'fp16_opt_level': 'O1', 'half_precision_backend': 'auto', 'bf16_full_eval': False, 'fp16_full_eval': False, 'tf32': None, 'local_rank': 0, 'ddp_backend': None, 'tpu_num_cores': None, 'tpu_metrics_debug': False, 'debug': [], 'dataloader_drop_last': False, 'eval_steps': 1000, 'dataloader_num_workers': 16, 'dataloader_prefetch_factor': 2, 'past_index': -1, 'run_name': 'facebook/voxpopuli_en_split-train_wav2vec2-bart_bs96_lr0.0001_ep20.0', 'disable_tqdm': False, 'remove_unused_columns': True, 'label_names': None, 'load_best_model_at_end': True, 'metric_for_best_model': 'wer', 'greater_is_better': False, 'ignore_data_skip': False, 'fsdp': [], 'fsdp_min_num_params': 0, 'fsdp_config': {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, 'fsdp_transformer_layer_cls_to_wrap': None, 'accelerator_config': {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 
'gradient_accumulation_kwargs': None}, 'deepspeed': None, 'label_smoothing_factor': 0.05, 'optim': 'adamw_torch', 'optim_args': None, 'adafactor': False, 'group_by_length': False, 'length_column_name': 'input_length', 'report_to': ['wandb'], 'ddp_find_unused_parameters': None, 'ddp_bucket_cap_mb': None, 'ddp_broadcast_buffers': None, 'dataloader_pin_memory': True, 'dataloader_persistent_workers': False, 'skip_memory_metrics': True, 'use_legacy_prediction_loop': False, 'push_to_hub': False, 'resume_from_checkpoint': None, 'hub_model_id': None, 'hub_strategy': 'every_save', 'hub_token': '<HUB_TOKEN>', 'hub_private_repo': False, 'hub_always_push': False, 'gradient_checkpointing': False, 'gradient_checkpointing_kwargs': None, 'include_inputs_for_metrics': False, 'include_for_metrics': [], 'eval_do_concat_batches': True, 'fp16_backend': 'auto', 'evaluation_strategy': None, 'push_to_hub_model_id': None, 'push_to_hub_organization': None, 'push_to_hub_token': '<PUSH_TO_HUB_TOKEN>', 'mp_parameters': '', 'auto_find_batch_size': False, 'full_determinism': False, 'torchdynamo': None, 'ray_scope': 'last', 'ddp_timeout': 1800, 'torch_compile': False, 'torch_compile_backend': None, 'torch_compile_mode': None, 'dispatch_batches': None, 'split_batches': None, 'include_tokens_per_second': False, 'include_num_input_tokens_seen': False, 'neftune_noise_alpha': None, 'optim_target_modules': None, 'batch_eval_metrics': False, 'eval_on_start': False, 'use_liger_kernel': False, 'eval_use_gather_object': False, 'average_tokens_across_devices': False, 'sortish_sampler': False, 'predict_with_generate': True, 'generation_max_length': None, 'generation_num_beams': None, 'generation_config': None}
+ 2025-05-15 19:23:07,840 INFO MainThread:4153506 [wandb_config.py:__setitem__():154] config set model/num_parameters = 201096832 - <bound method Run._config_callback of <wandb.sdk.wandb_run.Run object at 0x7fdd43c46990>>
+ 2025-05-15 19:23:07,840 INFO MainThread:4153506 [wandb_run.py:_config_callback():1261] config_cb model/num_parameters 201096832 None
+ 2025-05-16 05:21:40,579 INFO MainThread:4153506 [wandb_run.py:_config_callback():1261] config_cb ('_wandb', 'visualize', 'model_speed2size1') {'panel_type': 'Vega2', 'panel_config': {'panelDefId': 'wandb/scatter/v0', 'fieldSettings': {'x': 'Time per step', 'y': 'Trainable parameters'}, 'stringSettings': {'title': ''}, 'transform': {'name': 'tableWithLeafColNames'}, 'userQuery': {'queryFields': [{'name': 'runSets', 'args': [{'name': 'runSets', 'value': '${runSets}'}], 'fields': [{'name': 'id', 'fields': []}, {'name': 'name', 'fields': []}, {'name': '_defaultColorIndex', 'fields': []}, {'name': 'summaryTable', 'args': [{'name': 'tableKey', 'value': 'model_speed2size1_table'}], 'fields': []}]}]}}} None
+ 2025-05-16 05:21:40,886 INFO MainThread:4153506 [wandb_run.py:_config_callback():1261] config_cb ('_wandb', 'visualize', 'model_speed2size2') {'panel_type': 'Vega2', 'panel_config': {'panelDefId': 'wandb/scatter/v0', 'fieldSettings': {'x': 'Time per step', 'y': 'Total parameters'}, 'stringSettings': {'title': ''}, 'transform': {'name': 'tableWithLeafColNames'}, 'userQuery': {'queryFields': [{'name': 'runSets', 'args': [{'name': 'runSets', 'value': '${runSets}'}], 'fields': [{'name': 'id', 'fields': []}, {'name': 'name', 'fields': []}, {'name': '_defaultColorIndex', 'fields': []}, {'name': 'summaryTable', 'args': [{'name': 'tableKey', 'value': 'model_speed2size2_table'}], 'fields': []}]}]}}} None
+ 2025-05-16 05:21:40,974 INFO MainThread:4153506 [wandb_run.py:_finish():2112] finishing run xhorni20-fitvut/seq2seq_encoder-decoder_vox/7xkscxrj
+ 2025-05-16 05:21:40,975 INFO MainThread:4153506 [wandb_run.py:_atexit_cleanup():2342] got exitcode: 0
+ 2025-05-16 05:21:40,975 INFO MainThread:4153506 [wandb_run.py:_restore():2324] restore
+ 2025-05-16 05:21:40,975 INFO MainThread:4153506 [wandb_run.py:_restore():2330] restore done
+ 2025-05-16 05:21:43,110 INFO MsgRouterThr:4153506 [mailbox.py:close():115] Closing mailbox, abandoning 1 handles.
+ 2025-05-16 05:21:43,111 INFO MainThread:4153506 [wandb_run.py:_footer_history_summary_info():3958] rendering history
+ 2025-05-16 05:21:43,112 INFO MainThread:4153506 [wandb_run.py:_footer_history_summary_info():3990] rendering summary
+ 2025-05-16 05:21:43,112 INFO MainThread:4153506 [wandb_run.py:_footer_sync_info():3919] logging synced files
wandb/run-20250515_192303-7xkscxrj/run-7xkscxrj.wandb ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5c576cb9e4a59adc74876e8b216885ecc868eb504703fd30c956b98a33e5571e
+ size 19765491