Zhangchen Xu committed on
Commit 1b8b63a · verified · 1 Parent(s): 17b4f72

Update README.md

Files changed (1): README.md (+301 −28)
README.md CHANGED
@@ -1,53 +1,213 @@
---
license: llama3
- base_model: Magpie-Align/Llama-3-8B-Magpie-Mix-300KMT-150KR
+ base_model: meta-llama/Meta-Llama-3-8B
tags:
- alignment-handbook
+ - axolotl
- trl
- dpo
- - generated_from_trainer
- - trl
- - dpo
- - alignment-handbook
+ - sft
- generated_from_trainer
datasets:
- princeton-nlp/llama3-ultrafeedback-armorm
+ - Magpie-Align/Magpie-Pro-MT-300K-v0.1
+ - Magpie-Align/Magpie-Reasoning-150K
model-index:
- - name: Llama-3-8B-Magpi-Pro-MTR-UltraDPO-08
+ - name: Magpie-Align/Llama-3-8B-Magpie-Align-v0.2
  results: []
+ language:
+ - en
---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
- [<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="200" height="32"/>](https://wandb.ai/uw-nsl/huggingface/runs/22x1s5jw)
- # Llama-3-8B-Magpi-Pro-MTR-UltraDPO-08
- This model is a fine-tuned version of [Magpie-Align/Llama-3-8B-Magpie-Mix-300KMT-150KR](https://huggingface.co/Magpie-Align/Llama-3-8B-Magpie-Mix-300KMT-150KR) on the princeton-nlp/llama3-ultrafeedback-armorm dataset.
- It achieves the following results on the evaluation set:
- - Loss: 0.3821
- - Rewards/chosen: -4.4702
- - Rewards/rejected: -6.1325
- - Rewards/accuracies: 0.8669
- - Rewards/margins: 1.6623
- - Logps/rejected: -886.8052
- - Logps/chosen: -726.5228
- - Logits/rejected: -1.3434
- - Logits/chosen: -1.3243
- ## Model description
- More information needed
- ## Intended uses & limitations
- More information needed
- ## Training and evaluation data
- More information needed
- ## Training procedure
+ ![Magpie](https://cdn-uploads.huggingface.co/production/uploads/653df1323479e9ebbe3eb6cc/FWWILXrAGNwWr52aghV0S.png)
+ ## 🔥 Chat with Magpie [Here](https://huggingface.co/spaces/flydust/Chat-with-Magpie)!
+
+ # 🐦 Llama-3-8B-Magpie-Align-v0.2
+
+ Project Web: [https://magpie-align.github.io/](https://magpie-align.github.io/)
+
+ Online Model Demo: [https://huggingface.co/spaces/flydust/Chat-with-Magpie](https://huggingface.co/spaces/flydust/Chat-with-Magpie)
+
+ arXiv Technical Report: [https://arxiv.org/abs/2406.08464](https://arxiv.org/abs/2406.08464)
+
+ Code: [https://github.com/magpie-align/magpie](https://github.com/magpie-align/magpie)
+
+ ## 🧐 About This Model
+
+ This model is an aligned version of [meta-llama/Meta-Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B). We apply the following two-stage pipeline:
+ - We first perform SFT on the [Magpie-Align/Magpie-Pro-MT-300K-v0.1](https://huggingface.co/datasets/Magpie-Align/Magpie-Pro-MT-300K-v0.1) and [Magpie-Align/Magpie-Reasoning-150K](https://huggingface.co/datasets/Magpie-Align/Magpie-Reasoning-150K) datasets -> [Magpie-Align/Llama-3-8B-Magpie-Align-SFT-v0.2](https://huggingface.co/Magpie-Align/Llama-3-8B-Magpie-Align-SFT-v0.2)
+ - We then perform DPO on the [princeton-nlp/llama3-ultrafeedback-armorm](https://huggingface.co/datasets/princeton-nlp/llama3-ultrafeedback-armorm) dataset.
+
+ The overall performance is even better than the official Llama-3-8B-Instruct model!
+
+ - **Alpaca Eval 2 (vs GPT-4-Turbo-1106): 49.86 (LC), 51.98 (WR)**
+ - **Alpaca Eval 2 (vs Llama-3-8B-Instruct): 75.17 (LC), 78.20 (WR)**
+ - **Arena Hard: 37.5**
+ - **WildBench WB-Score: 42.7**
+ - **ZeroEval MMLU: 46.70**
+
+ ## 🔥 Model Performance
+
+ We compare our Llama-3-8B-Magpie-Align with the official Llama-3-8B-Instruct and other **open-aligned LLMs** that have been fine-tuned from base models and have publicly released their training datasets. MT-Bench scores are reported as round 1 / round 2 / average; LC = length-controlled win rate, WR = raw win rate. The results are as follows:
+
+ | Aligned Model ID | MT-Bench (R1 / R2 / Avg) | Alpaca Eval 2, vs GPT-4-Turbo-1106 (LC / WR) | Alpaca Eval 2, vs Llama-3-8B-Instruct (LC / WR) | Arena Hard |
+ | :--- | :---: | :---: | :---: | :---: |
+ | meta-llama/Meta-Llama-3-8B-Instruct | 8.31 / 7.65 / 7.98 | 22.92 / 22.57 | 50 / 50 | 20.6 |
+ | princeton-nlp/Llama-3-Base-8B-SFT-DPO | 8.12 / 7.23 / 7.67 | 17.71 / 15.34 | 43.73 / 38.80 | 14.8 |
+ | NousResearch/Hermes-2-Pro-Llama-3-8B | 8.05 / 7.35 / 7.70 | 15.60 / 12.86 | 36.37 / 30.52 | 11.5 |
+ | allenai/llama-3-tulu-2-dpo-8b | 7.71 / 7.15 / 7.43 | 14.89 / 14.8 | 35.43 / 35.42 | 11.7 |
+ | cognitivecomputations/dolphin-2.9-llama3-8b | 7.97 / 6.98 / 7.47 | 12.50 / 8.79 | 32.67 / 22.80 | 8.2 |
+ | openchat/openchat-3.6-8b-20240522 | 7.83 / 7.23 / 7.53 | 17.70 / 12.53 | 41.30 / 30.79 | 6.7 |
+ | Magpie-Align/Llama-3-8B-Magpie-Align-v0.1 | 8.01 / 7.63 / 7.82 | 38.52 / 38.47 | 69.37 / 70.05 | 32.4 |
+ | Magpie-Align/Llama-3-8B-Magpie-Align-v0.2 | 7.81 / 7.64 / 7.73 | 49.86 / 51.98 | 75.17 / 78.20 | 37.5 |
+
+ ## 👀 Other Information
+
+ **License**: Please follow the [Meta Llama 3 Community License](https://llama.meta.com/llama3/license).
+
+ **Conversation Template**: Please use the Llama 3 **official chat template** for the best performance.
+
+ **How to use it?** Please check the official [Llama 3 repository](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct#how-to-use) for detailed instructions, or start from the sketch below. Simply replace the original `model_id` with `Magpie-Align/Llama-3-8B-Magpie-Align-v0.2`.
+
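+ A minimal usage sketch (assumptions: it mirrors the official Llama 3 chat example with only `model_id` swapped in; the sampling settings are that example's illustrative defaults, not values we tuned):
+
+ ```python
+ import torch
+ import transformers
+
+ model_id = "Magpie-Align/Llama-3-8B-Magpie-Align-v0.2"
+
+ # Build a text-generation pipeline; the tokenizer ships with the Llama 3 chat template.
+ pipeline = transformers.pipeline(
+     "text-generation",
+     model=model_id,
+     model_kwargs={"torch_dtype": torch.bfloat16},
+     device_map="auto",
+ )
+
+ messages = [
+     {"role": "system", "content": "You are a helpful assistant."},
+     {"role": "user", "content": "Who are you?"},
+ ]
+
+ # Render the conversation with the tokenizer's built-in (official) chat template.
+ prompt = pipeline.tokenizer.apply_chat_template(
+     messages, tokenize=False, add_generation_prompt=True
+ )
+
+ # Llama 3 ends assistant turns with <|eot_id|> in addition to the EOS token.
+ terminators = [
+     pipeline.tokenizer.eos_token_id,
+     pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
+ ]
+
+ outputs = pipeline(
+     prompt,
+     max_new_tokens=256,
+     eos_token_id=terminators,
+     do_sample=True,
+     temperature=0.6,
+     top_p=0.9,
+ )
+ print(outputs[0]["generated_text"][len(prompt):])
+ ```
+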
+ The detailed training pipeline is as follows.
+
+ ## Stage 1: Supervised Fine-tuning
+
+ We use [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) for SFT.
+
+ ### Training hyperparameters
+
+ The following hyperparameters were used during training:
+ - learning_rate: 2e-05
+ - train_batch_size: 1
+ - eval_batch_size: 1
+ - seed: 42
+ - distributed_type: multi-GPU
+ - num_devices: 4
+ - gradient_accumulation_steps: 32
+ - total_train_batch_size: 128
+ - total_eval_batch_size: 4
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+ - lr_scheduler_type: cosine
+ - lr_scheduler_warmup_steps: 79
+ - num_epochs: 2
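+
+ (Sanity check: total_train_batch_size = train_batch_size × num_devices × gradient_accumulation_steps = 1 × 4 × 32 = 128.)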
+
+ ### Training results
+
+ | Training Loss | Epoch | Step | Validation Loss |
+ |:-------------:|:------:|:----:|:---------------:|
+ | 0.8241 | 0.0024 | 1 | 0.8068 |
+ | 0.5623 | 0.2007 | 85 | 0.5087 |
+ | 0.4704 | 0.4014 | 170 | 0.4326 |
+ | 0.4478 | 0.6020 | 255 | 0.4079 |
+ | 0.4256 | 0.8027 | 340 | 0.3948 |
+ | 0.4261 | 1.0034 | 425 | 0.3867 |
+ | 0.3662 | 1.1844 | 510 | 0.3850 |
+ | 0.363 | 1.3851 | 595 | 0.3823 |
+ | 0.357 | 1.5858 | 680 | 0.3813 |
+ | 0.3677 | 1.7865 | 765 | 0.3813 |
+
+ ### Framework versions
+
+ - Transformers 4.42.3
+ - Pytorch 2.3.1+cu121
+ - Datasets 2.19.1
+ - Tokenizers 0.19.1
+
+ *Internal name for identification: Llama-3-8B-Magpie-Mix-300KMT-150KR.* Please change the model name accordingly in the Axolotl config below.
+
+ [<img src="https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/axolotl-ai-cloud/axolotl)
+ <details><summary>See axolotl config</summary>
+
+ axolotl version: `0.4.1`
+ ```yaml
+ base_model: meta-llama/Meta-Llama-3-8B
+ model_type: LlamaForCausalLM
+ tokenizer_type: AutoTokenizer
+
+ load_in_8bit: false
+ load_in_4bit: false
+ strict: false
+
+ datasets:
+   - path: Magpie-Align/Magpie-Reasoning-150K
+     type: sharegpt
+     conversation: llama3
+   - path: Magpie-Align/Magpie-Pro-MT-300K-v0.1
+     type: sharegpt
+     conversation: llama3
+ dataset_prepared_path: last_run_prepared
+ val_set_size: 0.001
+ output_dir: axolotl_out/Llama-3-8B-Magpie-Mix-300KMT-150KR
+
+ sequence_len: 8192
+ sample_packing: true
+ eval_sample_packing: false
+ pad_to_sequence_len: true
+
+ wandb_project: SynDa
+ wandb_entity:
+ wandb_watch:
+ wandb_name: Llama-3-8B-Magpie-Mix-300KMT-150KR
+ wandb_log_model:
+ hub_model_id: Magpie-Align/Llama-3-8B-Magpie-Mix-300KMT-150KR
+
+ gradient_accumulation_steps: 32
+ micro_batch_size: 1
+ num_epochs: 2
+ optimizer: paged_adamw_8bit
+ lr_scheduler: cosine
+ learning_rate: 2e-5
+
+ train_on_inputs: false
+ group_by_length: false
+ bf16: auto
+ fp16:
+ tf32: false
+
+ gradient_checkpointing: true
+ gradient_checkpointing_kwargs:
+   use_reentrant: false
+ early_stopping_patience:
+ resume_from_checkpoint:
+ logging_steps: 1
+ xformers_attention:
+ flash_attention: true
+
+ warmup_ratio: 0.1
+ evals_per_epoch: 5
+ eval_table_size:
+ saves_per_epoch: 1
+ debug:
+ deepspeed:
+ weight_decay: 0.0
+ fsdp:
+ fsdp_config:
+ special_tokens:
+   pad_token: <|end_of_text|>
+ ```
+
+ </details><br>
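+
+ With the config above saved as a YAML file (e.g., a hypothetical `sft.yaml`), runs like this are typically launched through the Axolotl 0.4.x CLI: `accelerate launch -m axolotl.cli.train sft.yaml`.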
+
+ ## Stage 2: Direct Preference Optimization
+
+ We use the [alignment handbook](https://github.com/huggingface/alignment-handbook) for DPO.
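+
+ For reference, DPO trains on preference pairs (chosen response $y_w$, rejected response $y_l$) with the standard objective of Rafailov et al. (2023), where the SFT checkpoint above serves as the frozen reference policy $\pi_{\text{ref}}$ and $\beta = 0.01$ comes from the config below:
+
+ $$\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]$$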
 
### Training hyperparameters

@@ -82,3 +242,116 @@ The following hyperparameters were used during training:
- Pytorch 2.3.1+cu121
- Datasets 2.20.0
- Tokenizers 0.19.1
+
+ <details><summary>See alignment handbook config</summary>
+
+ ```yaml
+ # Customized Configs
+ model_name_or_path: Magpie-Align/Llama-3-8B-Magpie-Mix-300KMT-150KR
+ hub_model_id: Magpie-Align/Llama-3-8B-Magpie-Align-v0.2-RC
+ output_dir: alignment_handbook_out/Llama-3-8B-Magpie-Align-v0.2-RC
+ run_name: Llama-3-8B-Magpie-Align-v0.2-RC
+
+ dataset_mixer:
+   princeton-nlp/llama3-ultrafeedback-armorm: 1.0
+ dataset_splits:
+   - train
+   - test
+ preprocessing_num_workers: 24
+
+ # DPOTrainer arguments
+ bf16: true
+ beta: 0.01
+ learning_rate: 0.8e-6
+ gradient_accumulation_steps: 8
+ per_device_train_batch_size: 2
+ per_device_eval_batch_size: 4
+ num_train_epochs: 1
+ max_length: 2048
+ max_prompt_length: 1800
+ warmup_ratio: 0.1
+ logging_steps: 1
+ lr_scheduler_type: cosine
+ optim: adamw_torch
+
+ torch_dtype: null
+ use_flash_attention_2: true
+ do_eval: true
+ evaluation_strategy: steps
+ eval_steps: 100
+ gradient_checkpointing: true
+ gradient_checkpointing_kwargs:
+   use_reentrant: False
+ log_level: info
+ push_to_hub: true
+ save_strategy: "steps"
+ save_steps: 100
+ save_total_limit: 1
+ seed: 42
+ report_to:
+   - wandb
+ ```
+ </details><br>
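+
+ Configs in this format are typically launched with the handbook's DPO script, along the lines of `ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml scripts/run_dpo.py dpo.yaml`, where `dpo.yaml` is a hypothetical filename for the config above and the accelerate config path follows the handbook repo layout.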
+
+ ## Downstream Performance
+
+ | Dataset (num. few-shot examples) | Llama-3-8B-Magpie-Align-v0.2 |
+ | :--- | :---: |
+ | MMLU (5) | 65.42 |
+ | ARC (25) | 63.91 |
+ | HellaSwag (25) | 81.66 |
+ | TruthfulQA (0) | 60.97 |
+ | Winogrande (5) | 73.40 |
+
+ ## Paper Abstract
+
+ <details><summary>Click Here</summary>
+ High-quality instruction data is critical for aligning large language models (LLMs). Although some models, such as Llama-3-Instruct, have open weights, their alignment data remain private, which hinders the democratization of AI. High human labor costs and a limited, predefined scope for prompting prevent existing open-source data creation methods from scaling effectively, potentially limiting the diversity and quality of public alignment datasets. Is it possible to synthesize high-quality instruction data at scale by extracting it directly from an aligned LLM? We present a self-synthesis method for generating large-scale alignment data named Magpie. Our key observation is that aligned LLMs like Llama-3-Instruct can generate a user query when we input only the left-side templates up to the position reserved for user messages, thanks to their auto-regressive nature. We use this method to prompt Llama-3-Instruct and generate 4 million instructions along with their corresponding responses. We perform a comprehensive analysis of the extracted data and select 300K high-quality instances. To compare Magpie data with other public instruction datasets, we fine-tune Llama-3-8B-Base with each dataset and evaluate the performance of the fine-tuned models. Our results indicate that in some tasks, models fine-tuned with Magpie perform comparably to the official Llama-3-8B-Instruct, despite the latter being enhanced with 10 million data points through supervised fine-tuning (SFT) and subsequent feedback learning. We also show that using Magpie solely for SFT can surpass the performance of previous public datasets utilized for both SFT and preference optimization, such as direct preference optimization with UltraFeedback. This advantage is evident on alignment benchmarks such as AlpacaEval, ArenaHard, and WildBench.
+ </details><br>
+
+ ## 📚 Citation
+
+ If you find the model, data, or code useful, please cite our paper:
+ ```
+ @article{xu2024magpie,
+   title={Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing},
+   author={Zhangchen Xu and Fengqing Jiang and Luyao Niu and Yuntian Deng and Radha Poovendran and Yejin Choi and Bill Yuchen Lin},
+   year={2024},
+   eprint={2406.08464},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL}
+ }
+ ```
+
+ Please also cite the creators of the preference datasets:
+
+ SimPO paper:
+ ```
+ @article{meng2024simpo,
+   title={{SimPO}: Simple preference optimization with a reference-free reward},
+   author={Meng, Yu and Xia, Mengzhou and Chen, Danqi},
+   journal={arXiv preprint arXiv:2405.14734},
+   year={2024}
+ }
+ ```
+
+ UltraFeedback paper:
+ ```
+ @article{cui2023ultrafeedback,
+   title={{UltraFeedback}: Boosting language models with high-quality feedback},
+   author={Cui, Ganqu and Yuan, Lifan and Ding, Ning and Yao, Guanming and Zhu, Wei and Ni, Yuan and Xie, Guotong and Liu, Zhiyuan and Sun, Maosong},
+   journal={arXiv preprint arXiv:2310.01377},
+   year={2023}
+ }
+ ```
+
+ ArmoRM paper:
+ ```
+ @article{wang2024interpretable,
+   title={Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts},
+   author={Wang, Haoxiang and Xiong, Wei and Xie, Tengyang and Zhao, Han and Zhang, Tong},
+   journal={arXiv preprint arXiv:2406.12845},
+   year={2024}
+ }
+ ```
+
+ **Questions?** Please contact [Zhangchen](https://zhangchenxu.com/) by email.