## Healing / Finetune

I healed the model with a full-weight DPO finetune for 139k samples (3.15 epochs), followed by a LoRA finetune (r=128, alpha=256) for 73k samples (1.67 epochs).
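
A minimal sketch of what the second (LoRA) healing stage could look like with Hugging Face peft, using the r=128 / alpha=256 values above. The base-model path, target modules, and dropout are illustrative assumptions, and the full-weight DPO stage is not shown.

```python
# Minimal sketch of the LoRA healing stage (r=128, alpha=256) using peft.
# The model path, target modules, and dropout are illustrative assumptions,
# not the exact configuration used for this model.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("path/to/pruned-dpo-healed-model")

lora_cfg = LoraConfig(
    r=128,              # LoRA rank
    lora_alpha=256,     # LoRA alpha (scaling factor)
    lora_dropout=0.05,  # assumed, not stated above
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # sanity-check the trainable parameter count
```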
Prior to healing, the model returned absolute gibberish to any prompt, rarely stringing two real words together. For example, given "2+2=", it might return "Mahmisan Pannpyout Na RMITa CMI TTi GP BP GP RSi TBi DD PS..."

The results are pretty good! The model has issues, but could have legitimate uses. It can carry on a conversation. It's certainly usable, if not useful.

Truthfulness and commonsense reasoning suffered the least from the prune / were healed the best. Knowledge and complex reasoning suffered the most.

This model has 67% the parameters of the original, and has:

- ~65% the Hellaswag score
- ~85% the Winogrande score
- ~45% the MMLU score

An average of 69% the benchmark scores for 67% the parameters, not bad! (Note: I had issues running the GSM8K and BBH benchmarks.)
I do believe it could be much better if the pruning were done in stages (say, 4 layers at a time) with some healing in between, and with longer healing at the end on a more diverse dataset.

This size should allow for:
- Q8 or Q6 inference on 6GB VRAM
- Q5 inference on 4GB VRAM
- Fine-tuning on ... well, with less VRAM than an 8B model
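
As a back-of-the-envelope check on those VRAM figures, here is a rough weight-only estimate assuming ~67% of an 8B-parameter base and approximate GGUF bits-per-weight values; KV cache, activations, and runtime overhead are not counted, so real usage will be somewhat higher.

```python
# Rough weight-memory arithmetic behind the VRAM claims above.
# Assumes ~67% of an 8B-parameter base and approximate GGUF bits-per-weight;
# KV cache, activations, and runtime overhead are not included.
params = 0.67 * 8e9  # ~5.4B parameters after pruning

for quant, bpw in [("Q8_0", 8.5), ("Q6_K", 6.6), ("Q5_K_M", 5.7)]:
    gib = params * bpw / 8 / 2**30
    print(f"{quant}: ~{gib:.1f} GiB of weights")
```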
And of course, as stated, it was a test of significant pruning, and of pruning & healing an instruct-tuned model. As a test, I think it's definitely successful.

## Mergekit Details