HaileyStorm committed
Commit 55ca955 · verified · 1 Parent(s): 9a04195

Update README.md

Files changed (1): README.md (+4, -0)
README.md CHANGED
@@ -77,7 +77,9 @@ Mostly, this is a test of (significant) pruning & healing an instruct-tuned model
 
 ## Healing / Finetune
 I healed the model by doing a full-weight DPO finetune for 139k samples (3.15 epochs), and then a LoRA with r=128, a=256 for 73k samples (1.67 epochs).
+
 Prior to healing, the model returned absolute gibberish to any prompt, rarely two real words together. For example, given "2+2=" it might return "Mahmisan Pannpyout Na RMITa CMI TTi GP BP GP RSi TBi DD PS..."
+
 The results are pretty good! The model has issues, but could have legitimate uses. It can carry on a conversation. It's certainly usable, if not useful.
 
 Truthfulness and commonsense reasoning suffered the least from the prune / were healed the best. Knowledge and complex reasoning suffered the most.
@@ -87,6 +89,7 @@ This model has 67% the parameters of the original, and has:
 - ~65% the Hellaswag score
 - ~85% the Winogrande score
 - ~45% the MMLU score
+
 An average of 69% the benchmark scores for 67% the parameters, not bad! (Note: I had issues running the GSM8K and BBH benchmarks.)
 I do believe it could be much better, by doing the pruning in stages (say, 4 layers at a time) with some healing in between, and longer healing at the end with a more diverse dataset.
 
@@ -99,6 +102,7 @@ This size should allow for:
 - Q8 or Q6 inference on 6GB VRAM
 - Q5 inference on 4GB VRAM
 - Fine-tuning on ... well, with less VRAM than an 8B model
+ -
 And of course, as stated, it was a test of significant pruning, and of pruning & healing an instruct-tuned model. As a test, I think it's definitely successful.
 
 ## Mergekit Details
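For reference, the two healing stages described in the first hunk (a full-weight DPO pass followed by a LoRA pass with r=128, alpha=256) could be set up roughly as below with the Hugging Face trl and peft libraries. This is a minimal sketch, not the actual training script: the model path, dataset, target modules, and every hyperparameter other than the stated LoRA rank/alpha are placeholders, and trl argument names vary between versions.

```python
# Minimal sketch of the DPO + LoRA healing setup described above (assumptions,
# not the author's script): model path, dataset, and most hyperparameters are
# placeholders; only r=128 / lora_alpha=256 come from the README.
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "path/to/pruned-model"  # hypothetical path to the pruned checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Toy preference data in the prompt/chosen/rejected format DPO expects.
preference_dataset = Dataset.from_dict({
    "prompt": ["2+2="],
    "chosen": ["4"],
    "rejected": ["Mahmisan Pannpyout Na RMITa"],
})

# Stage 2 of the healing recipe: LoRA with r=128, alpha=256.
# (Target modules are an assumption; drop peft_config below to instead run
# the full-weight stage-1 DPO pass.)
lora_cfg = LoraConfig(
    r=128,
    lora_alpha=256,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Illustrative training arguments, not the ones used for this model.
args = DPOConfig(
    output_dir="healed-model",
    per_device_train_batch_size=2,
    num_train_epochs=2,
    beta=0.1,
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=preference_dataset,
    processing_class=tokenizer,  # `tokenizer=` in older trl releases
    peft_config=lora_cfg,
)
trainer.train()
```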
 
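As a rough sanity check on the quantized-inference claims in the last hunk: at roughly 67% of an 8B model's parameters (about 5.4B), approximate llama.cpp bits-per-weight figures put the weight files within the stated VRAM budgets once cache and runtime overhead are allowed for. The parameter count and bits-per-weight values below are estimates, not numbers from the README.

```python
# Back-of-envelope check (assumptions, not README figures) of the
# "Q8/Q6 on 6GB, Q5 on 4GB" claims above.
params = 0.67 * 8.03e9  # ~5.4B parameters, assuming an 8B original

# Approximate bits-per-weight for common llama.cpp quant formats (estimates).
approx_bits_per_weight = {"Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.5}

for quant, bpw in approx_bits_per_weight.items():
    gib = params * bpw / 8 / 2**30
    print(f"{quant}: ~{gib:.1f} GiB of weights")

# Prints roughly: Q8_0 ~5.3 GiB, Q6_K ~4.1 GiB, Q5_K_M ~3.5 GiB,
# leaving some headroom for KV cache and runtime overhead on 6GB / 4GB cards.
```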