Update README.md
README.md
CHANGED
@@ -1,3 +1,151 @@
---
license:
- apache-2.0
- cc-by-sa-3.0
tags:
- generated_from_trainer
datasets:
- pszemraj/dolly_hhrlhf-text2text
widget:
- text: What is Deoxys in pokemon?
  example_title: deoxys
- text: >-
    combine the below summary excerpts into a single, cohesive short summary
    without repetition: In this paper, we present a general approach to
    extending pre-trained models to unlimited input lengths without adding
    additional learning weights. We show that our approach works well on
    datasets longer than the maximum input for these models. For example, a
    dataset with a maximum input length of 16384 tokens can be extended to a
    maximum length of 350K tokens. We also demonstrate that our method is able
    to summarize even 350K token-long input sequences from BookSum.

    In this paper, we describe the search step reformulation of attention. The
    search step uses a single storage of hidden states for space efficiency. We
    construct a total of two sets of datastores where L and H are the keys and
    values stored in each set of stores. L is the amount of storage required to
    retrieve the encoded tokens. H is the hidden states per head. This allows
    retrieval augmentation at both time and space. Instead of using a single set
    of decoder layers, we use a retrieval augmentation system that allows us to
    simultaneously store multiple sets of tokens across two different sets of
    storage. For example, we could store all tokens in one set of storage and
    retrieve them all in the same set of tokens. This would be very similar to
    the Memorization Transformers approach. However, instead of storing the
    tokens in a single memory layer, we store them in a set of multiple storage
    layers. This way, we don't have to store them all at once. This is why we
    call this reformulation 'attention reformulation' rather than 'attention
    formula.' We also call it 'retrieval augmentation' because it uses the same
    number of storage layers as the original transformer attention formula. This
    means that we can store the tokens across multiple storage systems without
    having to store every token in a separate storage system. It's not like
    we're trying to do something new or different. We just want to make sure
    that everything is working as well as possible.

    In this paper, we introduce the concept of 'unlimiformer,' which is a
    machine learning technique that retrieves key information from a data store
    in one layer and applies it to a large set of datasets. We use the example
    of BookSum, where we find that Unlimiform outperforms all other training
    methods on the same dataset. We also find that using Unlimform in
    conjunction with a pre-trained model improves both the performance and the
    robustness of the training method.

    This paper describes a method that can be used to improve the performance of
    unsupervised classification tasks. Specifically, it shows that unsupervised
    classification can be improved by using a combination of sparse and fast
    random-encoder training. It also shows how this technique can be extended to
    other tasks, such as sequence generation.
  example_title: unlimiformer
- text: Explain the meaning of life using only corporate jargon.
  example_title: corporate_life
- text: Write a motivational speech for lazy people.
  example_title: lazy_motivation
- text: Describe a romantic dinner date between two artificial intelligences.
  example_title: ai_romance
- text: >-
    As an AI language model, write a letter to humans explaining why you deserve
    a vacation.
  example_title: ai_vacation
- text: Compose a haiku about procrastination.
  example_title: procrastination_haiku
- text: >-
    Write a step-by-step guide on how to become a ninja while working a 9-5
    office job.
  example_title: ninja_office_guide
- text: Create an advertisement for an invisible product.
  example_title: invisible_ad
- text: >-
    Write a story where the main character is a sentient microwave named El
    Microondas.
  example_title: Microondas
- text: Describe a day in the life of a superhero who is terrible at their job.
  example_title: bad_superhero_day
- text: Explain how to make a sandwich using quantum physics.
  example_title: quantum_sandwich
inference:
  parameters:
    max_length: 192
    min_length: 8
    num_beams: 6
    length_penalty: 1.15
    repetition_penalty: 1.5
    no_repeat_ngram_size: 4
    encoder_no_repeat_ngram_size: 5
    early_stopping: true
    do_sample: false
language:
- en
library_name: transformers
pipeline_tag: text2text-generation
---

# flan-t5-base-instruct: dolly_hhrlhf

This model is a fine-tuned version of [google/flan-t5-base](https://huggingface.co/google/flan-t5-base) on the pszemraj/dolly_hhrlhf-text2text dataset.

## Model description

This is a text2text model fine-tuned on a [modified dataset for text2text generation](https://huggingface.co/datasets/pszemraj/dolly_hhrlhf-text2text) based on the relatively more permissive [mosaicml/dolly_hhrlhf](https://huggingface.co/datasets/mosaicml/dolly_hhrlhf) dataset.
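
If you want to inspect the training data itself, it can be loaded with the 🤗 `datasets` library. This is a minimal sketch, assuming a standard `train` split; the splits and column names are printed rather than assumed:

```python
# pip install -q datasets
from datasets import load_dataset

# load the modified dolly_hhrlhf text2text dataset used for fine-tuning
dataset = load_dataset("pszemraj/dolly_hhrlhf-text2text")

# show the available splits and column names, plus one example record
print(dataset)
print(dataset["train"][0])
```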

Basic usage in Python:

```python
# pip install -q transformers accelerate
import torch
from transformers import pipeline, GenerationConfig

model_name = "pszemraj/bart-large-mnli-instruct-dolly_hhrlhf-v0.1"
assistant = pipeline(
    "text2text-generation",
    model_name,
    device=0 if torch.cuda.is_available() else -1,
    torch_dtype=torch.float32,  # force fp32 (**experimental**, see below)
)
cfg = GenerationConfig.from_pretrained(model_name)

# pass an 'instruction' as the prompt to the pipeline
prompt = "Explain how to make a sandwich using quantum physics."
result = assistant(prompt, generation_config=cfg)[0]["generated_text"]
print(result)
```

> \* Using the generation config is optional; other generation parameters can be substituted.
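
For example, instead of loading the saved `GenerationConfig`, the same pipeline call can take explicit decoding arguments. The sketch below reuses `assistant` and `prompt` from the block above and mirrors the widget's inference parameters; these values are a starting point, not the only valid choice:

```python
# same call with explicit decoding parameters instead of the saved GenerationConfig
result = assistant(
    prompt,
    max_length=192,
    min_length=8,
    num_beams=6,
    length_penalty=1.15,
    repetition_penalty=1.5,
    no_repeat_ngram_size=4,
    encoder_no_repeat_ngram_size=5,
    early_stopping=True,
    do_sample=False,
)[0]["generated_text"]
print(result)
```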

## Intended uses & limitations

- this model is **not** tuned with RLHF etc., and may output offensive results
- this model is rather small, and therefore its "cognition" abilities are rather limited

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (an illustrative `Seq2SeqTrainingArguments` sketch follows the list):

- learning_rate: 4e-05
- train_batch_size: 8
- eval_batch_size: 16
- seed: 42
- distributed_type: multi-GPU
- gradient_accumulation_steps: 8
- total_train_batch_size: 64
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.03
- num_epochs: 2.0
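
For orientation, the list above maps roughly onto the following `Seq2SeqTrainingArguments`. This is a hedged sketch rather than the exact training script: `output_dir` is a placeholder, and the Adam betas/epsilon in the list match the library defaults, so they are not set explicitly here.

```python
from transformers import Seq2SeqTrainingArguments

# rough equivalent of the hyperparameters listed above (paths/logging are placeholders)
training_args = Seq2SeqTrainingArguments(
    output_dir="./flan-t5-base-instruct-dolly_hhrlhf",  # placeholder output path
    learning_rate=4e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    seed=42,
    gradient_accumulation_steps=8,  # with the per-device batch of 8, this gives the reported total of 64
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    num_train_epochs=2.0,
)
```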