derek-thomas committed on
Commit b4f4f9d · 1 Parent(s): ed0ad35

Adding autotrain

Files changed (1)
  1. 02-autotrain.ipynb +200 -0
02-autotrain.ipynb ADDED
@@ -0,0 +1,200 @@
+ {
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "21111b1f-7cce-4e8b-8337-8f0cdab5804e",
+ "metadata": {},
+ "source": [
+ "# AutoTrain"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "dd09a9fd-4b90-48f3-b61c-d2349eb7f43e",
+ "metadata": {},
+ "source": [
+ "## Imports"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "52543575-f92e-4038-ad13-30967f47eb7a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import copy\n",
+ "import os\n",
+ "import subprocess\n",
+ "\n",
+ "import yaml"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "74987944-abfb-44f8-9331-ffbb2f7698bb",
+ "metadata": {},
+ "source": [
+ "## Config"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "97c25070-775a-4fb1-9694-4579250686a6",
+ "metadata": {},
+ "source": [
+ "### Template\n",
+ "I'm creating a template so we can iterate through each of our experiments.\n",
+ "\n",
+ "Here you can see a few design decisions:\n",
+ "- We leave `project_name` and `text_column` empty so we can set them per experiment\n",
+ "- We log to TensorBoard; you can use wandb, but you would need to install it in the AutoTrain environment that runs on Spaces, which gets complex\n",
+ "- I chose an `l4x1` from [these options](https://github.com/huggingface/autotrain-advanced/blob/2d787b2033414d06f1e9be2ea0caacad3097f5e8/src/autotrain/backends/base.py#L21)\n",
+ "  - This is a [well-priced](https://huggingface.co/pricing#spaces) way of training a 7B model\n",
+ "  - At 24GB of VRAM it's also quite efficient\n",
+ "- It's becoming less common to use a `valid_split`\n",
+ "- I run 2 epochs since the loss still decreases steadily, though some say you should do just 1 for LoRAs\n",
+ "- It's a good idea to use `all-linear` when using LoRA"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "dc2a8514-51c1-404b-8cfa-6637cc810668",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Base config\n",
+ "config_template = {\n",
+ "    \"task\": \"llm-sft\",\n",
+ "    \"base_model\": \"mistralai/Mistral-7B-Instruct-v0.3\",\n",
+ "    \"project_name\": \"\",\n",
+ "    \"log\": \"tensorboard\",\n",
+ "    \"backend\": \"spaces-l4x1\",\n",
+ "    \"data\": {\n",
+ "        \"path\": \"derek-thomas/labeled-multiple-choice-explained-mistral-tokenized\",\n",
+ "        \"train_split\": \"train\",\n",
+ "        \"valid_split\": None,\n",
+ "        \"chat_template\": \"none\",\n",
+ "        \"column_mapping\": {\n",
+ "            \"text_column\": \"\"\n",
+ "        },\n",
+ "    },\n",
+ "    \"params\": {\n",
+ "        \"block_size\": 1024,\n",
+ "        \"model_max_length\": 1024,\n",
+ "        \"epochs\": 2,\n",
+ "        \"batch_size\": 1,\n",
+ "        \"lr\": 3e-5,\n",
+ "        \"peft\": True,\n",
+ "        \"quantization\": \"int4\",\n",
+ "        \"target_modules\": \"all-linear\",\n",
+ "        \"padding\": \"left\",\n",
+ "        \"optimizer\": \"adamw_torch\",\n",
+ "        \"scheduler\": \"linear\",\n",
+ "        \"gradient_accumulation\": 8,\n",
+ "        \"mixed_precision\": \"bf16\",\n",
+ "    },\n",
+ "    \"hub\": {\n",
+ "        \"username\": \"derek-thomas\",\n",
+ "        \"token\": os.getenv('HF_TOKEN'),\n",
+ "        \"push_to_hub\": True,\n",
+ "    },\n",
+ "}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "22eb3d3a-0ab0-4f79-98c2-513a34ce1b6d",
+ "metadata": {},
+ "source": [
+ "### Experiments\n",
+ "Here we choose the `project_name` and `text_column` for each experiment."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "957eb2b7-feec-422f-ba46-b293d9a77c1b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "project_suffixes = [\"RFA-gpt3-5\", \"RFA-mistral\", \"FAR-gpt3-5\", \"FAR-mistral\", \"FA\"]\n",
+ "text_columns = [\"conversation_RFA_gpt3_5\", \"conversation_RFA_mistral\", \"conversation_FAR_gpt3_5\",\n",
+ "                \"conversation_FAR_mistral\", \"conversation_FA\"]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a5913085-83c9-4133-a90d-318fd13cc14e",
+ "metadata": {},
+ "source": [
+ "Directory to store generated configs"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b86702bf-f494-4951-863e-be5b8462fbd1",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "output_dir = \"./autotrain_configs\"\n",
+ "os.makedirs(output_dir, exist_ok=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3053d1e1-ca40-460c-8999-0787a1751d00",
+ "metadata": {},
+ "source": [
+ "## AutoTrain for each Experiment"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "025ccd2f-de54-4ac2-9f36-f606876dcd3c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Generate configs and run commands\n",
+ "for project_suffix, text_column in zip(project_suffixes, text_columns):\n",
+ "    # Deep copy so we don't mutate the nested dicts in config_template\n",
+ "    config = copy.deepcopy(config_template)\n",
+ "    config[\"project_name\"] = f\"mistral-v03-poe-{project_suffix}\"\n",
+ "    config[\"data\"][\"column_mapping\"][\"text_column\"] = text_column\n",
+ "\n",
+ "    # Save the config to a YAML file\n",
+ "    config_path = os.path.join(output_dir, f\"{text_column}.yml\")\n",
+ "    with open(config_path, \"w\") as f:\n",
+ "        yaml.dump(config, f)\n",
+ "\n",
+ "    # Run the command\n",
+ "    print(f\"Running autotrain with config: {config_path}\")\n",
+ "    subprocess.run([\"autotrain\", \"--config\", config_path])"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.11.10"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+ }
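One detail worth calling out in the per-experiment loop: `dict.copy()` is shallow, so the nested `data` and `column_mapping` dicts are shared between the copy and `config_template`, and setting `text_column` on the copy rewrites the template too. A minimal sketch of the difference, using a dict whose shape mirrors the notebook's template (values are illustrative):

```python
import copy

# Nested config shaped like the notebook's template
template = {"project_name": "", "data": {"column_mapping": {"text_column": ""}}}

shallow = template.copy()
shallow["data"]["column_mapping"]["text_column"] = "conversation_FA"
# The shallow copy shares the nested dicts, so the template was mutated too
print(template["data"]["column_mapping"]["text_column"])  # -> "conversation_FA"

template["data"]["column_mapping"]["text_column"] = ""  # reset
deep = copy.deepcopy(template)
deep["data"]["column_mapping"]["text_column"] = "conversation_FA"
# The deep copy owns its own nested dicts; the template is untouched
print(template["data"]["column_mapping"]["text_column"])  # -> ""
```

In this notebook every mutated key is reassigned on each iteration, so a shallow copy happens to still produce correct YAML files, but `copy.deepcopy` keeps the template pristine and avoids surprises if the loop ever changes.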