{ "cells": [ { "cell_type": "markdown", "id": "21111b1f-7cce-4e8b-8337-8f0cdab5804e", "metadata": {}, "source": [ "# AutoTrain" ] }, { "cell_type": "markdown", "id": "dd09a9fd-4b90-48f3-b61c-d2349eb7f43e", "metadata": {}, "source": [ "## Imports" ] }, { "cell_type": "code", "execution_count": null, "id": "52543575-f92e-4038-ad13-30967f47eb7a", "metadata": {}, "outputs": [], "source": [ "import os\n", "import subprocess\n", "\n", "import yaml" ] }, { "cell_type": "markdown", "id": "74987944-abfb-44f8-9331-ffbb2f7698bb", "metadata": {}, "source": [ "## Config" ] }, { "cell_type": "markdown", "id": "97c25070-775a-4fb1-9694-4579250686a6", "metadata": {}, "source": [ "### Template\n", "I'm creating a template so we can iterate over each of our experiments.\n", "\n", "Here you can see a few design decisions:\n", "- We leave `project_name` and `text_column` empty so they can be overwritten later for each experiment\n", "- We log to TensorBoard. You can use wandb instead, but you would need to install it in the AutoTrain environment that runs on Spaces, which gets complex\n", "- I chose an `l4x1` from [these options](https://github.com/huggingface/autotrain-advanced/blob/2d787b2033414d06f1e9be2ea0caacad3097f5e8/src/autotrain/backends/base.py#L21)\n", "    - This is a [well priced](https://huggingface.co/pricing#spaces) way of training a 7B model\n", "    - It's also very efficient, with 24GB of VRAM\n", "- It's becoming less common to use a `valid_split`, so we set it to `None`\n", "- I run 2 epochs because the loss is still decreasing steadily, though some recommend just 1 epoch for LoRAs\n", "- It's a good idea to use `all-linear` for `target_modules` when using LoRA" ] }, { "cell_type": "code", "execution_count": null, "id": "dc2a8514-51c1-404b-8cfa-6637cc810668", "metadata": {}, "outputs": [], "source": [ "# Base config\n", "config_template = {\n", "    \"task\": \"llm-sft\",\n", "    \"base_model\": \"mistralai/Mistral-7B-Instruct-v0.3\",\n", "    \"project_name\": \"\",\n", "    \"log\": \"tensorboard\",\n", "    \"backend\": \"spaces-l4x1\",\n", "    \"data\": {\n", "        \"path\": \"derek-thomas/labeled-multiple-choice-explained-mistral-tokenized\",\n", "        \"train_split\": \"train\",\n", "        \"valid_split\": None,\n", "        \"chat_template\": \"none\",\n", "        \"column_mapping\": {\n", "            \"text_column\": \"\"\n", "        },\n", "    },\n", "    \"params\": {\n", "        \"block_size\": 1024,\n", "        \"model_max_length\": 1024,\n", "        \"epochs\": 2,\n", "        \"batch_size\": 1,\n", "        \"lr\": 3e-5,\n", "        \"peft\": True,\n", "        \"quantization\": \"int4\",\n", "        \"target_modules\": \"all-linear\",\n", "        \"padding\": \"left\",\n", "        \"optimizer\": \"adamw_torch\",\n", "        \"scheduler\": \"linear\",\n", "        \"gradient_accumulation\": 8,\n", "        \"mixed_precision\": \"bf16\",\n", "    },\n", "    \"hub\": {\n", "        \"username\": \"derek-thomas\",\n", "        \"token\": os.getenv('HF_TOKEN'),\n", "        \"push_to_hub\": True,\n", "    },\n", "}" ] }, { "cell_type": "markdown", "id": "22eb3d3a-0ab0-4f79-98c2-513a34ce1b6d", "metadata": {}, "source": [ "### Experiments\n", "Here we choose the `project_name` and `text_column` for each experiment."
] }, { "cell_type": "code", "execution_count": null, "id": "957eb2b7-feec-422f-ba46-b293d9a77c1b", "metadata": {}, "outputs": [], "source": [ "project_suffixes = [\"RFA-gpt3-5\", \"RFA-mistral\", \"FAR-gpt3-5\", \"FAR-mistral\", \"FA\"]\n", "text_columns = [\"conversation_RFA_gpt3_5\", \"conversation_RFA_mistral\", \"conversation_FAR_gpt3_5\",\n", "                \"conversation_FAR_mistral\", \"conversation_FA\"]" ] }, { "cell_type": "markdown", "id": "a5913085-83c9-4133-a90d-318fd13cc14e", "metadata": {}, "source": [ "Directory to store generated configs" ] }, { "cell_type": "code", "execution_count": null, "id": "b86702bf-f494-4951-863e-be5b8462fbd1", "metadata": {}, "outputs": [], "source": [ "output_dir = \"./autotrain_configs\"\n", "os.makedirs(output_dir, exist_ok=True)" ] }, { "cell_type": "markdown", "id": "3053d1e1-ca40-460c-8999-0787a1751d00", "metadata": {}, "source": [ "## AutoTrain for each Experiment" ] }, { "cell_type": "code", "execution_count": null, "id": "025ccd2f-de54-4ac2-9f36-f606876dcd3c", "metadata": {}, "outputs": [], "source": [ "import copy\n", "\n", "# Generate configs and run commands\n", "for project_suffix, text_column in zip(project_suffixes, text_columns):\n", "    # Deep-copy the template so its nested dicts are not mutated across experiments\n", "    config = copy.deepcopy(config_template)\n", "    config[\"project_name\"] = f\"mistral-v03-poe-{project_suffix}\"\n", "    config[\"data\"][\"column_mapping\"][\"text_column\"] = text_column\n", "\n", "    # Save the config to a YAML file\n", "    config_path = os.path.join(output_dir, f\"{text_column}.yml\")\n", "    with open(config_path, \"w\") as f:\n", "        yaml.dump(config, f)\n", "\n", "    # Run the command\n", "    print(f\"Running autotrain with config: {config_path}\")\n", "    subprocess.run([\"autotrain\", \"--config\", config_path])" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", 
"nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.10" } }, "nbformat": 4, "nbformat_minor": 5 }