---
title: Phi-4 Unsloth Training
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.17.0
app_file: app.py
pinned: false
license: mit
---
# Phi-4 Unsloth Optimized Training
This Space is dedicated to training Microsoft's Phi-4 model with Unsloth optimizations for better performance and efficiency. The training process uses 4-bit quantization and advanced memory optimizations.
## Installation
This Hugging Face Space automatically installs dependencies from requirements.txt. The following packages are included:
### Essential Dependencies
- **unsloth** (>=2024.3): Required for optimized 4-bit training
- **peft** (>=0.9.0): Required for parameter-efficient fine-tuning
- **transformers** (>=4.36.0): For model architecture and tokenization
- **einops**: Required by Unsloth for tensor manipulation
- **sentencepiece**: Required for tokenization
### Optional Dependencies
- **flash-attn**: Speeds up attention computation (not installed by default because it can cause build issues)
## Features
- 4-bit quantization using Unsloth
- Optimized training pipeline
- Cognitive dataset integration
- Advanced memory management
- Gradient checkpointing
- Sequential data processing
## Configuration Files
- `transformers_config.json`: Model and training parameters
- `hardware_config.json`: Hardware-specific optimizations
- `dataset_config.json`: Dataset processing settings
- `requirements.txt`: Required dependencies
## Training Process
The training utilizes the following optimizations:
- Unsloth's 4-bit quantization
- Custom chat templates for Phi-4
- Paper-order preservation
- Efficient memory usage
- Gradient accumulation
## Dataset
Training uses the cognitive dataset with:
- Maintained paper order
- Proper metadata handling
- Optimized sequence length
- Efficient batching
## Hardware Requirements
- GPU: A10G or better
- VRAM: 24GB minimum
- RAM: 32GB recommended
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# Phase 1: Domain Adaptation (Unsupervised)
This directory contains the code and configuration for domain adaptation of the phi-4-unsloth-bnb-4bit model to the cognitive science domain. This phase produces our domain-adapted model: [George-API/phi-4-research-assistant](https://huggingface.co/George-API/phi-4-research-assistant).
## Overview
Domain adaptation is the first phase of our training process, where we expose the model to a large corpus of cognitive science texts to help it learn domain-specific vocabulary, concepts, and patterns. This phase prepares the model for the more focused supervised fine-tuning in Phase 2.
## Files
### Core Training Files
- `run_transformers_training.py`: Main script for domain adaptation
- `transformers_config.json`: Model and training parameters
- `hardware_config.json`: Hardware-specific optimizations
- `dataset_config.json`: Dataset loading and processing settings
- `requirements.txt`: Required Python packages
### Analysis & Utilities
- `check_tokenization.py`: Script to analyze token distributions
- `update_space.py`: Hugging Face Space update utility
- `.env`: Environment variables (API tokens, etc.)
## Setup
1. **Environment Setup**:
```bash
python -m venv venv
source venv/bin/activate # or `venv\Scripts\activate` on Windows
pip install -r requirements.txt
```
2. **Environment Variables**:
Create `.env` file with:
```
HUGGINGFACE_TOKEN=your_token_here
```
3. **Verify Setup**:
```bash
python check_tokenization.py # Ensures tokenizer works
```
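The token from step 2 can also be picked up programmatically. Below is a minimal sketch, assuming `python-dotenv` is installed (it is not listed in `requirements.txt`) and that the token is used to authenticate Hub pushes via `huggingface_hub`:

```python
# Hypothetical helper: load the Hub token from .env and authenticate.
# Assumes python-dotenv is available; it is not part of requirements.txt.
import os

from dotenv import load_dotenv
from huggingface_hub import login

load_dotenv()  # reads .env in the current directory
token = os.getenv("HUGGINGFACE_TOKEN")
if token is None:
    raise RuntimeError("HUGGINGFACE_TOKEN is not set in .env")
login(token=token)  # authenticates checkpoint pushes to the Hub
```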
## How It Works
1. **Data Loading**: Loads pre-tokenized data from the Hugging Face dataset
2. **Sequential Processing**: Processes data in order, maintaining the integrity of research papers
3. **Efficient Training**: Uses pre-quantized Unsloth 4-bit model for memory-efficient and faster training
4. **Checkpointing**: Saves regular checkpoints and pushes to Hub
5. **Monitoring**: Logs detailed metrics and statistics during training
6. **Model Publishing**: Pushes the trained model to Hugging Face Hub
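As a rough sketch of steps 1 and 3, the pre-tokenized dataset and the pre-quantized model might be loaded as follows; the dataset repository id shown is a placeholder, since the real id is read from `dataset_config.json`:

```python
# Minimal sketch of steps 1 and 3: load the pre-tokenized dataset and the
# pre-quantized 4-bit model. The dataset repo id is a placeholder.
from datasets import load_dataset
from unsloth import FastLanguageModel

dataset = load_dataset("George-API/cognitive-dataset", split="train")  # placeholder id

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/phi-4-unsloth-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,   # model is already quantized; this keeps it in 4-bit
    dtype=None,          # let Unsloth pick bfloat16 on supported GPUs
)
```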
## Key Features
### Memory-Efficient Training
The training setup is optimized for A10G GPUs:
- Uses pre-quantized 4-bit model (no additional quantization needed)
- Gradient checkpointing for memory efficiency
- Flash attention for faster training
- bfloat16 mixed precision training
- Optimized batch sizes for maximum throughput
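A minimal sketch of how these options map onto `TrainingArguments`; the actual values live in `transformers_config.json` and are applied by `run_transformers_training.py`:

```python
# Sketch of the memory-related training flags described above.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,
    bf16=True,                    # bfloat16 mixed precision
    gradient_checkpointing=True,  # trade compute for memory
)
```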
### Sequential Processing
The training script ensures that chunks from the same research paper are processed together by:
- Sorting the dataset by ID
- Using a SequentialSampler to maintain order
- Processing chunks sequentially (average 1,673 tokens per chunk)
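A minimal sketch of that ordering logic, assuming an `id` column and a placeholder dataset id (the real values come from `dataset_config.json`):

```python
# Sort by id so chunks from the same paper are adjacent, then iterate in order.
from datasets import load_dataset
from torch.utils.data import SequentialSampler

dataset = load_dataset("George-API/cognitive-dataset", split="train")  # placeholder id
dataset = dataset.sort("id")  # chunks from the same paper become adjacent

sampler = SequentialSampler(dataset)  # yields indices 0, 1, 2, ... without shuffling
print([dataset[i]["id"] for i in list(sampler)[:5]])  # first few ids, in paper order
```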
### Data Collator
The `SimpleDataCollator` class:
- Preserves pre-tokenized data format
- Processes each entry independently
- Provides detailed logging of processing statistics
- Handles errors gracefully
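The real class lives in `run_transformers_training.py`; the following is only a hypothetical minimal version illustrating the behaviour listed above (pad pre-tokenized entries, mirror them into labels, count and skip malformed entries):

```python
# Hypothetical minimal collator sketch; not the actual implementation.
import logging
import torch

logger = logging.getLogger(__name__)

class SimpleDataCollator:
    def __init__(self, tokenizer, max_seq_length=2048):
        self.pad_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id
        self.max_seq_length = max_seq_length
        self.stats = {"processed": 0, "skipped": 0}

    def __call__(self, examples):
        kept = []
        for ex in examples:  # each entry is handled independently
            try:
                kept.append(ex["input_ids"][: self.max_seq_length])
                self.stats["processed"] += 1
            except (KeyError, TypeError):
                self.stats["skipped"] += 1  # skip malformed entries instead of crashing
        logger.debug("collator stats: %s", self.stats)

        max_len = max(len(ids) for ids in kept)
        input_ids = torch.full((len(kept), max_len), self.pad_id, dtype=torch.long)
        attention_mask = torch.zeros_like(input_ids)
        for i, ids in enumerate(kept):
            input_ids[i, : len(ids)] = torch.tensor(ids, dtype=torch.long)
            attention_mask[i, : len(ids)] = 1
        labels = input_ids.masked_fill(attention_mask == 0, -100)  # padding ignored in the loss
        return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}
```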
### Checkpointing
The training process:
- Saves a checkpoint every 200 steps
- Pushes each checkpoint to the Hub on save
- Keeps up to 5 recent checkpoints
- Automatically resumes from the latest checkpoint if interrupted
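A sketch of how these checkpointing settings look as `TrainingArguments` (actual values are defined in `transformers_config.json`):

```python
# Checkpointing and Hub-push flags corresponding to the list above.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs",
    save_strategy="steps",
    save_steps=200,              # checkpoint every 200 steps
    save_total_limit=5,          # keep only the 5 most recent checkpoints
    push_to_hub=True,
    hub_strategy="every_save",   # push to the Hub on every save
    hub_model_id="George-API/phi-4-research-assistant",
)

# Resuming picks up the latest checkpoint in output_dir if one exists:
# trainer.train(resume_from_checkpoint=True)
```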
## Hardware Requirements
This training setup is optimized for:
- 2x NVIDIA A10G GPUs (24GB VRAM each)
- 92GB System RAM
- CUDA 11.8 or higher
Memory breakdown per GPU:
- Model (4-bit): ~3.5GB
- Optimizer states: ~1GB
- Batch memory: ~2GB
- Peak usage: 18-20GB
- Safe headroom: 4-6GB
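To compare a live run against the numbers above, a small helper like the following can be used (an illustration, not part of the training script):

```python
# Print current, peak, and total memory per GPU in GiB.
import torch

for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i) / 1024**3
    peak = torch.cuda.max_memory_allocated(i) / 1024**3
    total = torch.cuda.get_device_properties(i).total_memory / 1024**3
    print(f"GPU {i}: {allocated:.1f} GiB allocated, {peak:.1f} GiB peak, {total:.1f} GiB total")
```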
## Configuration
Key parameters in `transformers_config.json`:
- `model_name`: unsloth/phi-4-unsloth-bnb-4bit
- `learning_rate`: 2e-5
- `num_train_epochs`: 3
- `per_device_train_batch_size`: 16
- `gradient_accumulation_steps`: 4
- `effective_batch_size`: 128 (16 * 4 * 2 GPUs)
- `max_seq_length`: 2048
- `lr_scheduler_type`: "cosine"
- `warmup_ratio`: 0.03
- `neftune_noise_alpha`: 5
The configuration is optimized for:
- Maximum memory efficiency with pre-quantized model
- Stable training with cosine learning rate schedule
- Effective gradient updates with accumulation
- Regular checkpointing and Hub updates
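As a quick sanity check of the effective batch size, the values can be read back from `transformers_config.json`; the key names below are assumed to mirror the parameter names listed above:

```python
# Effective batch size = per-device batch * gradient accumulation * number of GPUs.
import json

with open("transformers_config.json") as f:
    cfg = json.load(f)

num_gpus = 2  # 2x A10G
effective = cfg["per_device_train_batch_size"] * cfg["gradient_accumulation_steps"] * num_gpus
print(effective)  # 16 * 4 * 2 = 128
```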
## Running Domain Adaptation
To start domain adaptation:
```bash
python run_transformers_training.py
```
The script will:
1. Load the pre-quantized model and dataset
2. Apply optimized training parameters
3. Process the data sequentially
4. Train the model for 3 epochs
5. Save and push checkpoints to Hub regularly
## Using the Model
After training, you can use the domain-adapted model:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the domain-adapted model
model_name = "George-API/phi-4-research-assistant"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Generate text
input_text = "The hippocampus is involved in"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Chat Format Example
Phi-4 works best with its native chat template:
```python
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="George-API/phi-4-research-assistant",
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are an expert in cognitive science."},
    {"role": "user", "content": "Explain the role of the hippocampus in memory formation."},
]

outputs = pipe(messages, max_new_tokens=256)
print(outputs[0]["generated_text"])
```
## Expected Outcomes
After domain adaptation, the model should:
- Have a better understanding of cognitive science terminology
- Show improved performance on domain-specific tasks
- Be ready for supervised fine-tuning in Phase 2
## Next Steps
After completing domain adaptation:
1. Evaluate the model's performance on cognitive science texts
2. Proceed to Phase 2 (Supervised Fine-Tuning)
3. Use TensorBoard to analyze training metrics