---
title: Phi-4 Unsloth Training
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.17.0
app_file: app.py
pinned: false
license: mit
---

# Phi-4 Unsloth Optimized Training

This space is dedicated to training Microsoft's Phi-4 model using Unsloth optimizations for enhanced performance and efficiency. The training process utilizes 4-bit quantization and advanced memory optimizations.

## Installation

This Hugging Face Space automatically installs dependencies from requirements.txt. The following packages are included.

### Installation Process

The project uses a single consolidated requirements file that maintains the proper installation order of dependencies:

1. **All Dependencies (requirements.txt)**:
   - Contains all required packages in the correct installation order
   - Install with: `pip install -r requirements.txt`
   - The file is organized with clear sections:
     - Base dependencies (installed first)
     - Main dependencies
     - Optional dependencies (commented out by default)

2. **Flash Attention** (Optional):
   - For faster attention computation
   - Install with: `pip install flash-attn==2.5.2 --no-build-isolation`
   - Or uncomment the flash-attn line in requirements.txt

3. **Automated Installation**:
   - For convenience, you can use the included script:
     - Basic install: `python install_requirements.py`
     - With flash-attn: `python install_requirements.py --flash`

This approach simplifies dependency management while still maintaining the proper installation order.

### Essential Dependencies

- **unsloth** (>=2024.3): Required for optimized 4-bit training
- **peft** (>=0.9.0): Required for parameter-efficient fine-tuning
- **transformers** (>=4.36.0): For model architecture and tokenization
- **einops**: Required by Unsloth for tensor manipulation
- **sentencepiece**: Required for tokenization

### Optional Dependencies

- **flash-attn**: Optional for faster attention computation (not included by default as it can cause build issues)

## Features

- 4-bit quantization using Unsloth
- Optimized training pipeline
- Cognitive dataset integration
- Advanced memory management
- Gradient checkpointing
- Sequential data processing

## Configuration Files

- `transformers_config.json`: Model and training parameters
- `hardware_config.json`: Hardware-specific optimizations
- `dataset_config.json`: Dataset processing settings
- `requirements.txt`: Required dependencies

## Training Process

The training utilizes the following optimizations (a minimal loading sketch illustrating them appears at the end of this README, just before the Phase 1 notes):

- Unsloth's 4-bit quantization
- Custom chat templates for Phi-4
- Paper-order preservation
- Efficient memory usage
- Gradient accumulation

## Dataset

Training uses the cognitive dataset with:

- Maintained paper order
- Proper metadata handling
- Optimized sequence length
- Efficient batching

## Hardware Requirements

- GPU: A10G or better
- VRAM: 24GB minimum
- RAM: 32GB recommended

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
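In practice, the Unsloth optimizations listed above come down to loading the pre-quantized 4-bit model through `FastLanguageModel` and attaching LoRA adapters via peft. The following is a minimal sketch of that loading step, assuming the standard Unsloth API; the LoRA rank and target modules are illustrative placeholders, not values taken from this project's `transformers_config.json`:

```python
from unsloth import FastLanguageModel

# Load the pre-quantized 4-bit Phi-4 model -- no additional quantization needed.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/phi-4-unsloth-bnb-4bit",
    max_seq_length=2048,   # matches max_seq_length in transformers_config.json
    load_in_4bit=True,
    dtype=None,            # let Unsloth pick bfloat16 on supported GPUs
)

# Attach LoRA adapters (via peft) for parameter-efficient fine-tuning.
# Rank and target modules below are illustrative, not project values.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing=True,  # memory optimization listed above
)
```

The 4-bit weights and gradient checkpointing are what keep per-GPU memory within the 24GB A10G budget described above.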
# Phase 1: Domain Adaptation (Unsupervised)

This directory contains the code and configuration for domain adaptation of the phi-4-unsloth-bnb-4bit model to the cognitive science domain. This phase produces our domain-adapted model: [George-API/phi-4-research-assistant](https://huggingface.co/George-API/phi-4-research-assistant).

## Overview

Domain adaptation is the first phase of our training process, where we expose the model to a large corpus of cognitive science texts to help it learn domain-specific vocabulary, concepts, and patterns. This phase prepares the model for the more focused supervised fine-tuning in Phase 2.

## Files

### Core Training Files

- `run_transformers_training.py`: Main script for domain adaptation
- `transformers_config.json`: Model and training parameters
- `hardware_config.json`: Hardware-specific optimizations
- `dataset_config.json`: Dataset loading and processing settings
- `requirements.txt`: Required Python packages

### Analysis & Utilities

- `check_tokenization.py`: Script to analyze token distributions
- `update_space.py`: Hugging Face Space update utility
- `.env`: Environment variables (API tokens, etc.)

## Setup

1. **Environment Setup**:

   ```bash
   python -m venv venv
   source venv/bin/activate  # or venv\Scripts\activate on Windows
   pip install -r requirements.txt
   ```

2. **Environment Variables**: Create a `.env` file with:

   ```
   HUGGINGFACE_TOKEN=your_token_here
   ```

3. **Verify Setup**:

   ```bash
   python check_tokenization.py  # Ensures tokenizer works
   ```

## How It Works

1. **Data Loading**: Loads pre-tokenized data from the Hugging Face dataset
2. **Sequential Processing**: Processes data in order, maintaining the integrity of research papers
3. **Efficient Training**: Uses the pre-quantized Unsloth 4-bit model for memory-efficient and faster training
4. **Checkpointing**: Saves regular checkpoints and pushes them to the Hub
5. **Monitoring**: Logs detailed metrics and statistics during training
6. **Model Publishing**: Pushes the trained model to the Hugging Face Hub

## Key Features

### Memory-Efficient Training

The training setup is optimized for A10G GPUs:

- Uses a pre-quantized 4-bit model (no additional quantization needed)
- Gradient checkpointing for memory efficiency
- Flash attention for faster training
- bfloat16 mixed-precision training
- Optimized batch sizes for maximum throughput

### Sequential Processing

The training script ensures that chunks from the same research paper are processed together by:

- Sorting the dataset by ID
- Using a SequentialSampler to maintain order
- Processing chunks sequentially (average 1,673 tokens per chunk)

### Data Collator

The `SimpleDataCollator` class:

- Preserves the pre-tokenized data format
- Processes each entry independently
- Provides detailed logging of processing statistics
- Handles errors gracefully

### Checkpointing

The training process:

- Saves a checkpoint every 200 steps
- Pushes to the Hub on every save
- Maintains up to 5 recent checkpoints
- Automatically resumes from the latest checkpoint if interrupted

## Hardware Requirements

This training setup is optimized for:

- 2x NVIDIA A10G GPUs (24GB VRAM each)
- 92GB system RAM
- CUDA 11.8 or higher

Memory breakdown per GPU:

- Model (4-bit): ~3.5GB
- Optimizer states: ~1GB
- Batch memory: ~2GB
- Peak usage: 18-20GB
- Safe headroom: 4-6GB

## Configuration

Key parameters in `transformers_config.json`:

- `model_name`: unsloth/phi-4-unsloth-bnb-4bit
- `learning_rate`: 2e-5
- `num_train_epochs`: 3
- `per_device_train_batch_size`: 16
- `gradient_accumulation_steps`: 4
- `effective_batch_size`: 128 (16 * 4 * 2 GPUs)
- `max_seq_length`: 2048
- `lr_scheduler_type`: "cosine"
- `warmup_ratio`: 0.03
- `neftune_noise_alpha`: 5

The configuration is optimized for:

- Maximum memory efficiency with the pre-quantized model
- Stable training with a cosine learning rate schedule
- Effective gradient updates with accumulation
- Regular checkpointing and Hub updates

## Running Domain Adaptation

To start domain adaptation:

```bash
python run_transformers_training.py
```

The script will:

1. Load the pre-quantized model and dataset
2. Apply optimized training parameters (see the sketch after this list)
3. Process the data sequentially
4. Train the model for 3 epochs
5. Save and push checkpoints to the Hub regularly
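Step 2 ("Apply optimized training parameters") corresponds roughly to building Hugging Face `TrainingArguments` from the values in `transformers_config.json`. The sketch below is illustrative and assumes the standard `transformers` API; the exact construction in `run_transformers_training.py` may differ, and `output_dir` / `logging_steps` are placeholder values:

```python
from transformers import TrainingArguments

# Values mirror the key parameters listed in the Configuration section above.
# Effective batch size: 16 per device * 4 accumulation steps * 2 GPUs = 128.
training_args = TrainingArguments(
    output_dir="phi-4-domain-adaptation",  # placeholder output path
    learning_rate=2e-5,
    num_train_epochs=3,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    neftune_noise_alpha=5,        # NEFTune noise for more robust training
    bf16=True,                    # bfloat16 mixed precision
    gradient_checkpointing=True,  # memory optimization
    save_steps=200,               # checkpoint every 200 steps
    save_total_limit=5,           # keep up to 5 recent checkpoints
    push_to_hub=True,             # push on every save (needs HUGGINGFACE_TOKEN)
    logging_steps=10,             # placeholder logging frequency
)
```

With 2 GPUs, the per-device batch size of 16 and 4 accumulation steps give the effective batch size of 128 listed in the Configuration section.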
## Using the Model

After training, you can use the domain-adapted model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the domain-adapted model
model_name = "George-API/phi-4-research-assistant"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="bfloat16",
)

# Generate text
input_text = "The hippocampus is involved in"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Chat Format Example

Phi-4 works best with its native chat template:

```python
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="George-API/phi-4-research-assistant",
    model_kwargs={"torch_dtype": "bfloat16"},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are an expert in cognitive science."},
    {"role": "user", "content": "Explain the role of the hippocampus in memory formation."},
]

outputs = pipe(messages, max_new_tokens=256)
print(outputs[0]["generated_text"])
```

## Expected Outcomes

After domain adaptation, the model should:

- Have a better understanding of cognitive science terminology
- Show improved performance on domain-specific tasks
- Be ready for supervised fine-tuning in Phase 2

## Next Steps

After completing domain adaptation:

1. Evaluate the model's performance on cognitive science texts (a quick perplexity check is sketched below)
2. Proceed to Phase 2 (Supervised Fine-Tuning)
3. Use TensorBoard to analyze training metrics
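As a starting point for step 1, perplexity on held-out cognitive science text gives a rough signal of how well the model has absorbed the domain; you would hope to see a lower value after domain adaptation than before. A minimal sketch using the published model above (the sample passage is a placeholder, not project evaluation data):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "George-API/phi-4-research-assistant"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", torch_dtype=torch.bfloat16
)
model.eval()

# Placeholder held-out passage; substitute real evaluation texts in practice.
text = (
    "Working memory is a capacity-limited system that temporarily holds "
    "information for manipulation during reasoning and comprehension."
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    # Causal LM loss with the inputs as labels is the mean negative log-likelihood.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"Perplexity: {torch.exp(loss).item():.2f}")
```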