---
title: Phi-4 Unsloth Training
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.17.0
app_file: app.py
pinned: false
license: mit
---

# Phi-4 Unsloth Optimized Training

This space is dedicated to training Microsoft's Phi-4 model using Unsloth optimizations for enhanced performance and efficiency. The training process utilizes 4-bit quantization and advanced memory optimizations.

## Installation

This Hugging Face Space automatically installs dependencies from requirements.txt. The following packages are included.

### Installation Process

The project uses a single consolidated requirements file that maintains the proper installation order of dependencies:

1. **All Dependencies (requirements.txt)**:
   - Contains all required packages in the correct installation order
   - Install with: `pip install -r requirements.txt`
   - The file is organized with clear sections:
     - Base dependencies (installed first)
     - Main dependencies
     - Optional dependencies (commented out by default)

2. **Flash Attention** (Optional):
   - For faster attention computation
   - Install with: `pip install flash-attn==2.5.2 --no-build-isolation`
   - Or uncomment the flash-attn line in requirements.txt

3. **Automated Installation**:
   - For convenience, you can use the included script:
     - Basic install: `python install_requirements.py`
     - With flash-attn: `python install_requirements.py --flash`

This approach simplifies dependency management while still maintaining the proper installation order.

### Essential Dependencies

- **unsloth** (>=2024.3): Required for optimized 4-bit training
- **peft** (>=0.9.0): Required for parameter-efficient fine-tuning
- **transformers** (>=4.36.0): For model architecture and tokenization
- **einops**: Required by Unsloth for tensor manipulation
- **sentencepiece**: Required for tokenization

### Optional Dependencies

- **flash-attn**: Optional for faster attention computation (not included by default as it can cause build issues)

## Features

- 4-bit quantization using Unsloth
- Optimized training pipeline
- Cognitive dataset integration
- Advanced memory management
- Gradient checkpointing
- Sequential data processing

## Configuration Files

- `transformers_config.json`: Model and training parameters
- `hardware_config.json`: Hardware-specific optimizations
- `dataset_config.json`: Dataset processing settings
- `requirements.txt`: Required dependencies

## Training Process

The training utilizes the following optimizations (a minimal loading sketch illustrating them appears at the end of this README, just before the Phase 1 notes):

- Unsloth's 4-bit quantization
- Custom chat templates for Phi-4
- Paper-order preservation
- Efficient memory usage
- Gradient accumulation

## Dataset

Training uses the cognitive dataset with:

- Maintained paper order
- Proper metadata handling
- Optimized sequence length
- Efficient batching

## Hardware Requirements

- GPU: A10G or better
- VRAM: 24GB minimum
- RAM: 32GB recommended

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
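In practice, the Unsloth optimizations listed above come down to loading the pre-quantized 4-bit model through `FastLanguageModel` and attaching LoRA adapters via peft. The following is a minimal sketch of that loading step, assuming the standard Unsloth API; the LoRA rank and target modules are illustrative placeholders, not values taken from this project's `transformers_config.json`:

```python
from unsloth import FastLanguageModel

# Load the pre-quantized 4-bit Phi-4 model -- no additional quantization needed.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/phi-4-unsloth-bnb-4bit",
    max_seq_length=2048,   # matches max_seq_length in transformers_config.json
    load_in_4bit=True,
    dtype=None,            # let Unsloth pick bfloat16 on supported GPUs
)

# Attach LoRA adapters (via peft) for parameter-efficient fine-tuning.
# Rank and target modules below are illustrative, not project values.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing=True,  # memory optimization listed above
)
```

The 4-bit weights and gradient checkpointing are what keep per-GPU memory within the 24GB A10G budget described above.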
# Phase 1: Domain Adaptation (Unsupervised)

This directory contains the code and configuration for domain adaptation of the phi-4-unsloth-bnb-4bit model to the cognitive science domain. This phase produces our domain-adapted model: [George-API/phi-4-research-assistant](https://huggingface.co/George-API/phi-4-research-assistant).

## Overview

Domain adaptation is the first phase of our training process, where we expose the model to a large corpus of cognitive science texts to help it learn domain-specific vocabulary, concepts, and patterns. This phase prepares the model for the more focused supervised fine-tuning in Phase 2.

## Files

### Core Training Files

- `run_transformers_training.py`: Main script for domain adaptation
- `transformers_config.json`: Model and training parameters
- `hardware_config.json`: Hardware-specific optimizations
- `dataset_config.json`: Dataset loading and processing settings
- `requirements.txt`: Required Python packages

### Analysis & Utilities

- `check_tokenization.py`: Script to analyze token distributions
- `update_space.py`: Hugging Face Space update utility
- `.env`: Environment variables (API tokens, etc.)

## Setup

1. **Environment Setup**:

   ```bash
   python -m venv venv
   source venv/bin/activate  # or venv\Scripts\activate on Windows
   pip install -r requirements.txt
   ```

2. **Environment Variables**: Create a `.env` file with:

   ```
   HUGGINGFACE_TOKEN=your_token_here
   ```

3. **Verify Setup**:

   ```bash
   python check_tokenization.py  # Ensures tokenizer works
   ```

## How It Works

1. **Data Loading**: Loads pre-tokenized data from the Hugging Face dataset
2. **Sequential Processing**: Processes data in order, maintaining the integrity of research papers
3. **Efficient Training**: Uses the pre-quantized Unsloth 4-bit model for memory-efficient and faster training
4. **Checkpointing**: Saves regular checkpoints and pushes them to the Hub
5. **Monitoring**: Logs detailed metrics and statistics during training
6. **Model Publishing**: Pushes the trained model to the Hugging Face Hub

## Key Features

### Memory-Efficient Training

The training setup is optimized for A10G GPUs:

- Uses a pre-quantized 4-bit model (no additional quantization needed)
- Gradient checkpointing for memory efficiency
- Flash attention for faster training
- bfloat16 mixed-precision training
- Optimized batch sizes for maximum throughput

### Sequential Processing

The training script ensures that chunks from the same research paper are processed together by:

- Sorting the dataset by ID
- Using a SequentialSampler to maintain order
- Processing chunks sequentially (average 1,673 tokens per chunk)

### Data Collator

The `SimpleDataCollator` class:

- Preserves the pre-tokenized data format
- Processes each entry independently
- Provides detailed logging of processing statistics
- Handles errors gracefully

### Checkpointing

The training process:

- Saves a checkpoint every 200 steps
- Pushes to the Hub on every save
- Maintains up to 5 recent checkpoints
- Automatically resumes from the latest checkpoint if interrupted

## Hardware Requirements

This training setup is optimized for:

- 2x NVIDIA A10G GPUs (24GB VRAM each)
- 92GB system RAM
- CUDA 11.8 or higher

Memory breakdown per GPU:

- Model (4-bit): ~3.5GB
- Optimizer states: ~1GB
- Batch memory: ~2GB
- Peak usage: 18-20GB
- Safe headroom: 4-6GB

## Configuration

Key parameters in `transformers_config.json`:

- `model_name`: unsloth/phi-4-unsloth-bnb-4bit
- `learning_rate`: 2e-5
- `num_train_epochs`: 3
- `per_device_train_batch_size`: 16
- `gradient_accumulation_steps`: 4
- `effective_batch_size`: 128 (16 * 4 * 2 GPUs)
- `max_seq_length`: 2048
- `lr_scheduler_type`: "cosine"
- `warmup_ratio`: 0.03
- `neftune_noise_alpha`: 5

The configuration is optimized for:

- Maximum memory efficiency with the pre-quantized model
- Stable training with a cosine learning rate schedule
- Effective gradient updates with accumulation
- Regular checkpointing and Hub updates

## Running Domain Adaptation

To start domain adaptation:

```bash
python run_transformers_training.py
```

The script will:

1. Load the pre-quantized model and dataset
2. Apply optimized training parameters (see the sketch after this list)
3. Process the data sequentially
4. Train the model for 3 epochs
5. Save and push checkpoints to the Hub regularly
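Step 2 ("Apply optimized training parameters") corresponds roughly to building Hugging Face `TrainingArguments` from the values in `transformers_config.json`. The sketch below is illustrative and assumes the standard `transformers` API; the exact construction in `run_transformers_training.py` may differ, and `output_dir` / `logging_steps` are placeholder values:

```python
from transformers import TrainingArguments

# Values mirror the key parameters listed in the Configuration section above.
# Effective batch size: 16 per device * 4 accumulation steps * 2 GPUs = 128.
training_args = TrainingArguments(
    output_dir="phi-4-domain-adaptation",  # placeholder output path
    learning_rate=2e-5,
    num_train_epochs=3,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    neftune_noise_alpha=5,        # NEFTune noise for more robust training
    bf16=True,                    # bfloat16 mixed precision
    gradient_checkpointing=True,  # memory optimization
    save_steps=200,               # checkpoint every 200 steps
    save_total_limit=5,           # keep up to 5 recent checkpoints
    push_to_hub=True,             # push on every save (needs HUGGINGFACE_TOKEN)
    logging_steps=10,             # placeholder logging frequency
)
```

With 2 GPUs, the per-device batch size of 16 and 4 accumulation steps give the effective batch size of 128 listed in the Configuration section.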
## Using the Model

After training, you can use the domain-adapted model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the domain-adapted model
model_name = "George-API/phi-4-research-assistant"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="bfloat16",
)

# Generate text
input_text = "The hippocampus is involved in"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Chat Format Example

Phi-4 works best with its native chat template:

```python
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="George-API/phi-4-research-assistant",
    model_kwargs={"torch_dtype": "bfloat16"},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are an expert in cognitive science."},
    {"role": "user", "content": "Explain the role of the hippocampus in memory formation."},
]

outputs = pipe(messages, max_new_tokens=256)
print(outputs[0]["generated_text"])
```

## Expected Outcomes

After domain adaptation, the model should:

- Have a better understanding of cognitive science terminology
- Show improved performance on domain-specific tasks
- Be ready for supervised fine-tuning in Phase 2

## Next Steps

After completing domain adaptation:

1. Evaluate the model's performance on cognitive science texts (a quick perplexity check is sketched below)
2. Proceed to Phase 2 (Supervised Fine-Tuning)
3. Use TensorBoard to analyze training metrics
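As a starting point for step 1, perplexity on held-out cognitive science text gives a rough signal of how well the model has absorbed the domain; you would hope to see a lower value after domain adaptation than before. A minimal sketch using the published model above (the sample passage is a placeholder, not project evaluation data):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "George-API/phi-4-research-assistant"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", torch_dtype=torch.bfloat16
)
model.eval()

# Placeholder held-out passage; substitute real evaluation texts in practice.
text = (
    "Working memory is a capacity-limited system that temporarily holds "
    "information for manipulation during reasoning and comprehension."
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    # Causal LM loss with the inputs as labels is the mean negative log-likelihood.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"Perplexity: {torch.exp(loss).item():.2f}")
```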