# Phi-4 Training Space Deployment Checklist

## Critical Configuration Review

Before updating the Hugging Face Space, verify each of these items to prevent deployment issues:

### 1. Model Configuration ✓

- [ ] Confirmed model name in transformers_config.json: `unsloth/phi-4-unsloth-bnb-4bit`
- [ ] BF16 precision enabled, FP16 disabled (`"bf16": true, "fp16": false`)
- [ ] Chat template correctly set to `"phi"` in config
- [ ] LoRA parameters properly configured:
  - [ ] `r`: 32
  - [ ] `lora_alpha`: 16
  - [ ] `target_modules`: all required attention modules included
- [ ] Max sequence length matches dataset needs (default: 2048)

### 2. GPU & Memory Management ✓

- [ ] Per-device batch size set to 16 or lower
- [ ] Gradient accumulation steps set to 3 or higher
- [ ] Device mapping set to "auto" for multi-GPU
- [ ] Max memory limit set to 85% of each GPU's capacity
- [ ] `PYTORCH_CUDA_ALLOC_CONF` includes `"expandable_segments:True"`
- [ ] Gradient checkpointing enabled (`"gradient_checkpointing": true`)
- [ ] Dataloader workers reduced to 2 (from 4)
- [ ] FSDP configuration enabled for multi-GPU setups

### 3. Dataset Handling ✓

- [ ] Dataset configuration correctly specified in dataset_config.json
- [ ] Conversation structure preserved (id + conversations fields)
- [ ] SimpleDataCollator configured to use apply_chat_template
- [ ] No re-ordering or sorting of the dataset (preserves original order)
- [ ] Sequential sampler used in dataloader (no shuffling)
- [ ] Max sequence length of 2048 applied
- [ ] Format validation enabled for the first few examples

### 4. Dependency Management ✓

- [ ] requirements.txt includes all necessary packages:
  - [ ] unsloth
  - [ ] peft
  - [ ] bitsandbytes
  - [ ] einops
  - [ ] sentencepiece
  - [ ] datasets
  - [ ] transformers
- [ ] Optional packages marked as such (e.g., flash-attn)
- [ ] Dependency version constraints avoid known conflicts

### 5. Error Handling & Logging ✓

- [ ] Proper error catching for dataset loading
- [ ] Fallback mechanisms for chat template application
- [ ] Clear, concise log messages that work with the HF Space interface
- [ ] Memory usage tracking at key points (start, end, periodic)
- [ ] Third-party loggers set to WARNING to reduce noise
- [ ] Low-verbosity log format for better HF Space compatibility

### 6. Training Setup ✓

- [ ] Number of epochs properly configured (default: 3)
- [ ] Learning rate appropriate (default: 2e-5)
- [ ] Warmup ratio set (default: 0.05)
- [ ] Checkpointing frequency set to a reasonable value (default: 100 steps)
- [ ] Output directory correctly configured
- [ ] Hugging Face Hub parameters set correctly if pushing models (see the sketch below for how these defaults map onto training arguments)
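For reference, the following is a minimal sketch (not the Space's actual training script) of how the defaults from sections 1, 2, and 6 would map onto `transformers.TrainingArguments`. The `output_dir` value and keeping `push_to_hub` off are illustrative assumptions; the remaining values mirror the checklist.

```python
# Sketch only: checklist defaults expressed as TrainingArguments.
# output_dir and push_to_hub are assumptions; all other values come from the checklist.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./outputs",             # assumed path; use the Space's configured output directory
    num_train_epochs=3,                 # section 6 default
    learning_rate=2e-5,                 # section 6 default
    warmup_ratio=0.05,                  # section 6 default
    per_device_train_batch_size=16,     # section 2: 16 or lower
    gradient_accumulation_steps=3,      # section 2: 3 or higher
    bf16=True,                          # section 1: BF16 enabled, FP16 disabled
    fp16=False,
    gradient_checkpointing=True,        # section 2
    dataloader_num_workers=2,           # section 2: reduced from 4
    save_strategy="steps",
    save_steps=100,                     # section 6: checkpoint every 100 steps
    push_to_hub=False,                  # enable only if Hub parameters are configured
)
```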
### 7. Pre-Flight Verification ✓

- [ ] No linting errors or indentation issues
- [ ] Updated config values are consistent across files
- [ ] Batch size × gradient accumulation steps × number of GPUs yields a reasonable effective batch size
- [ ] Verified that requirements.txt matches actual imports in code
- [ ] Confirmed tokenizer settings match the model requirements

---

## Last-Minute Configuration Changes

If you've made any configuration changes, record them here before deployment:

| Date | Parameter Changed | Old Value | New Value | Reason | Reviewer |
|------|-------------------|-----------|-----------|--------|----------|
|      |                   |           |           |        |          |
|      |                   |           |           |        |          |

---

## Deployment Notes

**Current Space Hardware**: 4× NVIDIA L4 GPUs (24GB VRAM each)

**Expected Training Speed**: ~XXX examples/second with current configuration

**Memory Requirements**: Peak usage expected to be ~20GB per GPU

**Common Issues to Watch For**:

- OOM errors on GPU 0: if seen, reduce the batch size by 2 and increase gradient accumulation by 1
- Imbalanced GPU usage: check device mapping and FSDP configuration
- Slow training: verify that all GPUs are being utilized efficiently
- Log flooding: reduce verbosity of component logs (transformers, datasets, etc.)

---

*Last Updated: 2025-03-09*
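As a companion to the pre-flight checklist and the deployment notes above, here is a minimal verification sketch that checks the precision flags and the effective batch size before pushing to the Space. The JSON key names are assumptions about the schema of transformers_config.json; adjust them to match the actual file.

```python
# Sketch only: pre-flight sanity checks against transformers_config.json.
# Key names below are assumed, not taken from the actual config schema.
import json

NUM_GPUS = 4  # current Space hardware: 4x NVIDIA L4

with open("transformers_config.json") as f:
    cfg = json.load(f)

# Section 1: BF16 enabled, FP16 disabled
assert cfg.get("bf16") is True and cfg.get("fp16") is False, "Use BF16, not FP16"

# Section 2: batch size and gradient accumulation limits
per_device = cfg.get("per_device_train_batch_size", 16)
grad_accum = cfg.get("gradient_accumulation_steps", 3)
assert per_device <= 16, "Per-device batch size should be 16 or lower"
assert grad_accum >= 3, "Gradient accumulation steps should be 3 or higher"

# Section 7: effective batch size across all GPUs
effective = per_device * grad_accum * NUM_GPUS
print(f"Effective batch size: {per_device} x {grad_accum} x {NUM_GPUS} = {effective}")
```

If an OOM appears on GPU 0, re-run this check after applying the mitigation noted above (batch size −2, gradient accumulation +1) to confirm the effective batch size remains reasonable.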