# Phi-4 Training: Critical Deployment Checklist

## Essential Configuration Requirements

### 1. Model Configuration
- [ ] Model name: `unsloth/phi-4-unsloth-bnb-4bit`
- [ ] BF16 precision enabled, FP16 disabled
- [ ] Appropriate sequence length (2048)
- [ ] LoRA parameters correctly configured (r: 32, alpha: 16)
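
The model-configuration items above can be captured as a small config sketch. This is illustrative only: the field names mirror the checklist, not the exact schema of any particular trainer, so map them onto your actual loader/trainer arguments.

```python
# Illustrative model/LoRA configuration mirroring the checklist.
# Field names are for documentation, not an exact trainer schema.
model_config = {
    "model_name": "unsloth/phi-4-unsloth-bnb-4bit",
    "bf16": True,            # BF16 precision enabled...
    "fp16": False,           # ...FP16 explicitly disabled
    "max_seq_length": 2048,  # appropriate sequence length
    "lora": {"r": 32, "alpha": 16},
}

# Guard against a common mistake: enabling both precision flags at once.
assert model_config["bf16"] != model_config["fp16"], "enable exactly one of bf16/fp16"
```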

### 2. Hardware & Resource Management
- [ ] Per-device batch size ≤ 16
- [ ] Gradient accumulation steps ≥ 3
- [ ] Gradient checkpointing enabled
- [ ] Memory usage limits properly set (85% of GPU capacity)
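
A quick sanity check on what these limits imply. The batch figures come from the checklist; the GPU count and per-card VRAM come from the hardware note at the end of this document.

```python
# Arithmetic check on the hardware limits (4x NVIDIA L4, 24 GB each).
per_device_batch = 16   # per-device batch size <= 16
grad_accum_steps = 3    # gradient accumulation steps >= 3
num_gpus = 4            # current hardware: 4x NVIDIA L4

# Effective global batch size seen by the optimizer per step.
effective_batch = per_device_batch * grad_accum_steps * num_gpus  # 192

# 85% memory cap on a 24 GB card, consistent with the ~20 GB peak
# check in the verification table.
memory_cap_gb = 24 * 0.85
```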

### 3. Critical Dataset Handling Rules
- [ ] **NO REORDERING of dataset entries** - original order must be preserved
- [ ] **NO COMBINING of separate entries** - each entry must remain distinct
- [ ] **SEQUENTIAL PROCESSING required** - entries must be processed one after another
- [ ] `sort_by_id` and `maintain_paper_order` flags properly set to preserve data sequence
- [ ] Sequential sampler used with no shuffling (`"shuffle": false`)
- [ ] Dataset sequential integrity verified with validation samples
- [ ] Conversation structure preserved (original format maintained)
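
One way to verify sequential integrity with validation samples is a check like the following. `verify_sequential_order` is a hypothetical helper, not part of the training code; it assumes each entry carries a monotonically increasing ID in the original dataset order.

```python
# Hypothetical integrity check: entry IDs sampled from the processed
# batches must appear strictly in their original order. Any shuffling,
# reordering, or merging of entries breaks this invariant.
def verify_sequential_order(entry_ids):
    """Return True if entry_ids are strictly increasing (order preserved)."""
    return all(a < b for a, b in zip(entry_ids, entry_ids[1:]))

assert verify_sequential_order([0, 1, 2, 3])      # preserved order passes
assert not verify_sequential_order([0, 2, 1, 3])  # any shuffle fails
```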

### 4. Essential Error Handling
- [ ] Clear error catching for dataset loading issues
- [ ] Memory tracking at key training points
- [ ] Low-verbosity logging for HF Space compatibility
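
The error-handling items can be sketched as a thin guard around dataset loading. `load_dataset_fn` is a stand-in for the project's actual loader; the WARNING log level keeps output quiet enough for a Hugging Face Space.

```python
import logging

# Low-verbosity logging for HF Space compatibility.
logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("train")

def load_with_guard(load_dataset_fn, path):
    """Call the dataset loader, logging a clear error before re-raising."""
    try:
        return load_dataset_fn(path)
    except (OSError, ValueError) as err:
        log.error("Dataset loading failed for %s: %s", path, err)
        raise
```

Memory tracking at key training points would follow the same pattern: log a snapshot before and after loading, and at each checkpoint.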

### 5. Training Core Requirements
- [ ] Appropriate learning rate (2e-5)
- [ ] Proper checkpointing frequency
- [ ] Hub settings correctly configured for model saving
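
As with the model configuration, these can be written down as a sketch. The field names below follow common Hugging Face `TrainingArguments` conventions, but the checkpoint frequency and Hub model ID are placeholders to be replaced with the project's real values.

```python
# Illustrative training-core settings; verify field names against your
# trainer version before use.
training_config = {
    "learning_rate": 2e-5,          # per the checklist
    "save_steps": 100,              # placeholder checkpoint frequency
    "push_to_hub": True,            # Hub saving enabled
    "hub_model_id": "your-org/phi-4-finetune",  # placeholder ID
}
```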

---

## Pre-Deployment Verification

| Requirement | Status | Notes |
|-------------|--------|-------|
| Data sequential integrity | | Confirm entries processed in order |
| GPU memory within limits | | Check peak memory doesn't exceed 20GB per GPU |
| Training batch verification | | Verify first few batches maintain proper order |

---

**Current Hardware**: 4× NVIDIA L4 GPUs (24GB VRAM each)

**CRITICAL REMINDER**: Data sequence preservation is the highest priority - any shuffling, reordering, or combining of entries will compromise model quality.

*Last Updated: 2025-03-09*