---

title: Phi-4 Unsloth Training
emoji: 🧠
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.17.0
app_file: app.py
pinned: false
license: mit
---


# Phi-4 Unsloth Optimized Training

This Space trains Microsoft's Phi-4 model with Unsloth optimizations for faster, more memory-efficient training. The training pipeline uses 4-bit quantization and advanced memory optimizations.

## Installation

This Hugging Face Space automatically installs dependencies from requirements.txt. The following packages are included:

### Essential Dependencies

- **unsloth** (>=2024.3): Required for optimized 4-bit training
- **peft** (>=0.9.0): Required for parameter-efficient fine-tuning
- **transformers** (>=4.36.0): For model architecture and tokenization
- **einops**: Required by Unsloth for tensor manipulation
- **sentencepiece**: Required for tokenization

### Optional Dependencies

- **flash-attn**: Speeds up attention computation (not installed by default because it can cause build issues)
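
To sanity-check the environment before a run, a quick import check like the following can help (a sketch; it only verifies that the packages listed above are importable):

```python
# Sketch: verify that the dependencies listed above are importable.
# Note the PyPI name flash-attn corresponds to the module name flash_attn.
import importlib.util

required = ["unsloth", "peft", "transformers", "einops", "sentencepiece"]
optional = ["flash_attn"]

missing = [name for name in required if importlib.util.find_spec(name) is None]
if missing:
    raise RuntimeError(f"Missing required packages: {missing}")

for name in optional:
    if importlib.util.find_spec(name) is None:
        print(f"Optional package not installed: {name}")
```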

## Features

- 4-bit quantization using Unsloth
- Optimized training pipeline
- Cognitive dataset integration
- Advanced memory management
- Gradient checkpointing
- Sequential data processing

## Configuration Files

- `transformers_config.json`: Model and training parameters
- `hardware_config.json`: Hardware-specific optimizations
- `dataset_config.json`: Dataset processing settings
- `requirements.txt`: Required dependencies

## Training Process

The training utilizes the following optimizations:
- Unsloth's 4-bit quantization
- Custom chat templates for Phi-4
- Paper-order preservation
- Efficient memory usage
- Gradient accumulation
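
For the custom chat-template point above, the standard `transformers` mechanism looks like this (a sketch; it assumes the tokenizer ships a chat template, and the exact template used in training is defined in the training scripts):

```python
# Sketch: render a conversation with the tokenizer's built-in chat template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("unsloth/phi-4-unsloth-bnb-4bit")
messages = [
    {"role": "system", "content": "You are an expert in cognitive science."},
    {"role": "user", "content": "Summarize this paper chunk."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```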

## Dataset

Training uses the cognitive dataset with:
- Maintained paper order
- Proper metadata handling
- Optimized sequence length
- Efficient batching

## Hardware Requirements

- GPU: A10G or better
- VRAM: 24GB minimum
- RAM: 32GB recommended
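
A quick way to confirm the visible GPU meets these requirements (a sketch using PyTorch):

```python
# Sketch: check that a CUDA GPU is visible and report its VRAM.
import torch

assert torch.cuda.is_available(), "No CUDA GPU detected"
props = torch.cuda.get_device_properties(0)
print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GiB")
```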

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Phase 1: Domain Adaptation (Unsupervised)

This directory contains the code and configuration for domain adaptation of the phi-4-unsloth-bnb-4bit model to the cognitive science domain. This phase produces our domain-adapted model: [George-API/phi-4-research-assistant](https://huggingface.co/George-API/phi-4-research-assistant).

## Overview

Domain adaptation is the first phase of our training process, where we expose the model to a large corpus of cognitive science texts to help it learn domain-specific vocabulary, concepts, and patterns. This phase prepares the model for the more focused supervised fine-tuning in Phase 2.

## Files

### Core Training Files
- `run_transformers_training.py`: Main script for domain adaptation
- `transformers_config.json`: Model and training parameters
- `hardware_config.json`: Hardware-specific optimizations
- `dataset_config.json`: Dataset loading and processing settings
- `requirements.txt`: Required Python packages

### Analysis & Utilities
- `check_tokenization.py`: Script to analyze token distributions
- `update_space.py`: Hugging Face Space update utility
- `.env`: Environment variables (API tokens, etc.)

## Setup

1. **Environment Setup**:
   ```bash
   python -m venv venv
   source venv/bin/activate  # or `venv\Scripts\activate` on Windows
   pip install -r requirements.txt
   ```

2. **Environment Variables**:
   Create `.env` file with:
   ```
   HUGGINGFACE_TOKEN=your_token_here
   ```

3. **Verify Setup**:
   ```bash
   python check_tokenization.py  # Ensures tokenizer works
   ```

## How It Works

1. **Data Loading**: Loads pre-tokenized data from the Hugging Face dataset
2. **Sequential Processing**: Processes data in order, maintaining the integrity of research papers
3. **Efficient Training**: Uses pre-quantized Unsloth 4-bit model for memory-efficient and faster training
4. **Checkpointing**: Saves regular checkpoints and pushes to Hub
5. **Monitoring**: Logs detailed metrics and statistics during training
6. **Model Publishing**: Pushes the trained model to Hugging Face Hub
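
A minimal sketch of steps 1-2 (the dataset identifier and the `id` column below are placeholders; the real names come from `dataset_config.json`):

```python
# Sketch: load the pre-tokenized dataset and keep paper chunks in order.
# The dataset id and column name are placeholders.
from datasets import load_dataset

dataset = load_dataset("your-org/cognitive-dataset", split="train")
dataset = dataset.sort("id")  # chunks from the same paper stay adjacent
print(dataset[0].keys())
```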

## Key Features

### Memory-Efficient Training

The training setup is optimized for A10G GPUs:
- Uses pre-quantized 4-bit model (no additional quantization needed)
- Gradient checkpointing for memory efficiency
- Flash attention for faster training
- bfloat16 mixed precision training
- Optimized batch sizes for maximum throughput
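
A sketch of how the pre-quantized model is typically loaded with Unsloth (the hyperparameters below are illustrative, not the project's exact values, which live in `transformers_config.json`):

```python
# Sketch: load the pre-quantized 4-bit Phi-4 model with Unsloth and attach
# LoRA adapters. Values are illustrative only.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/phi-4-unsloth-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,   # model is already pre-quantized; nothing extra to quantize
    dtype=None,          # lets Unsloth pick bfloat16 on supported GPUs
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                                  # illustrative LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",  # memory-efficient checkpointing
)
```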

### Sequential Processing

The training script ensures that chunks from the same research paper are processed together by:
- Sorting the dataset by ID
- Using a SequentialSampler to maintain order
- Processing chunks sequentially (average 1,673 tokens per chunk)
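
One way to get this ordering with `transformers.Trainer` is to return a `SequentialSampler` instead of the default shuffled sampler (a sketch; the class name is illustrative, not the project's actual code):

```python
# Sketch: disable shuffling so paper chunks are seen in dataset order.
from torch.utils.data import SequentialSampler
from transformers import Trainer

class SequentialTrainer(Trainer):
    def _get_train_sampler(self):
        # The default Trainer behaviour is a shuffled sampler; this keeps dataset order.
        return SequentialSampler(self.train_dataset)
```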

### Data Collator

The `SimpleDataCollator` class:
- Preserves pre-tokenized data format
- Processes each entry independently
- Provides detailed logging of processing statistics
- Handles errors gracefully
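
A stripped-down sketch of what a collator for pre-tokenized data can look like (this is not the project's `SimpleDataCollator`, just the general shape):

```python
# Sketch: pad pre-tokenized examples, build attention masks, and reuse the
# inputs as labels for causal-LM training. Illustrative only.
import torch

class PreTokenizedCollator:
    def __init__(self, pad_token_id: int):
        self.pad_token_id = pad_token_id

    def __call__(self, features):
        max_len = max(len(f["input_ids"]) for f in features)
        input_ids, attention_mask, labels = [], [], []
        for f in features:
            ids = list(f["input_ids"])
            pad = max_len - len(ids)
            input_ids.append(ids + [self.pad_token_id] * pad)
            attention_mask.append([1] * len(ids) + [0] * pad)
            labels.append(ids + [-100] * pad)  # -100 is ignored by the loss
        return {
            "input_ids": torch.tensor(input_ids),
            "attention_mask": torch.tensor(attention_mask),
            "labels": torch.tensor(labels),
        }
```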

### Checkpointing

The training process saves checkpoints:
- Every 200 steps
- Pushes to Hub on every save
- Maintains up to 5 recent checkpoints
- Automatically resumes from the latest checkpoint if interrupted
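
These settings map onto standard `TrainingArguments` fields roughly as follows (a sketch; the project's actual values are read from `transformers_config.json`, and `output_dir` is a placeholder):

```python
# Sketch: checkpointing-related TrainingArguments matching the behaviour above.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./outputs",   # placeholder path
    save_strategy="steps",
    save_steps=200,           # checkpoint every 200 steps
    save_total_limit=5,       # keep up to 5 recent checkpoints
    push_to_hub=True,         # push on every save
    hub_model_id="George-API/phi-4-research-assistant",
)
# Resuming after an interruption: trainer.train(resume_from_checkpoint=True)
```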

## Hardware Requirements

This training setup is optimized for:
- 2x NVIDIA A10G GPUs (24GB VRAM each)
- 92GB System RAM
- CUDA 11.8 or higher

Memory breakdown per GPU:
- Model (4-bit): ~3.5GB
- Optimizer states: ~1GB
- Batch memory: ~2GB
- Peak usage: 18-20GB
- Safe headroom: 4-6GB

## Configuration

Key parameters in `transformers_config.json`:

- `model_name`: unsloth/phi-4-unsloth-bnb-4bit
- `learning_rate`: 2e-5
- `num_train_epochs`: 3
- `per_device_train_batch_size`: 16
- `gradient_accumulation_steps`: 4
- `effective_batch_size`: 128 (16 * 4 * 2 GPUs)
- `max_seq_length`: 2048
- `lr_scheduler_type`: "cosine"
- `warmup_ratio`: 0.03
- `neftune_noise_alpha`: 5

The configuration is optimized for:
- Maximum memory efficiency with pre-quantized model
- Stable training with cosine learning rate schedule
- Effective gradient updates with accumulation
- Regular checkpointing and Hub updates
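
A sketch of reading these values back and checking the effective batch size (field names are assumed to match the parameter list above):

```python
# Sketch: load transformers_config.json and verify the effective batch size.
import json

with open("transformers_config.json") as f:
    cfg = json.load(f)

num_gpus = 2  # 2x A10G
effective = (
    cfg["per_device_train_batch_size"]
    * cfg["gradient_accumulation_steps"]
    * num_gpus
)
print(f"Effective batch size: {effective}")  # 16 * 4 * 2 = 128
```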

## Running Domain Adaptation

To start domain adaptation:

```bash
python run_transformers_training.py
```

The script will:
1. Load the pre-quantized model and dataset
2. Apply optimized training parameters
3. Process the data sequentially
4. Train the model for 3 epochs
5. Save and push checkpoints to Hub regularly

## Using the Model

After training, you can use the domain-adapted model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the domain-adapted model
model_name = "George-API/phi-4-research-assistant"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="bfloat16",
)

# Generate text
input_text = "The hippocampus is involved in"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Chat Format Example

Phi-4 works best with its native chat template:

```python
from transformers import pipeline

# Build a text-generation pipeline around the domain-adapted model
pipe = pipeline(
    "text-generation",
    model="George-API/phi-4-research-assistant",
    model_kwargs={"torch_dtype": "bfloat16"},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are an expert in cognitive science."},
    {"role": "user", "content": "Explain the role of the hippocampus in memory formation."},
]

outputs = pipe(messages, max_new_tokens=256)
print(outputs[0]["generated_text"])
```

## Expected Outcomes

After domain adaptation, the model should:
- Have a better understanding of cognitive science terminology
- Show improved performance on domain-specific tasks
- Be ready for supervised fine-tuning in Phase 2

## Next Steps

After completing domain adaptation:
1. Evaluate the model's performance on cognitive science texts
2. Proceed to Phase 2 (Supervised Fine-Tuning)
3. Use TensorBoard to analyze training metrics